1
|
Pradhan UK, Meher PK, Naha S, Sharma NK, Agarwal A, Gupta A, Parsad R. DBPMod: a supervised learning model for computational recognition of DNA-binding proteins in model organisms. Brief Funct Genomics 2024; 23:363-372. [PMID: 37651627 DOI: 10.1093/bfgp/elad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2023] [Revised: 08/09/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
DNA-binding proteins (DBPs) play critical roles in many biological processes, including gene expression, DNA replication, recombination and repair. Understanding the molecular mechanisms underlying these processes depends on the precise identification of DBPs. In recent times, several computational methods have been developed to identify DBPs. However, because these models are generic, they cannot identify species-specific DBPs with high accuracy. Therefore, a species-specific computational model is needed to predict species-specific DBPs. In this paper, we introduce the computational DBPMod method, which makes use of a machine learning approach to identify species-specific DBPs. For prediction, both shallow learning algorithms and deep learning models were used, with shallow learning models achieving higher accuracy. Additionally, the evolutionary features outperformed sequence-derived features in terms of accuracy. Five model organisms, namely Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens and Mus musculus, were used to assess the performance of DBPMod. Five-fold cross-validation and independent test set analyses were used to evaluate the prediction accuracy in terms of area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC), which were found to be ~89-92% and ~89-95%, respectively. The comparative results demonstrate that DBPMod outperforms 12 current state-of-the-art computational approaches in identifying the DBPs for all five model organisms. We further developed a DBPMod web server to make it easier for researchers to detect DBPs; it is publicly available at https://iasri-sg.icar.gov.in/dbpmod/. DBPMod is expected to be an invaluable tool for discovering DBPs, supplementing the current experimental and computational methods.
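As a reading aid for the auROC metric reported above, the following is a minimal sketch that computes the area under the ROC curve through its rank interpretation (the fraction of positive/negative pairs that the scores order correctly). The function name `au_roc` and the toy labels and scores are illustrative assumptions and are not taken from DBPMod.

```python
def au_roc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney pair-counting view.

    labels: sequence of 0/1 values (1 = positive class, e.g. a DBP)
    scores: sequence of floats, higher means "more likely positive"
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three positives and three negatives.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(au_roc(toy_labels, toy_scores))
```

The pairwise loop is quadratic in the number of examples, which keeps the sketch easy to check by hand; a sort-based rank sum would be the usual choice for large test sets.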
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Nitesh K Sharma
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, USA
| | - Aarushi Agarwal
- Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh 201313, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
|
2
|
Mendapara K. Development and evaluation of a chronic kidney disease risk prediction model using random forest. Front Genet 2024; 15:1409755. [PMID: 38993480 PMCID: PMC11236722 DOI: 10.3389/fgene.2024.1409755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Accepted: 05/29/2024] [Indexed: 07/13/2024] Open
Abstract
This research aims to advance the detection of Chronic Kidney Disease (CKD) through a novel gene-based predictive model, leveraging recent breakthroughs in gene sequencing. We sourced and merged gene expression profiles of CKD-affected renal tissues from the Gene Expression Omnibus (GEO) database, dividing them into training and validation sets in a 7:3 ratio. The training set included 141 CKD and 33 non-CKD specimens, while the validation set had 60 and 14, respectively. The disease risk prediction model was constructed using the training dataset, while the validation dataset confirmed the model's identification capabilities. The development of our predictive model began with evaluating differentially expressed genes (DEGs) between the two groups. Using Lasso and random forest (RF) methods, we isolated six genes (DUSP1, GADD45B, IFI44L, IFI30, ATF3, and LYZ) that are critical in differentiating CKD from non-CKD tissues. We refined our RF model through 10-fold cross-validation, repeated five times, to optimize the mtry parameter. The performance of our model was robust, with an average AUC of 0.979 across the folds, translating to a 91.18% accuracy. Validation tests further confirmed its efficacy, with a 94.59% accuracy and an AUC of 0.990. External validation using dataset GSE180394 yielded an AUC of 0.913, 89.83% accuracy, and a sensitivity of 0.889, underscoring the model's reliability. In summary, the study identified critical genetic biomarkers and successfully developed a novel disease risk prediction model for CKD. This model can serve as a valuable tool for CKD risk assessment and contribute significantly to CKD identification.
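The 7:3 split described above can be illustrated with a short sketch. It assumes a stratified split (each class divided separately), which the abstract does not state; the function `stratified_split`, the seed, and the label strings are hypothetical. The class totals (201 CKD, 47 non-CKD) are simply the sums of the reported training and validation counts.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, valid_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Round the per-class training size to the nearest whole sample.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        valid_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(valid_idx)


# Class totals implied by the abstract: 141 + 60 CKD, 33 + 14 non-CKD.
labels = ["CKD"] * 201 + ["non-CKD"] * 47
train_idx, valid_idx = stratified_split(labels)
classes = ("CKD", "non-CKD")
train_counts = {c: sum(labels[i] == c for i in train_idx) for c in classes}
valid_counts = {c: sum(labels[i] == c for i in valid_idx) for c in classes}
print(train_counts, valid_counts)
```

Under this assumption the per-class counts come out as 141/33 for training and 60/14 for validation, which matches the figures in the abstract; which specimens land in each set depends on the seed.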
Affiliation(s)
- Krish Mendapara
- Faculty of Health Sciences, Queen's University, Kingston, ON, Canada
| |
|
3
|
Pradhan UK, Meher PK, Naha S, Das R, Gupta A, Parsad R. ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins. Protein Sci 2024; 33:e5015. [PMID: 38747369 PMCID: PMC11094783 DOI: 10.1002/pro.5015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 04/18/2024] [Accepted: 04/21/2024] [Indexed: 05/19/2024]
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in governing gene regulation, DNA replication, and various cellular functions. Accurate computational models for predicting prokaryotic DBPs hold immense promise in accelerating the discovery of novel proteins, fostering a deeper understanding of prokaryotic biology, and facilitating the development of targeted therapeutics for potential disease interventions. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for prediction of prokaryotic DBPs. For prediction, a total of nine shallow learning algorithms and five deep learning models were utilized, with the shallow learning models demonstrating higher performance metrics compared to their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via the random forest variable importance measure (RF-VIM), yielded the highest five-fold cross-validation accuracy. The model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. Additionally, ProkDBP demonstrated substantial performance with an independent dataset, achieving high auROC (0.9332) and auPRC (0.9371) values. Notably, when benchmarked against several cutting-edge existing models, ProkDBP showcased superior predictive accuracy. Furthermore, to promote accessibility and usability, ProkDBP (https://iasri-sg.icar.gov.in/prokdbp/) is available as an online prediction tool, enabling free access to interested users. This tool stands as a significant contribution, enhancing the repertoire of resources for accurate and efficient prediction of prokaryotic DBPs.
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| | - Rajender Parsad
- ICAR‐Indian Agricultural Statistics Research Institute, PUSA, New Delhi, India
| |
|
4
|
Karimi-Fard A, Saidi A, TohidFar M, Emami SN. Novel candidate genes for environmental stresses response in Synechocystis sp. PCC 6803 revealed by machine learning algorithms. Braz J Microbiol 2024; 55:1219-1229. [PMID: 38705959 PMCID: PMC11153407 DOI: 10.1007/s42770-024-01338-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Accepted: 04/03/2024] [Indexed: 05/07/2024] Open
Abstract
Cyanobacteria have developed acclimation strategies to adapt to harsh environments, making them a model organism. Understanding the molecular mechanisms of tolerance to abiotic stresses can help elucidate how cells change their gene expression patterns in response to stress. Recent advances in sequencing techniques and bioinformatics analysis methods have led to the discovery of many genes involved in stress response in organisms. Synechocystis sp. PCC 6803 is a suitable microorganism for studying transcriptome response under environmental stress. Therefore, for the first time, we employed two effective feature selection techniques, namely support vector machine recursive feature elimination (SVM-RFE) and LASSO (Least Absolute Shrinkage and Selection Operator), to pinpoint the crucial genes responsive to environmental stresses in Synechocystis sp. PCC 6803. We applied these machine learning algorithms to analyze the transcriptomic data of Synechocystis sp. PCC 6803 under distinct conditions, encompassing light, salt and iron stress conditions. Seven candidate genes, namely sll1862, slr0650, sll0760, slr0091, ssl3044, slr1285, and slr1687, were selected by both LASSO and SVM-RFE algorithms. RNA-seq analysis was performed to validate the efficiency of our feature selection approach in selecting the most important genes. The RNA-seq analysis revealed significantly high expression for five genes, namely sll1862, slr1687, ssl3044, slr1285, and slr0650, under ion stress condition. Among these five genes, ssl3044 and slr0650 could be introduced as new potential candidate genes for further confirmatory genetic studies to determine their roles in the response to abiotic stresses.
Affiliation(s)
- Abbas Karimi-Fard
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran
| | - Abbas Saidi
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran.
| | - Masoud TohidFar
- Department of Cell and Molecular Biology, Faculty of Life Sciences and Biotechnology, Shahid Beheshti University, Tehran, Iran.
| | - Seyedeh Noushin Emami
- Department of Molecular Biosciences, Wenner-Gren Institute, Stockholm University, Stockholm, Sweden
| |
|
5
|
Bai Z, Bartelo N, Aslam M, Murphy EA, Hale CR, Blachere NE, Parveen S, Spolaore E, DiCarlo E, Gravallese EM, Smith MH, Frank MO, Jiang CS, Zhang H, Pyrgaki C, Lewis MJ, Sikandar S, Pitzalis C, Lesnak JB, Mazhar K, Price TJ, Malfait AM, Miller RE, Zhang F, Goodman S, Darnell RB, Wang F, Orange DE. Synovial fibroblast gene expression is associated with sensory nerve growth and pain in rheumatoid arthritis. Sci Transl Med 2024; 16:eadk3506. [PMID: 38598614 DOI: 10.1126/scitranslmed.adk3506] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 03/21/2024] [Indexed: 04/12/2024]
Abstract
It has been presumed that rheumatoid arthritis (RA) joint pain is related to inflammation in the synovium; however, recent studies reveal that pain scores in patients do not correlate with synovial inflammation. We developed a machine-learning approach (graph-based gene expression module identification or GbGMI) to identify an 815-gene expression module associated with pain in synovial biopsy samples from patients with established RA who had limited synovial inflammation at arthroplasty. We then validated this finding in an independent cohort of synovial biopsy samples from patients who had early untreated RA with little inflammation. Single-cell RNA sequencing analyses indicated that most of these 815 genes were most robustly expressed by lining layer synovial fibroblasts. Receptor-ligand interaction analysis predicted cross-talk between human lining layer fibroblasts and human dorsal root ganglion neurons expressing calcitonin gene-related peptide (CGRP+). Both RA synovial fibroblast culture supernatant and netrin-4, which is abundantly expressed by lining fibroblasts and was within the GbGMI-identified pain-associated gene module, increased the branching of pain-sensitive murine CGRP+ dorsal root ganglion neurons in vitro. Imaging of solvent-cleared synovial tissue with little inflammation from humans with RA revealed CGRP+ pain-sensing neurons encasing blood vessels growing into synovial hypertrophic papilla. Together, these findings support a model whereby synovial lining fibroblasts express genes associated with pain that enhance the growth of pain-sensing neurons into regions of synovial hypertrophy in RA.
Affiliation(s)
- Zilong Bai
- Weill Cornell Medicine, New York, NY 10065, USA
| | | | | | | | - Caryn R Hale
- Rockefeller University, New York, NY 10065, USA
- Memorial Sloan Kettering Cancer Center, New York, NY 10065, USA
| | - Nathalie E Blachere
- Rockefeller University, New York, NY 10065, USA
- Howard Hughes Medical Institute, Rockefeller University, New York, NY 10065, USA
| | | | | | | | | | | | | | | | | | | | - Myles J Lewis
- Queen Mary University of London & NIHR BRC Barts Health NHS Trust, London E1 4NS, UK
| | - Shafaq Sikandar
- Queen Mary University of London & NIHR BRC Barts Health NHS Trust, London E1 4NS, UK
| | - Costantino Pitzalis
- Queen Mary University of London & NIHR BRC Barts Health NHS Trust, London E1 4NS, UK
- Department of Biomedical Sciences, Humanitas University & IRCC Humanitas Research Hospital, Milan 20072, Italy
| | | | | | | | | | | | - Fan Zhang
- University of Colorado School of Medicine, Aurora, CO 80045, USA
| | - Susan Goodman
- Hospital for Special Surgery, New York, NY 10021, USA
| | - Robert B Darnell
- Rockefeller University, New York, NY 10065, USA
- Howard Hughes Medical Institute, Rockefeller University, New York, NY 10065, USA
| | - Fei Wang
- Weill Cornell Medicine, New York, NY 10065, USA
| | - Dana E Orange
- Rockefeller University, New York, NY 10065, USA
- Hospital for Special Surgery, New York, NY 10021, USA
| |
|
6
|
Mukherjee A, Abraham S, Singh A, Balaji S, Mukunthan KS. From Data to Cure: A Comprehensive Exploration of Multi-omics Data Analysis for Targeted Therapies. Mol Biotechnol 2024:10.1007/s12033-024-01133-6. [PMID: 38565775 DOI: 10.1007/s12033-024-01133-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Accepted: 02/27/2024] [Indexed: 04/04/2024]
Abstract
In the dynamic landscape of targeted therapeutics, drug discovery has pivoted towards understanding underlying disease mechanisms, placing a strong emphasis on molecular perturbations and target identification. This paradigm shift, crucial for drug discovery, is underpinned by big data, a transformative force in the current era. Omics data, characterized by its heterogeneity and enormity, has ushered biological and biomedical research into the big data domain. Acknowledging the significance of integrating diverse omics data strata, known as multi-omics studies, researchers delve into the intricate interrelationships among various omics layers. This review navigates the expansive omics landscape, showcasing tailored assays for each molecular layer through genomes to metabolomes. The sheer volume of data generated necessitates sophisticated informatics techniques, with machine-learning (ML) algorithms emerging as robust tools. These datasets not only refine disease classification but also enhance diagnostics and foster the development of targeted therapeutic strategies. Through the integration of high-throughput data, the review focuses on targeting and modeling multiple disease-regulated networks, validating interactions with multiple targets, and enhancing therapeutic potential using network pharmacology approaches. Ultimately, this exploration aims to illuminate the transformative impact of multi-omics in the big data era, shaping the future of biological research.
Affiliation(s)
- Arnab Mukherjee
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | - Suzanna Abraham
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | - Akshita Singh
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | - S Balaji
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | - K S Mukunthan
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India.
| |
|
7
|
Okimoto LYS, Mendonca-Neto R, Nakamura FG, Nakamura EF, Fenyö D, Silva CT. Few-shot genes selection: subset of PAM50 genes for breast cancer subtypes classification. BMC Bioinformatics 2024; 25:92. [PMID: 38429657 PMCID: PMC10908178 DOI: 10.1186/s12859-024-05715-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/21/2024] [Indexed: 03/03/2024] Open
Abstract
BACKGROUND In recent years, researchers have made significant strides in understanding the heterogeneity of breast cancer and its various subtypes. However, the wealth of genomic and proteomic data available today necessitates efficient frameworks, instruments, and computational tools for meaningful analysis. Despite its success as a prognostic tool, the PAM50 gene signature's reliance on many genes presents challenges in terms of cost and complexity. Consequently, there is a need for more efficient methods to accurately classify breast cancer subtypes using a reduced gene set. RESULTS This study explores the potential of achieving precise breast cancer subtype categorization using a reduced gene set derived from the PAM50 gene signature. By employing a "Few-Shot Genes Selection" method, we randomly select smaller subsets from PAM50 and evaluate their performance using metrics and a linear model, specifically the Support Vector Machine (SVM) classifier. In addition, we aim to assess whether a more compact gene set can maintain performance while simplifying the classification process. Our findings demonstrate that certain reduced gene subsets can perform comparably to, or better than, the full PAM50 gene signature. CONCLUSIONS The identified gene subsets, with 36 genes, have the potential to contribute to the development of more cost-effective and streamlined diagnostic tools in breast cancer research and clinical settings.
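The random subset selection described in this abstract can be sketched as a simple random search. The sketch below is an illustration under stated assumptions, not the authors' procedure: the gene names are placeholders rather than the real PAM50 list, `random_subset_search` is a hypothetical helper, and `toy_score` stands in for the SVM-based evaluation on expression data that the abstract describes.

```python
import math
import random


def random_subset_search(genes, subset_size, n_trials, score_fn, seed=0):
    """Draw random gene subsets and keep the best-scoring one."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_trials):
        subset = tuple(sorted(rng.sample(genes, subset_size)))
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


# Placeholder names standing in for the 50 signature genes (assumption).
genes = [f"gene_{i:02d}" for i in range(1, 51)]
# Hypothetical "informative" genes used only by the toy scorer below.
informative = set(genes[:10])


def toy_score(subset):
    """Stand-in for a classifier's validation score (assumption)."""
    return len(informative.intersection(subset)) / len(informative)


best_subset, best_score = random_subset_search(genes, 36, 200, toy_score)
print(len(best_subset), best_score)
# Number of distinct 36-gene subsets of a 50-gene signature.
print(math.comb(50, 36))
```

Random sampling is used because the number of possible 36-gene subsets is far too large to enumerate; in a real study the scorer would be replaced by cross-validated classifier performance.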
Affiliation(s)
- Leandro Y S Okimoto
- Institute of Computing, Universidade Federal do Amazonas, Manaus, BR, Brazil.
| | - Rayol Mendonca-Neto
- Institute of Computing, Universidade Federal do Amazonas, Manaus, BR, Brazil
| | - Fabíola G Nakamura
- Institute of Computing, Universidade Federal do Amazonas, Manaus, BR, Brazil
| | - Eduardo F Nakamura
- Institute of Computing, Universidade Federal do Amazonas, Manaus, BR, Brazil
| | | | | |
|
8
|
Turfan D, Altunkaynak B, Yeniay Ö. A New Filter Approach Based on Effective Ranges for Classification of Gene Expression Data. BIG DATA 2023. [PMID: 37668992 DOI: 10.1089/big.2022.0086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/06/2023]
Abstract
Over the years, many studies have been carried out to reduce and eliminate the effects of diseases on human health. Gene expression data sets play a critical role in diagnosing and treating diseases. These data sets consist of thousands of genes and a small number of samples. This situation creates the curse of dimensionality, which makes such data sets difficult to analyze. One of the most effective strategies to solve this problem is feature selection. Feature selection is a preprocessing step that improves classification accuracy by selecting the most relevant and informative features. In this article, we propose a new statistically based filter method for feature selection named Effective Range-based Feature Selection Algorithm (FSAER). As an extension of the previous Effective Range based Gene Selection (ERGS) and Improved Feature Selection based on Effective Range (IFSER) algorithms, our novel method includes the advantages of both methods while taking into account the disjoint area. To illustrate the efficacy of the proposed algorithm, experiments have been conducted on six benchmark gene expression data sets. The results of FSAER and the other filter methods have been compared in terms of classification accuracy to demonstrate the effectiveness of the proposed method. For classification, support vector machines, the naive Bayes classifier, and k-nearest neighbor algorithms have been used.
Affiliation(s)
- Derya Turfan
- Department of Statistics, Hacettepe University, Ankara, Turkey
| | | | - Özgür Yeniay
- Department of Statistics, Hacettepe University, Ankara, Turkey
| |
|
9
|
Hybrid Filter and Genetic Algorithm-Based Feature Selection for Improving Cancer Classification in High-Dimensional Microarray Data. Processes (Basel) 2023. [DOI: 10.3390/pr11020562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2023] Open
Abstract
The advancements in intelligent systems have contributed tremendously to the fields of bioinformatics, health, and medicine. Intelligent classification and prediction techniques have been used in studying microarray datasets, which record gene expression levels, to assist greatly in diagnosing chronic diseases, such as cancer at an early stage, which is important and challenging. However, the high dimensionality and noisy nature of microarray data lead to slow performance and low cancer classification accuracy when using machine learning techniques. In this paper, a hybrid filter-genetic feature selection approach is proposed to address the high dimensionality of microarray datasets, which ultimately improves cancer classification performance. First, filter feature selection methods including information gain, information gain ratio, and Chi-squared are applied to select the most significant features of cancerous microarray datasets. Then, a genetic algorithm is employed to further optimize the selected features in order to improve the proposed method’s capability for cancer classification. To test the proficiency of the proposed scheme, four cancerous microarray datasets were used in the study: breast, lung, central nervous system, and brain cancer datasets. The experimental results show that the proposed hybrid filter-genetic feature selection approach improved the performance of several common machine learning methods in terms of Accuracy, Recall, Precision, and F-measure.
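The filter stage mentioned in this abstract can be illustrated with information gain, one of the three filters it names. The sketch below assumes expression values have already been discretized into categories, and it covers only the ranking step; the genetic algorithm stage is not shown. The function names, gene names, and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after grouping samples by feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(columns, labels, k):
    """Rank features by information gain and return the k best names."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): four samples, three discretized genes.
toy_labels = [1, 1, 0, 0]
toy_columns = {
    "gene_a": ["high", "high", "low", "low"],   # separates the classes exactly
    "gene_b": ["high", "low", "high", "low"],   # carries no class information
    "gene_c": ["high", "high", "high", "low"],  # partially informative
}
print(top_k_features(toy_columns, toy_labels, 2))
```

In a hybrid scheme of the kind the abstract describes, the features kept by such a filter would form the reduced search space handed to the genetic algorithm.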
|
10
|
Alharbi F, Vakanski A. Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. Bioengineering (Basel) 2023; 10:bioengineering10020173. [PMID: 36829667 PMCID: PMC9952758 DOI: 10.3390/bioengineering10020173] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 12/22/2022] [Revised: 01/24/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023] Open
Abstract
Cancer is a term that denotes a group of diseases caused by the abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic acid (DNA) microarrays and ribonucleic acid (RNA)-sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the application of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, as well as convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, we review pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
|
11
|
Identification of Biclusters in Huntington’s Disease Dataset Using a New Variant of Grey Wolf Optimizer. JOURNAL OF THE INSTITUTION OF ENGINEERS (INDIA): SERIES B 2022. [PMCID: PMC9640792 DOI: 10.1007/s40031-022-00815-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
Biclustering is a useful technique to identify subgroups of genes that have the same type of expression characteristics with respect to some conditions in microarray gene expression data. This is a complex problem where meta-heuristic algorithms are more suitable to explore the large datasets for finding biclusters of optimal quality. In this paper, we attempt, for the first time, to select biclusters with respect to shifting and scaling behaviors in a Huntington's disease dataset by applying the Grey Wolf Optimizer (GWO) along with a proposed modified version, the Enhanced Search Grey Wolf Optimizer (ES-GWO). ES-GWO incorporates strategies that make the search process more balanced with respect to exploration and exploitation compared to the state-of-the-art techniques (GWO, RM-GWO). The efficacy of ES-GWO is validated on several benchmark instances and compared with the existing meta-heuristic techniques (PSO, HS, Firefly, ABC and DE) based on convergence quality. Finally, the top 5 of the 100 biclusters produced by ES-GWO were selected, and the 7 genes common to those 5 biclusters proved to be biologically significant.
|
12
|
Guryleva MV, Penzar DD, Chistyakov DV, Mironov AA, Favorov AV, Sergeeva MG. Investigation of the Role of PUFA Metabolism in Breast Cancer Using a Rank-Based Random Forest Algorithm. Cancers (Basel) 2022; 14:cancers14194663. [PMID: 36230586 PMCID: PMC9562210 DOI: 10.3390/cancers14194663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2022] [Revised: 09/15/2022] [Accepted: 09/21/2022] [Indexed: 11/16/2022] Open
Abstract
Simple Summary: Polyunsaturated fatty acids (PUFAs) and their derivatives, oxylipins, are a constant focus of cancer research due to the relationship between cancer and processes of energy metabolism and inflammation, where a PUFA system is an active player. Only recently have methods been developed that allow for studying such complex systems. Using the Rank-based Random Forest (RF) model, we show that PUFA metabolism genes are critical for the pathogenesis of breast cancer (BC); BC subtypes differ in PUFA metabolism gene expression. The enrichment of BC subtypes with various genes associated with oxylipin signaling pathways indicates a different contribution of these compounds to the biology of subtypes.
Abstract: Polyunsaturated fatty acid (PUFA) metabolism is currently a focus in cancer research due to PUFAs functioning as structural components of the membrane matrix, as fuel sources for energy production, and as sources of secondary messengers, so-called oxylipins, which are important players in inflammatory processes. Although breast cancer (BC) is the leading cause of cancer death among women worldwide, no systematic study of PUFA metabolism as a system of interrelated processes in this disease has been carried out. Here, we implemented a Boruta-based feature selection algorithm to determine the list of most important PUFA metabolism genes altered in breast cancer tissues compared with normal tissues. A rank-based Random Forest (RF) model was built on the selected gene list (33 genes) and applied to predict the cancer phenotype to ascertain the PUFA genes involved in cancerogenesis. It showed high performance in dichotomous classification (balanced accuracy of 0.94, ROC AUC 0.99). We also retrieved a list of the important PUFA genes (46 genes) that differed between breast cancer molecular subtypes. The balanced accuracy of the classification model built on the specified genes was 0.82, while the ROC AUC for the sensitivity analysis was 0.85. Specific patterns of PUFA metabolic changes were obtained for each molecular subtype of breast cancer. These results show evidence that (1) PUFA metabolism genes are critical for the pathogenesis of breast cancer; (2) BC subtypes differ in PUFA metabolism gene expression; and (3) the lists of genes selected in the models are enriched with genes involved in the metabolism of signaling lipids.
Affiliation(s)
- Mariia V. Guryleva
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
| | - Dmitry D. Penzar
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
| | - Dmitry V. Chistyakov
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, 119992 Moscow, Russia
- Correspondence: ; Tel.: +7-495-939-4332
| | - Andrey A. Mironov
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
- Kharkevich Institute of Information Transmission Problems, Russian Academy of Sciences, 127051 Moscow, Russia
| | - Alexander V. Favorov
- Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Marina G. Sergeeva
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, 119992 Moscow, Russia
|
13
|
Zanella L, Facco P, Bezzo F, Cimetta E. Feature Selection and Molecular Classification of Cancer Phenotypes: A Comparative Study. Int J Mol Sci 2022; 23:ijms23169087. [PMID: 36012350 PMCID: PMC9408964 DOI: 10.3390/ijms23169087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 08/09/2022] [Accepted: 08/11/2022] [Indexed: 11/16/2022] Open
Abstract
The classification of high dimensional gene expression data is key to the development of effective diagnostic and prognostic tools. Feature selection involves finding the best subset with the highest power in predicting class labels. Here, we conducted a comparative study focused on different combinations of feature selectors (Chi-Squared, mRMR, Relief-F, and Genetic Algorithms) and classification learning algorithms (Random Forests, PLS-DA, SVM, Regularized Logistic/Multinomial Regression, and kNN) to identify those with the best predictive capacity. The performance of each combination is evaluated through an empirical study on three benchmark cancer-related microarray datasets. Our results first suggest that the quality of the data relevant to the target classes is key for the successful classification of cancer phenotypes. We also proved that, for a given classification learning algorithm and dataset, all filters have a similar performance. Interestingly, filters achieve comparable or even better results with respect to the GA-based wrappers, while also being easier and faster to implement. Taken together, our findings suggest that simple, well-established feature selectors in combination with optimized classifiers guarantee good performances, with no need for complicated and computationally demanding methodologies.
Affiliation(s)
- Luca Zanella
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Pierantonio Facco
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Fabrizio Bezzo
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
| | - Elisa Cimetta
- Department of Industrial Engineering (DII), University of Padova, 35131 Padova, Italy
- Fondazione Istituto di Ricerca Pediatrica Città della Speranza (IRP), 35127 Padova, Italy
- Correspondence:
|
14
|
Virtual reality for the observation of oncology models (VROOM): immersive analytics for oncology patient cohorts. Sci Rep 2022; 12:11337. [PMID: 35790803 PMCID: PMC9256599 DOI: 10.1038/s41598-022-15548-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 02/16/2022] [Accepted: 06/24/2022] [Indexed: 11/08/2022] Open
Abstract
The significant advancement of inexpensive and portable virtual reality (VR) and augmented reality devices has re-energised research in the immersive analytics field. The immersive environment differs from a traditional 2D display used to analyse 3D data in that it provides a unified environment that supports immersion in a 3D scene, gestural interaction, haptic feedback and spatial audio. Genomic data analysis has been used in oncology to better understand the relationship between genetic profile, cancer type, and treatment option. This paper proposes VROOM (virtual reality to observe oncology data models), a novel immersive analytics tool for cancer patient cohorts in a virtual reality environment. We utilise immersive technologies to analyse the gene expression and clinical data of a cohort of cancer patients. Various machine learning algorithms and visualisation methods have also been deployed in VR to enhance the data interrogation process. This is supported by established 2D visual analytics and graphical methods in bioinformatics, such as scatter plots, descriptive statistical information, linear regression, box plots and heatmaps, which are integrated into our visualisation. Our approach allows clinicians to interrogate information that is familiar and meaningful to them while providing them with immersive analytics capabilities to make new discoveries toward personalised medicine.
|
15
|
EGFAFS: A Novel Feature Selection Algorithm Based on Explosion Gravitation Field Algorithm. ENTROPY 2022; 24:e24070873. [PMID: 35885095 PMCID: PMC9322764 DOI: 10.3390/e24070873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Revised: 06/15/2022] [Accepted: 06/22/2022] [Indexed: 02/04/2023]
Abstract
Feature selection (FS) is a vital step in data mining and machine learning, especially for analyzing data in a high-dimensional feature space. Gene expression data usually consist of a few samples characterized by a high-dimensional feature space. As a result, they are not suitable for processing by simple methods, such as filter-based methods. In this study, we propose a novel feature selection algorithm based on the Explosion Gravitation Field Algorithm, called EGFAFS. To reduce the feature space to acceptable dimensions, we constructed a recommended feature pool using a series of Random Forests based on the Gini index. Furthermore, by paying more attention to the features in the recommended feature pool, we can find the best subset more efficiently. To verify the performance of EGFAFS for FS, we tested EGFAFS on eight gene expression datasets and compared it with four heuristic-based FS methods (GA, PSO, SA, and DE) and four other FS methods (Boruta, HSICLasso, DNN-FS, and EGSG). The results show that, in terms of evaluation metrics, EGFAFS performs better for FS on gene expression data than the other eight FS algorithms. The genes selected by EGFAFS play an essential role in the differential co-expression network and in some biological functions, further demonstrating the success of EGFAFS in solving FS problems on gene expression data.
|
16
|
Mirzaei G. GraphChrom: A Novel Graph-Based Framework for Cancer Classification Using Chromosomal Rearrangement Endpoints. Cancers (Basel) 2022; 14:cancers14133060. [PMID: 35804833 PMCID: PMC9265123 DOI: 10.3390/cancers14133060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/18/2022] [Revised: 06/06/2022] [Accepted: 06/18/2022] [Indexed: 11/16/2022] Open
Abstract
Chromosomal rearrangements are generally a consequence of improperly repaired double-strand breaks in DNA. These genomic aberrations can be a driver of cancers. Here, we investigated the use of chromosomal rearrangements for the classification of cancer tumors and the effect of inter- and intrachromosomal rearrangements on cancer classification. We used data from the Catalogue of Somatic Mutations in Cancer (COSMIC) for breast, pancreatic, and prostate cancers, for which the COSMIC dataset reports the highest number of chromosomal aberrations. We developed a framework known as GraphChrom for cancer classification. GraphChrom was developed using a graph neural network which models the complex structure of chromosomal aberrations (CA) and provides local connectivity between the aberrations. The proposed framework makes three important contributions to the field of cancers. Firstly, it successfully classifies cancer types and subtypes. Secondly, it evolved into a novel data extraction technique which can be used to extract more informative graphs (informative aberrations associated with a sample). Thirdly, it predicts that interCAs (rearrangements between two or more chromosomes) are more effective in cancer prediction than intraCAs (rearrangements within the same chromosome), although intraCAs are three times more likely to occur than interCAs.
Affiliation(s)
- Golrokh Mirzaei
- Department of Computer Science and Engineering, Ohio State University, Marion, OH 403302, USA
|
17
|
Jha A, Quesnel-Vallières M, Wang D, Thomas-Tikhonenko A, Lynch KW, Barash Y. Identifying common transcriptome signatures of cancer by interpreting deep learning models. Genome Biol 2022; 23:117. [PMID: 35581644 PMCID: PMC9112525 DOI: 10.1186/s13059-022-02681-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 11/19/2021] [Accepted: 04/27/2022] [Indexed: 01/01/2023] Open
Abstract
Background: Cancer is a set of diseases characterized by unchecked cell proliferation and invasion of surrounding tissues. The many genes that have been genetically associated with cancer or shown to directly contribute to oncogenesis vary widely between tumor types, but common gene signatures that relate to core cancer pathways have also been identified. It is not clear, however, whether there exist additional sets of genes or transcriptomic features that are less well known in cancer biology but that are also commonly deregulated across several cancer types.
Results: Here, we agnostically identify transcriptomic features that are commonly shared between cancer types using 13,461 RNA-seq samples from 19 normal tissue types and 18 solid tumor types to train three feed-forward neural networks, based either on protein-coding gene expression, lncRNA expression, or splice junction use, to distinguish between normal and tumor samples. All three models recognize transcriptome signatures that are consistent across tumors. Analysis of attribution values extracted from our models reveals that genes that are commonly altered in cancer by expression or splicing variations are under strong evolutionary and selective constraints. Importantly, we find that genes composing our cancer transcriptome signatures are not frequently affected by mutations or genomic alterations and that their functions differ widely from the genes genetically associated with cancer.
Conclusions: Our results highlighted that deregulation of RNA-processing genes and aberrant splicing are pervasive features on which core cancer pathways might converge across a large array of solid tumor types.
Affiliation(s)
- Anupama Jha
- Department of Computer and Information Science, School of Engineering and Applied Science, Philadelphia, USA.
| | - Mathieu Quesnel-Vallières
- Department of Genetics, Philadelphia, USA; Department of Biochemistry and Biophysics, Philadelphia, USA.
| | - David Wang
- Department of Genetics, Philadelphia, USA
| | - Andrei Thomas-Tikhonenko
- Department of Pathology and Laboratory Medicine, Philadelphia, USA; Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA; Division of Cancer Pathobiology, Children's Hospital of Philadelphia, Philadelphia, USA
| | - Kristen W Lynch
- Department of Biochemistry and Biophysics, Philadelphia, USA
| | - Yoseph Barash
- Department of Computer and Information Science, School of Engineering and Applied Science, Philadelphia, USA; Department of Genetics, Philadelphia, USA.
|
18
|
The ability to classify patients based on gene-expression data varies by algorithm and performance metric. PLoS Comput Biol 2022; 18:e1009926. [PMID: 35275931 PMCID: PMC8942277 DOI: 10.1371/journal.pcbi.1009926] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/10/2021] [Revised: 03/23/2022] [Accepted: 02/15/2022] [Indexed: 01/02/2023] Open
Abstract
By classifying patients into subgroups, clinicians can provide more effective care than using a uniform approach for all patients. Such subgroups might include patients with a particular disease subtype, patients with a good (or poor) prognosis, or patients most (or least) likely to respond to a particular therapy. Transcriptomic measurements reflect the downstream effects of genomic and epigenomic variations. However, high-throughput technologies generate thousands of measurements per patient, and complex dependencies exist among genes, so it may be infeasible to classify patients using traditional statistical models. Machine-learning classification algorithms can help with this problem. However, hundreds of classification algorithms exist, and most support diverse hyperparameters, so it is difficult for researchers to know which are optimal for gene-expression biomarkers. We performed a benchmark comparison, applying 52 classification algorithms to 50 gene-expression datasets (143 class variables). We evaluated algorithms that represent diverse machine-learning methodologies and have been implemented in general-purpose, open-source, machine-learning libraries. When available, we combined clinical predictors with gene-expression data. Additionally, we evaluated the effects of performing hyperparameter optimization and feature selection using nested cross validation. Kernel- and ensemble-based algorithms consistently outperformed other types of classification algorithms; however, even the top-performing algorithms performed poorly in some cases. Hyperparameter optimization and feature selection typically improved predictive performance, and univariate feature-selection algorithms typically outperformed more sophisticated methods. Together, our findings illustrate that algorithm performance varies considerably when other factors are held constant and thus that algorithm selection is a critical step in biomarker studies.
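The nested cross-validation mentioned in this abstract can be sketched as two loops: an inner loop that picks a hyperparameter using only the training portion, and an outer loop that scores the chosen setting on data never used for tuning. The example below is a minimal illustration under stated assumptions, not the benchmark's actual code: the one-dimensional k-nearest-neighbour classifier, the stride-based fold splitter, the parameter grid and the toy data are all hypothetical choices made to keep the sketch self-contained.

```python
def kfold_indices(n, k):
    """Deterministic k-fold split of range(n): fold i holds i, i+k, i+2k, ..."""
    return [list(range(i, n, k)) for i in range(k)]


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest 1-D training points (toy classifier)."""
    nearest = sorted(range(len(train_X)), key=lambda i: abs(train_X[i] - x))[:k]
    votes = [train_y[i] for i in nearest]
    return max(set(votes), key=votes.count)


def accuracy(train_X, train_y, test_X, test_y, k):
    hits = sum(knn_predict(train_X, train_y, x, k) == t for x, t in zip(test_X, test_y))
    return hits / len(test_y)


def nested_cv(X, y, param_grid, outer_k=5, inner_k=3):
    """Return one held-out accuracy per outer fold, tuning k in the inner folds."""
    n = len(X)
    outer_scores = []
    for test_idx in kfold_indices(n, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        # Inner loop: choose the hyperparameter without touching the outer test fold.
        best_param, best_score = None, -1.0
        for param in param_grid:
            scores = []
            for val_pos in kfold_indices(len(train_idx), inner_k):
                val_set = set(val_pos)
                fit_idx = [train_idx[p] for p in range(len(train_idx)) if p not in val_set]
                val_idx = [train_idx[p] for p in val_pos]
                scores.append(accuracy([X[i] for i in fit_idx], [y[i] for i in fit_idx],
                                       [X[i] for i in val_idx], [y[i] for i in val_idx],
                                       param))
            mean_score = sum(scores) / len(scores)
            if mean_score > best_score:
                best_param, best_score = param, mean_score
        # Outer loop: evaluate the tuned setting on the untouched fold.
        outer_scores.append(accuracy([X[i] for i in train_idx], [y[i] for i in train_idx],
                                     [X[i] for i in test_idx], [y[i] for i in test_idx],
                                     best_param))
    return outer_scores


# Hypothetical, trivially separable data: class 0 near 0, class 1 near 10.
X = [10.0 * (i % 2) + 0.01 * i for i in range(30)]
y = [i % 2 for i in range(30)]
outer_scores = nested_cv(X, y, param_grid=[1, 3])
print(sum(outer_scores) / len(outer_scores))
```

The point of the structure is that the outer score is never used to choose a hyperparameter; any feature selection step would belong inside the inner loop for the same reason.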
|
19
|
Combination of Reduction Detection Using TOPSIS for Gene Expression Data Analysis. BIG DATA AND COGNITIVE COMPUTING 2022. [DOI: 10.3390/bdcc6010024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
In high-dimensional data analysis, Feature Selection (FS) is one of the most fundamental issues in machine learning and requires the attention of researchers. These datasets are characterized by a huge feature space due to a high number of features, out of which only a few are significant for analysis. Thus, extracting the significant features is crucial. There are various techniques available for feature selection; among them, the filter techniques are significant in this community, as they can be used with any type of learning algorithm, drastically lower the running time of optimization algorithms, and improve the performance of the model. Furthermore, the application of a filter approach depends on the characteristics of the dataset as well as on the machine learning model. Thus, to avoid these issues, this research considers a combination of feature reduction (CFR), designing a pipeline of filter approaches for high-dimensional microarray data classification. Considering four filter approaches, sixteen combinations of pipelines are generated. The feature subset is reduced at different levels, and ultimately, the significant feature set is evaluated. The pipelined filter techniques are Correlation-Based Feature Selection (CBFS), Chi-Square Test (CST), Information Gain (InG), and Relief Feature Selection (RFS), and the classification techniques are Decision Tree (DT), Logistic Regression (LR), Random Forest (RF), and k-Nearest Neighbor (k-NN). The performance of CFR depends highly on the datasets as well as on the classifiers. Thereafter, the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) method is used to rank all reduction combinations and identify the superior filter combination among them.
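The TOPSIS ranking step named in this abstract can be illustrated with the textbook procedure: normalise and weight the decision matrix, find the ideal and anti-ideal points, and score each alternative by its relative closeness to the ideal. The sketch below is a generic version under stated assumptions; the criteria (accuracy, number of retained features, runtime), the weights, and the numbers are hypothetical and are not results from the paper, and the pipeline labels merely reuse the abstract's abbreviations.

```python
import math


def topsis(matrix, weights, benefit):
    """Score alternatives (rows) over criteria (columns) with standard TOPSIS.

    weights : one weight per criterion (rescaled here to sum to 1)
    benefit : True where larger is better, False where smaller is better
    Returns closeness scores in [0, 1]; higher means closer to the ideal.
    """
    n_crit = len(weights)
    w_sum = sum(weights)
    w = [x / w_sum for x in weights]
    # Step 1: vector-normalise each column, then apply the weights.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n_crit)]
    v = [[w[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Step 2: ideal and anti-ideal points, criterion by criterion.
    cols = list(zip(*v))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(cols)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(cols)]
    # Step 3: relative closeness to the ideal point.
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti)
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.5)
    return scores


# Hypothetical decision matrix: [accuracy, retained features, runtime in seconds].
pipelines = ["CBFS->CST", "InG->RFS", "CST->InG"]
decision_matrix = [
    [0.90, 50, 12.0],
    [0.85, 30, 8.0],
    [0.80, 80, 20.0],
]
scores = topsis(decision_matrix, weights=[0.5, 0.25, 0.25], benefit=[True, False, False])
ranking = sorted(zip(pipelines, scores), key=lambda item: item[1], reverse=True)
print(ranking)
```

The third hypothetical pipeline is worst on every criterion, so it coincides with the anti-ideal point and ranks last; the order of the other two depends on the chosen weights.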
|
20
|
Rout S, Mallick PK, Mishra D. DRBF-DS: Double RBF Kernel-Based Deep Sampling with CNNs to Handle Complex Imbalanced Datasets. ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING 2022. [DOI: 10.1007/s13369-021-06480-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/02/2022]
|
21
|
Wang A, Liu H, Yang J, Chen G. Ensemble feature selection for stable biomarker identification and cancer classification from microarray expression data. Comput Biol Med 2022; 142:105208. [PMID: 35016102 DOI: 10.1016/j.compbiomed.2021.105208] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 11/29/2021] [Revised: 12/19/2021] [Accepted: 12/31/2021] [Indexed: 01/31/2023]
Abstract
Microarray technology facilitates the simultaneous measurement of the expression of tens of thousands of genes and enables us to study cancers and tumors at the molecular level. Because microarray data are typically characterized by small sample size and high dimensionality, accurate and stable feature selection is of fundamental importance to diagnostic accuracy and to a deep understanding of disease mechanisms. Hence, in this study we present an ensemble feature selection framework to improve the discrimination and stability of the finally selected features. Specifically, we utilize sampling techniques to obtain multiple sampled datasets, from each of which a base feature selector selects a subset of features. Afterwards, we develop two aggregation strategies to combine the multiple feature subsets into one set. Finally, comparative experiments are conducted on four publicly available microarray datasets covering both binary and multi-class cases in terms of classification accuracy and three stability metrics. Results show that the proposed method obtains better stability scores and achieves classification performance comparable to, and even better than, that of its competitors.
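The resample, select, aggregate pattern described in this abstract can be sketched as follows. This is an illustration under explicit assumptions rather than the authors' method: the base selector (absolute difference of class means), the stratified bootstrap, the vote-count aggregation, and the mean pairwise Jaccard stability score are all stand-ins chosen for brevity, since the abstract does not specify its two aggregation strategies or its three stability metrics. The toy data are synthetic.

```python
import random
from collections import Counter
from itertools import combinations


def mean_diff_selector(X, y, k):
    """Assumed base filter: top-k features by |mean(class 1) - mean(class 0)|."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def stratified_bootstrap(y, rng):
    """Resample indices with replacement inside each class, keeping class sizes."""
    idx = []
    for label in sorted(set(y)):
        members = [i for i, v in enumerate(y) if v == label]
        idx.extend(rng.choices(members, k=len(members)))
    return idx


def ensemble_select(X, y, k, n_rounds=25, seed=0):
    """Run the base selector on bootstrap samples and aggregate by vote count."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_rounds):
        idx = stratified_bootstrap(y, rng)
        subsets.append(mean_diff_selector([X[i] for i in idx], [y[i] for i in idx], k))
    votes = Counter(f for subset in subsets for f in subset)
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f for f, _ in ranked[:k]], subsets


def mean_jaccard(subsets):
    """Stability as the mean pairwise Jaccard similarity of the selected subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(len(set(a) & set(b)) / len(set(a) | set(b)) for a, b in pairs) / len(pairs)


# Synthetic toy data: features 0 and 3 separate the classes; the other four are noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[5.0 * label + data_rng.random() if j in (0, 3) else data_rng.random()
      for j in range(6)] for label in y]

selected, subsets = ensemble_select(X, y, k=2)
print(selected, mean_jaccard(subsets))
```

On real microarray data the subsets would differ between bootstrap rounds, and the stability score would fall below the perfect agreement this deliberately clean toy set produces.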
Affiliation(s)
- Aiguo Wang
- School of Electronic Information Engineering, Foshan University, Foshan, China.
| | - Huancheng Liu
- School of Electronic Information Engineering, Foshan University, Foshan, China.
| | - Jing Yang
- School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China.
| | - Guilin Chen
- School of Computer and Information Engineering, Chuzhou University, Chuzhou, China.
|
22
|
Mori Y, Yokota H, Hoshino I, Iwatate Y, Wakamatsu K, Uno T, Suyari H. Deep learning-based gene selection in comprehensive gene analysis in pancreatic cancer. Sci Rep 2021; 11:16521. [PMID: 34389782 PMCID: PMC8363643 DOI: 10.1038/s41598-021-95969-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/28/2021] [Accepted: 07/29/2021] [Indexed: 12/14/2022] Open
Abstract
The selection of genes that are important for obtaining gene expression data is challenging. Here, we developed a deep learning-based feature selection method suitable for gene selection. Our novel deep learning model includes an additional feature-selection layer. After model training, the units in this layer with high weights correspond to the genes that worked effectively in the processing of the networks. Cancer tissue samples and adjacent normal pancreatic tissue samples were collected from 13 patients with pancreatic ductal adenocarcinoma during surgery and subsequently frozen. After processing, gene expression data were extracted from the specimens using RNA sequencing. Task 1 for the model training was to discriminate between cancerous and normal pancreatic tissue in six patients. Task 2 was to discriminate between patients with pancreatic cancer (n = 13) who survived for more than one year after surgery. The most frequently selected genes were ACACB, ADAMTS6, NCAM1, and CADPS in Task 1, and CD1D, PLA2G16, DACH1, and SOWAHA in Task 2. According to The Cancer Genome Atlas dataset, these genes are all prognostic factors for pancreatic cancer. Thus, the feasibility of using our deep learning-based method for the selection of genes associated with pancreatic cancer development and prognosis was confirmed.
Affiliation(s)
- Yasukuni Mori
- Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba, 263-8522, Japan.
| | - Hajime Yokota
- Department of Diagnostic Radiology and Radiation Oncology, Graduate School of Medicine, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba-shi, Chiba, 260-8670, Japan
| | - Isamu Hoshino
- Division of Gastroenterological Surgery, Chiba Cancer Center, 666-2 Nitona-cho, Chuo-ku, Chiba-shi, Chiba, 260-8717, Japan
| | - Yosuke Iwatate
- Division of Hepato-Biliary-Pancreatic Surgery, Chiba Cancer Center, 666-2 Nitona-cho, Chuo-ku, Chiba-shi, Chiba, 260-8717, Japan
| | - Kohei Wakamatsu
- Media Data Tech Studio, CyberAgent, Inc., 13F Akihabara Daibiru, 1-18-13 Sotokanda, Chiyoda-ku, Tokyo, 101-0021, Japan
| | - Takashi Uno
- Department of Diagnostic Radiology and Radiation Oncology, Graduate School of Medicine, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba-shi, Chiba, 260-8670, Japan
| | - Hiroki Suyari
- Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba, 263-8522, Japan
|
23
|
Del Giudice M, Peirone S, Perrone S, Priante F, Varese F, Tirtei E, Fagioli F, Cereda M. Artificial Intelligence in Bulk and Single-Cell RNA-Sequencing Data to Foster Precision Oncology. Int J Mol Sci 2021; 22:ijms22094563. [PMID: 33925407 PMCID: PMC8123853 DOI: 10.3390/ijms22094563] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/20/2021] [Revised: 04/21/2021] [Accepted: 04/23/2021] [Indexed: 02/01/2023] Open
Abstract
Artificial intelligence, or the discipline of developing computational algorithms able to perform tasks that require human intelligence, offers the opportunity to improve our idea and delivery of precision medicine. Here, we provide an overview of artificial intelligence approaches for the analysis of large-scale RNA-sequencing datasets in cancer. We present the major solutions to disentangle inter- and intra-tumor heterogeneity of transcriptome profiles for an effective improvement of patient management. We outline the contributions of learning algorithms to the needs of cancer genomics, from identifying rare cancer subtypes to personalizing therapeutic treatments.
Affiliation(s)
- Marco Del Giudice
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Candiolo Cancer Institute, FPO—IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
| | - Serena Peirone
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Department of Physics and INFN, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Sarah Perrone
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Department of Physics, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Francesca Priante
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Department of Physics, Università degli Studi di Torino, via P.Giuria 1, 10125 Turin, Italy
| | - Fabiola Varese
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Department of Life Science and System Biology, Università degli Studi di Torino, via Accademia Albertina 13, 10123 Turin, Italy
| | - Elisa Tirtei
- Paediatric Onco-Haematology Division, Regina Margherita Children’s Hospital, City of Health and Science of Turin, 10126 Turin, Italy; (E.T.); (F.F.)
| | - Franca Fagioli
- Paediatric Onco-Haematology Division, Regina Margherita Children’s Hospital, City of Health and Science of Turin, 10126 Turin, Italy; (E.T.); (F.F.)
- Department of Public Health and Paediatric Sciences, University of Torino, 10124 Turin, Italy
| | - Matteo Cereda
- Cancer Genomics and Bioinformatics Unit, IIGM—Italian Institute for Genomic Medicine, c/o IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy; (M.D.G.); (S.P.); (S.P.); (F.P.); (F.V.)
- Candiolo Cancer Institute, FPO—IRCCS, Str. Prov.le 142, km 3.95, 10060 Candiolo, TO, Italy
- Correspondence: ; Tel.: +39-011-993-3969
|
24
|
Seo H, Cho DH. Feature selection algorithm based on dual correlation filters for cancer-associated somatic variants. BMC Bioinformatics 2020; 21:486. [PMID: 33121438 PMCID: PMC7596964 DOI: 10.1186/s12859-020-03767-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/23/2019] [Accepted: 09/18/2020] [Indexed: 12/30/2022] Open
Abstract
Background: Since the development of sequencing technology, an enormous amount of genetic information has been generated, and human cancer analysis using this information is drawing attention. As the effects of variants on human cancer become known, it is important to find cancer-associated variants among countless variants.
Results: We propose a new filter-based feature selection method for extracting cancer-associated somatic variants that takes correlations in the data into account. Variants associated with both the activation and the deactivation of cancer characteristics are analyzed using dual correlation filters. Multiobjective optimization is used to consider the two types of variants simultaneously without redundancy. To overcome the problem of high computational complexity, we calculate a correlation-based weight to select significant variants instead of directly searching for the optimal subset of variants. The proposed algorithm is applied to the identification of melanoma metastasis or breast cancer stage, and the classification results of the proposed method are compared with those of a conventional single correlation filter-based method.
Conclusions: We verified that the proposed dual correlation filter-based method can extract cancer-associated variants related to the characteristics of human cancer.
Affiliation(s)
- Hyein Seo
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141, Daejeon, Republic of Korea
| | - Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, 34141, Daejeon, Republic of Korea.
|
25
|
Afshar M, Usefi H. High-dimensional feature selection for genomic datasets. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
|
26
|
Xu D, Zhang J, Xu H, Zhang Y, Chen W, Gao R, Dehmer M. Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genomics 2020; 21:650. [PMID: 32962626 PMCID: PMC7510277 DOI: 10.1186/s12864-020-07038-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/18/2020] [Accepted: 08/30/2020] [Indexed: 12/19/2022] Open
Abstract
Background: The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field.
Results: In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW that uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm and a feature selection wrapper. McbfsNW has been applied to lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that the identified biomarkers attain better prediction results on the independent LUAD data set, and we also constructed a drug-target network which may be useful for LUAD therapy.
Conclusions: The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other kinds of data sets.
Affiliation(s)
- Da Xu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Jialin Zhang
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Hanxiao Xu
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Yusen Zhang
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Wei Chen
- School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Rui Gao
- School of Control Science and Engineering, Shandong University, Jinan, 250061, China
- Matthias Dehmer
- Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, Steyr, Austria; College of Computer and Control Engineering, Nankai University, Tianjin, 300071, China; Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
|
27
|
Kalina J, Matonoha C. A sparse pair-preserving centroid-based supervised learning method for high-dimensional biomedical data or images. Biocybern Biomed Eng 2020. [DOI: 10.1016/j.bbe.2020.03.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/25/2022]
|
28
|
Han X, Li D, Liu P, Wang L. Feature selection by recursive binary gravitational search algorithm optimization for cancer classification. Soft Comput 2020. [DOI: 10.1007/s00500-019-04203-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
|
29
|
Specific glioblastoma multiforme prognostic-subtype distinctions based on DNA methylation patterns. Cancer Gene Ther 2019; 27:702-714. [PMID: 31619751 DOI: 10.1038/s41417-019-0142-6] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/20/2019] [Revised: 10/01/2019] [Accepted: 10/04/2019] [Indexed: 12/14/2022]
Abstract
DNA methylation is an important regulator of gene expression and plays a significant role in carcinogenesis in the brain. Here, we explored specific prognosis subtypes based on DNA methylation status using 138 glioblastoma multiforme (GBM) samples from The Cancer Genome Atlas (TCGA) database. The methylation profiles of 11,637 CpG sites that significantly correlated with survival in the training set were employed for consensus clustering. We identified three GBM molecular subtypes whose survival curves were distinct from each other. Furthermore, ten feature CpG sites were obtained by conducting a weighted gene co-expression network analysis (WGCNA) of the CpG sites. After cluster analysis of the training set samples using the hierarchical clustering algorithm, we were able to classify the samples into high- and low-methylation groups and to stratify their prognosis accordingly. Similar results were obtained in the test set and clinical GBM specimens. Finally, we found that a positive relationship existed between methylation level and sensitivity to temozolomide (or radiotherapy) or anti-migration ability of GBM cells. Taken together, these results suggest that the model constructed in this study could help explain the heterogeneity of previous molecular subgroups in GBM and can provide guidance to clinicians regarding the prognosis of GBM.
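The step of splitting samples into high- and low-methylation groups by hierarchical clustering can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it uses average-linkage agglomerative clustering with Euclidean distance, three CpG sites instead of ten, and made-up beta values. All function names are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def average_linkage(a, b, points):
    """Mean pairwise distance between the members of clusters a and b."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(points, n_clusters=2):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = average_linkage(clusters[p], clusters[q], points)
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # q > p, so index p is unaffected
    return clusters

def label_high_low(points, clusters):
    """Given exactly two clusters, call the one with the larger mean value 'high'."""
    def mean_beta(cluster):
        return sum(sum(points[i]) for i in cluster) / (len(cluster) * len(points[0]))
    low, high = sorted(clusters, key=mean_beta)
    groups = {i: "low" for i in low}
    groups.update({i: "high" for i in high})
    return [groups[i] for i in range(len(points))]

# Made-up methylation beta values (0 to 1): 6 samples x 3 CpG sites.
betas = [
    [0.10, 0.15, 0.20],
    [0.12, 0.18, 0.22],
    [0.80, 0.85, 0.90],
    [0.78, 0.82, 0.88],
    [0.15, 0.10, 0.25],
    [0.85, 0.80, 0.92],
]
groups = label_high_low(betas, agglomerate(betas))
print(groups)
```

With real data, the group labels would then be compared against survival outcomes, for example with Kaplan-Meier curves, which this sketch does not cover.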
|
30
|
Shi M, Wang J, Zhang C. Integration of Cancer Genomics Data for Tree-based Dimensionality Reduction and Cancer Outcome Prediction. Mol Inform 2019; 39:e1900028. [PMID: 31490641 DOI: 10.1002/minf.201900028] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/20/2019] [Accepted: 08/22/2019] [Indexed: 11/10/2022]
Abstract
Accurate outcome prediction is crucial for precision medicine and personalized treatment of cancer. Researchers have found that multi-dimensional cancer omics studies outperform studies of any single data type (mRNA, microRNA, methylation or somatic copy number alteration) in human disease research. Existing methods leveraging multiple levels of molecular data often suffer from various limitations, e.g., heterogeneity, poor robustness or loss of generality. To overcome these limitations, we presented a tree-based dimensionality reduction approach for the identification of a smooth tree graph and developed an accurate predictive model for clinical outcome prediction. We demonstrated that 1) Discriminative Dimensionality Reduction via learning a Tree (DDRTree) achieved a reduced-dimension space tree with statistical significance; 2) the tree-based support vector machine (SVM) classifier improved prediction performance for cancer recurrence compared with the t-test-based SVM classifier; 3) the tree-based SVM classifier was robust with regard to the number of multi-markers; 4) combining multiple omics data improved prediction performance for cancer recurrence compared with single-omics data; and 5) the tree-based SVM classifier achieved similar or better prediction performance when compared with features from state-of-the-art feature selection methods. Our results demonstrated the great potential of clinical outcome prediction based on tree-based dimensionality reduction.
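The t-test-based baseline mentioned above typically ranks features by a two-sample t statistic before a classifier is trained on the top-ranked ones. The following Python sketch shows only that ranking step, assuming Welch's unequal-variance form of the statistic; the abstract does not specify which variant the authors used, and the function names and toy data are hypothetical. The SVM and DDRTree steps are not reproduced here.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

def rank_features(X, y, k):
    """Return the indices of the k features with the largest |t| between
    class 1 and class 0. X is a list of sample rows, y a list of 0/1 labels."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

# Hypothetical toy data: 6 samples x 3 features, e.g. recurrence (1) vs none (0).
y = [0, 0, 0, 1, 1, 1]
X = [
    [1.0, 5.0, 0.2],
    [1.1, 3.0, 0.9],
    [0.9, 4.0, 0.5],
    [2.0, 4.0, 0.6],
    [2.1, 5.0, 0.3],
    [1.9, 3.0, 0.8],
]
top = rank_features(X, y, k=2)
print(top)
```

Feature 0 separates the two classes clearly and ranks first, while feature 1 has identical class means and ranks last. The selected columns would then be passed to a classifier such as an SVM.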
Affiliation(s)
- Mingguang Shi
- School of Electric Engineering and Automation, Hefei University of Technology, Hefei, Anhui, 230009, China
- Junwen Wang
- School of Electric Engineering and Automation, Hefei University of Technology, Hefei, Anhui, 230009, China
- Chenyu Zhang
- School of Electric Engineering and Automation, Hefei University of Technology, Hefei, Anhui, 230009, China
|