1
Paul D, Sinnarasan VSP, Das R, Sheikh MMR, Venkatesan A. Machine learning approach to predict blood-secretory proteins and potential biomarkers for liver cancer using omics data. J Proteomics 2024; 309:105298. [PMID: 39216516 DOI: 10.1016/j.jprot.2024.105298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 08/22/2024] [Accepted: 08/29/2024] [Indexed: 09/04/2024]
Abstract
Identifying non-invasive blood-based biomarkers is crucial for early detection and monitoring of liver cancer (LC), thereby improving patient outcomes. This study leveraged computational approaches to predict potential blood-based biomarkers for LC. Machine learning (ML) models were developed using selected features from blood-secretory proteins collected from curated databases. The logistic regression (LR) model demonstrated the best performance. Transcriptome analysis across 7 LC cohorts revealed 231 common differentially expressed genes (DEGs). The encoded proteins of these DEGs were compared with the ML dataset, revealing 29 proteins overlapping with the blood-secretory dataset. Among the remaining protein-coding genes, the LR model predicted 29 additional proteins as blood-secretory. As a result, 58 potential blood-secretory proteins were obtained. Among the top 20 genes, 13 common hub genes were identified. Further, area under the receiver operating characteristic curve (ROC AUC) analysis was performed to assess the genes as potential diagnostic blood biomarkers. Six genes, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6, exhibited an AUC value higher than 0.85 and were predicted as blood-secretory. This study highlights the potential of an integrative computational approach for discovering non-invasive blood-based biomarkers in LC, facilitating further validation and clinical translation. SIGNIFICANCE: Liver cancer is one of the leading causes of premature death worldwide, with its prevalence and mortality rates projected to increase. Although current diagnostic methods are highly sensitive, they are invasive and unsuitable for repeated testing. Blood biomarkers offer a promising non-invasive alternative, but the wide dynamic range of protein concentrations in blood poses experimental challenges. Therefore, utilizing available omics data to develop a diagnostic model could provide a potential solution for accurate diagnosis.
This study developed a computational method integrating machine learning and bioinformatics analysis to identify potential blood biomarkers. As a result, ESM1, FCN2, MDK, GPC3, CTHRC1 and COL6A6 biomarkers were identified, holding significant promise for improving diagnosis and understanding of liver cancer. The integrated method can be applied to other cancers, offering a possible solution for early detection and improved patient outcomes.
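The ROC AUC screening step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the gene names and expression values are hypothetical toy data, the function names are invented for the example, and scoring downregulated genes by 1 - AUC is an assumption. Only the 0.85 cutoff comes from the abstract.

```python
from typing import Dict, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random tumour sample (label 1)
    outscores a random control (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_genes(expression: Dict[str, Sequence[float]],
                 labels: Sequence[int],
                 cutoff: float = 0.85) -> Dict[str, float]:
    """Keep genes whose single-gene AUC exceeds the cutoff.

    Assumption: a gene that is lower in tumours is equally informative,
    so its AUC is taken as max(auc, 1 - auc).
    """
    selected = {}
    for gene, values in expression.items():
        auc = roc_auc(values, labels)
        auc = max(auc, 1.0 - auc)
        if auc > cutoff:
            selected[gene] = auc
    return selected


# Hypothetical toy data: four tumour samples followed by four controls.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
expression = {
    "GENE_A": [9.1, 8.7, 7.9, 8.2, 5.0, 6.1, 5.5, 8.0],
    "GENE_B": [5.0, 6.2, 5.8, 6.0, 5.9, 6.1, 5.1, 6.3],
}
candidates = select_genes(expression, labels)
```

With real data, the per-gene values would be normalized expression across a cohort, and each cohort would normally be evaluated separately.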
Affiliation(s)
- Dahrii Paul
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Rajesh Das
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India
- Amouda Venkatesan
- Department of Bioinformatics, Pondicherry University, Puducherry 605014, India.
2
Kundu P, Beura S, Mondal S, Das AK, Ghosh A. Machine learning for the advancement of genome-scale metabolic modeling. Biotechnol Adv 2024; 74:108400. [PMID: 38944218 DOI: 10.1016/j.biotechadv.2024.108400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 05/13/2024] [Accepted: 06/23/2024] [Indexed: 07/01/2024]
Abstract
Constraint-based modeling (CBM) has evolved as the core systems biology tool to map the interrelations between genotype, phenotype, and external environment. The recent advancement of high-throughput experimental approaches and multi-omics strategies has generated a plethora of new and precise information from wide-ranging biological domains. On the other hand, the continuously growing field of machine learning (ML) and its specialized branch of deep learning (DL) provide essential computational architectures for decoding complex and heterogeneous biological data. In recent years, both multi-omics and ML have assisted in the escalation of CBM. Condition-specific omics data, such as transcriptomics and proteomics, helped contextualize the model prediction while analyzing a particular phenotypic signature. At the same time, the advanced ML tools have eased the model reconstruction and analysis to increase the accuracy and prediction power. However, the development of these multi-disciplinary methodological frameworks mainly occurs independently, which limits the concatenation of biological knowledge from different domains. Hence, we have reviewed the potential of integrating multi-disciplinary tools and strategies from various fields, such as synthetic biology, CBM, omics, and ML, to explore the biochemical phenomenon beyond the conventional biological dogma. How the integrative knowledge of these intersected domains has improved bioengineering and biomedical applications has also been highlighted. We categorically explained the conventional genome-scale metabolic model (GEM) reconstruction tools and their improvement strategies through ML paradigms. Further, the crucial role of ML and DL in omics data restructuring for GEM development has also been briefly discussed. Finally, the case-study-based assessment of the state-of-the-art method for improving biomedical and metabolic engineering strategies has been elaborated. 
Therefore, this review demonstrates how integrating experimental and in silico strategies can help map the ever-expanding knowledge of biological systems driven by condition-specific cellular information. This multiview approach will elevate the application of ML-based CBM in the biomedical and bioengineering fields for the betterment of society and the environment.
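For orientation, the constraint-based modeling discussed in this abstract is most often posed as a flux balance analysis problem. The abstract does not state the formulation, so the sketch below uses the conventional symbols as assumptions: $S$ is the stoichiometric matrix, $v$ the vector of reaction fluxes, $c$ the objective weights (for example, a biomass reaction), and $v^{\min}$, $v^{\max}$ the flux bounds.

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{subject to}\quad & S v = 0, \\
& v^{\min} \le v \le v^{\max}
\end{aligned}
```

One common way for condition-specific omics data to enter such a model is through the bounds, for instance by tightening $v^{\max}$ for reactions whose enzymes are weakly expressed.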
Affiliation(s)
- Pritam Kundu
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Satyajit Beura
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Suman Mondal
- P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India
- Amit Kumar Das
- Department of Bioscience and Biotechnology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
- Amit Ghosh
- School of Energy Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal 721302, India; P.K. Sinha Centre for Bioenergy and Renewables, Indian Institute of Technology Kharagpur, West Bengal 721302, India.
3
Chellappan D, Rajaguru H. Machine Learning Meets Meta-Heuristics: Bald Eagle Search Optimization and Red Deer Optimization for Feature Selection in Type II Diabetes Diagnosis. Bioengineering (Basel) 2024; 11:766. [PMID: 39199724 PMCID: PMC11351847 DOI: 10.3390/bioengineering11080766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2024] [Revised: 07/10/2024] [Accepted: 07/22/2024] [Indexed: 09/01/2024] Open
Abstract
This article investigates the effectiveness of feature extraction and selection techniques in enhancing classifier accuracy for Type II Diabetes Mellitus (DM) detection using microarray gene data. To address the inherent high dimensionality of the data, three feature extraction (FE) methods are used, namely Short-Time Fourier Transform (STFT), Ridge Regression (RR), and Pearson's Correlation Coefficient (PCC). To further refine the data, meta-heuristic algorithms, namely Bald Eagle Search Optimization (BESO) and Red Deer Optimization (RDO), are utilized for feature selection. The performance of seven classification techniques, Non-Linear Regression (NLR), Linear Regression (LR), Gaussian Mixture Models (GMMs), Expectation Maximization (EM), Logistic Regression (LoR), Softmax Discriminant Classifier (SDC), and Support Vector Machine with Radial Basis Function kernel (SVM-RBF), is evaluated with and without feature selection. The analysis reveals that the combination of PCC with SVM-RBF achieved a promising accuracy of 92.85% even without feature selection. Notably, employing BESO with PCC and SVM-RBF maintained this high accuracy. However, the highest overall accuracy of 97.14% was achieved when RDO was used for feature selection alongside PCC and SVM-RBF. These findings highlight the potential of feature extraction and selection techniques, particularly RDO with PCC, in improving the accuracy of DM detection using microarray gene data.
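The abstract does not describe how PCC is applied to the microarray data. One common use of Pearson's correlation for dimensionality reduction is to rank genes by the absolute correlation between their expression and the class label, sketched below under that assumption. The gene names, values, and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def rank_genes_by_pcc(expression: Dict[str, Sequence[float]],
                      labels: Sequence[int],
                      top_k: int) -> List[str]:
    """Return the top_k genes by |correlation with the class label|."""
    ranked = sorted(expression,
                    key=lambda gene: abs(pearson(expression[gene], labels)),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical toy data: two controls (0) followed by two DM samples (1).
labels = [0, 0, 1, 1]
expression = {
    "GENE_X": [1.0, 2.0, 8.0, 9.0],   # tracks the label closely
    "GENE_Y": [5.0, 5.0, 5.0, 5.0],   # constant, carries no information
    "GENE_Z": [3.0, 1.0, 2.0, 4.0],   # weakly related to the label
}
top_genes = rank_genes_by_pcc(expression, labels, top_k=2)
```

The reduced gene set would then be passed to a feature selection step and a classifier such as those listed in the abstract.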
Affiliation(s)
- Dinesh Chellappan
- Department of Electrical and Electronics Engineering, KPR Institute of Engineering and Technology, Coimbatore 641 407, Tamil Nadu, India
- Harikumar Rajaguru
- Department of Electronics and Communication Engineering, Bannari Amman Institute of Technology, Sathyamangalam 638 401, Tamil Nadu, India
4
Wei HT, Xie LY, Liu YG, Deng Y, Chen F, Lv F, Tang LP, Hu BL. Elucidating the role of angiogenesis-related genes in colorectal cancer: a multi-omics analysis. Front Oncol 2024; 14:1413273. [PMID: 38962272 PMCID: PMC11220232 DOI: 10.3389/fonc.2024.1413273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Accepted: 05/31/2024] [Indexed: 07/05/2024] Open
Abstract
Background: Angiogenesis plays a pivotal role in colorectal cancer (CRC), yet its underlying mechanisms demand further exploration. This study aimed to elucidate the significance of angiogenesis-related genes (ARGs) in CRC through comprehensive multi-omics analysis. Methods: CRC patients were categorized according to ARG expression to form angiogenesis-related clusters (ARCs). We investigated the correlation between ARCs and patient survival, clinical features, consensus molecular subtypes (CMS), cancer stem cell (CSC) index, tumor microenvironment (TME), gene mutations, and response to immunotherapy. Utilizing three machine learning algorithms (LASSO, XGBoost, and Decision Tree), we screened key ARGs associated with ARCs, which were further validated in independent cohorts. A prognostic signature based on key ARGs was developed and analyzed at the scRNA-seq level. Validation of gene expression in external cohorts, clinical tissues, and blood samples was conducted via RT-PCR assay. Results: Two distinct ARC subtypes were identified and were significantly associated with patient survival, clinical features, CMS, CSC index, and TME, but not with gene mutations. Four genes (S100A4, COL3A1, TIMP1, and APP) were identified as key ARGs, capable of distinguishing ARC subtypes. The prognostic signature based on these genes effectively stratified patients into high- or low-risk categories. scRNA-seq analysis showed that these genes were predominantly expressed in immune cells rather than in cancer cells. Validation in two external cohorts and through clinical samples confirmed significant expression differences between CRC and controls. Conclusion: This study identified two ARG subtypes in CRC and highlighted four key genes associated with these subtypes, offering new insights into personalized CRC treatment strategies.
Affiliation(s)
- Hao-tang Wei
- Department of Gastrointestinal Surgery, Third Affiliated Hospital of Guangxi Medical University, Nanning, China
- Li-ye Xie
- Department of Research, Guangxi Medical University Cancer Hospital, Nanning, China
- Yong-gang Liu
- Department of Gastrointestinal Surgery, Third Affiliated Hospital of Guangxi Medical University, Nanning, China
- Ya Deng
- Department of Gastrointestinal Surgery, Third Affiliated Hospital of Guangxi Medical University, Nanning, China
- Feng Chen
- Department of Gastrointestinal Surgery, Third Affiliated Hospital of Guangxi Medical University, Nanning, China
- Feng Lv
- Department of Research, Guangxi Medical University Cancer Hospital, Nanning, China
- Li-ping Tang
- Department of Information, Library of Guangxi Medical University, Nanning, China
- Bang-li Hu
- Department of Research, Guangxi Medical University Cancer Hospital, Nanning, China
5
Ye C, Zhu S, Yuan J, Yuan X. FPR1, as a Potential Biomarker of Diagnosis and Infliximab Therapy Responses for Crohn's Disease, is Related to Disease Activity, Inflammation and Macrophage Polarization. J Inflamm Res 2024; 17:3949-3966. [PMID: 38911989 PMCID: PMC11193993 DOI: 10.2147/jir.s459819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Accepted: 06/12/2024] [Indexed: 06/25/2024] Open
Abstract
Purpose: Crohn's disease (CD) is a multifaceted inflammatory gastrointestinal condition, and unraveling its molecular pathways is important for improving both diagnosis and therapy. This study focused on identifying a robust macrophage-related signature (MacroSig) for diagnosing CD, emphasizing the role of FPR1 in macrophage polarization and its implications in CD. Patients and Methods: Expression profiles from intestinal biopsies and macrophages of 1804 CD patients were retrieved from the Gene Expression Omnibus (GEO). CIBERSORTx, differential expression analysis, and weighted correlation network analysis were used to identify macrophage-related genes (MRGs). By unsupervised clustering, distinct clusters of CD were identified. Potential biomarkers were identified using four machine learning algorithms, leading to the establishment of MacroSig, which combines insights from 12 machine learning algorithms. Furthermore, the expression of FPR1 was verified in intestinal biopsies of CD patients and two murine experimental colitis models. Finally, we explored the role of FPR1 in macrophage polarization through single-cell analysis as well as through the study of RAW264.7 cells and peritoneal macrophages. Results: Two distinct clusters with differential levels of macrophage infiltration and inflammation were identified. The MacroSig, which included FPR1 and LILRB2, exhibited high diagnostic accuracy and outperformed existing biomarkers and signatures. Clinical analysis demonstrated a strong correlation of FPR1 with disease activity, endoscopic inflammation status, and response to infliximab treatment. The expression levels of FPR1 were validated in our CD cohort by immunohistochemistry and confirmed in two colitis mouse models. Single-cell analysis indicated that FPR1 is predominantly expressed in macrophages and monocytes.
In vitro studies demonstrated that FPR1 was upregulated in M1 macrophages, and its activation promoted M1 polarization. Conclusion: We developed a promising diagnostic signature for CD, and targeting FPR1 to modulate macrophage polarization may represent a novel therapeutic strategy.
Affiliation(s)
- Chenglin Ye
- Department of Pathology, Renmin Hospital of Wuhan University, Wuhan, Hubei, People’s Republic of China
- Sizhe Zhu
- Department of Otolaryngology-Head and Neck Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Sciences and Technology, Wuhan, Hubei, People’s Republic of China
- Jingping Yuan
- Department of Pathology, Renmin Hospital of Wuhan University, Wuhan, Hubei, People’s Republic of China
- Xiuxue Yuan
- Medical College of Wuhan University of Science and Technology, Wuhan, Hubei, People’s Republic of China
6
Modlin IM, Kidd M, Drozdov IA, Boegemann M, Bodei L, Kunikowska J, Malczewska A, Bernemann C, Koduru SV, Rahbar K. Development of a multigenomic liquid biopsy (PROSTest) for prostate cancer in whole blood. Prostate 2024; 84:850-865. [PMID: 38571290 DOI: 10.1002/pros.24704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Revised: 03/04/2024] [Accepted: 03/25/2024] [Indexed: 04/05/2024]
Abstract
INTRODUCTION We describe the development of a molecular assay from publicly available tumor tissue mRNA databases using machine learning and present preliminary evidence of functionality as a diagnostic and monitoring tool for prostate cancer (PCa) in whole blood. MATERIALS AND METHODS We assessed 1055 PCas (public microarray data sets) to identify putative mRNA biomarkers. Specificity was confirmed against 32 different solid and hematological cancers from The Cancer Genome Atlas (n = 10,990). This defined a 27-gene panel which was validated by qPCR in 50 histologically confirmed PCa surgical specimens and matched blood. An ensemble classifier (Random Forest, Support Vector Machines, XGBoost) was trained in age-matched PCas (n = 294), and in 72 controls and 64 BPH. Classifier performance was validated in two independent sets (n = 263 PCas; n = 99 controls). We assessed the panel as a postoperative disease monitor in a radical prostatectomy cohort (RPC: n = 47). RESULTS A PCa-specific 27-gene panel was identified. Matched blood and tumor gene expression levels were concordant (r = 0.72, p < 0.0001). The ensemble classifier ("PROSTest") was scaled 0%-100% and the industry-standard operating point of ≥50% used to define a PCa. Using this, the PROSTest exhibited an 85% sensitivity and 95% specificity for PCa versus controls. In two independent sets, the metrics were 92%-95% sensitivity and 100% specificity. In the RPCs (n = 47), PROSTest scores decreased from 72% ± 7% to 33% ± 16% (p < 0.0001, Mann-Whitney test). PROSTest was 26% ± 8% in 37 with normal postoperative PSA levels (<0.1 ng/mL). In 10 with elevated postoperative PSA, PROSTest was 60% ± 4%. CONCLUSION A 27-gene whole blood signature for PCa is concordant with tissue mRNA levels. Measuring blood expression provides a minimally invasive genomic tool that may facilitate prostate cancer management.
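The scoring and cutoff logic described in this abstract can be sketched as follows. The abstract does not say how the three member classifiers are combined, so averaging their predicted probabilities is an assumption, and all function names and example values are hypothetical. Only the 0%-100% scale and the ≥50% operating point come from the abstract.

```python
from typing import Sequence, Tuple


def ensemble_score(member_probs: Sequence[float]) -> float:
    """Combine member probabilities into a 0-100 score.

    Assumption: an unweighted mean; the published combination rule
    is not given in the abstract.
    """
    if not member_probs:
        raise ValueError("need at least one member probability")
    return 100.0 * sum(member_probs) / len(member_probs)


def is_positive(score: float, cutoff: float = 50.0) -> bool:
    """Apply the operating point: scores at or above the cutoff are positive."""
    return score >= cutoff


def sensitivity_specificity(scores: Sequence[float],
                            truths: Sequence[int],
                            cutoff: float = 50.0) -> Tuple[float, float]:
    """Sensitivity and specificity at the cutoff.

    Assumes both classes are present (truth 1 = cancer, 0 = control).
    """
    calls = [is_positive(s, cutoff) for s in scores]
    tp = sum(1 for c, t in zip(calls, truths) if t == 1 and c)
    fn = sum(1 for c, t in zip(calls, truths) if t == 1 and not c)
    tn = sum(1 for c, t in zip(calls, truths) if t == 0 and not c)
    fp = sum(1 for c, t in zip(calls, truths) if t == 0 and c)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical probabilities from three member classifiers for one sample.
score = ensemble_score([0.9, 0.8, 0.7])
call = is_positive(score)
```

Reporting both metrics at a fixed cutoff, as the abstract does, makes results comparable across the training and independent validation sets.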
Affiliation(s)
- Irvin M Modlin
- Yale University School of Medicine, New Haven, Connecticut, USA
- Mark Kidd
- Wren Laboratories LLC, Branford, Connecticut, USA
- Martin Boegemann
- Department of Urology, Münster University Hospital, Münster, Germany
- Lisa Bodei
- Department of Radiology, Molecular Imaging and Therapy Service, Memorial Sloan Kettering Cancer Center, New York, New York, USA
- Jolanta Kunikowska
- Department of Nuclear Medicine, Medical University of Warsaw, Warsaw, Poland
- Anna Malczewska
- Department of Endocrinology, Medical University of Silesia, Katowice, Poland
- Kambiz Rahbar
- Department of Nuclear Medicine, Münster University Hospital, Münster, Germany
7
Tang C, Sun Q, Zeng X, Yang X, Liu F, Zhao J, Shen Y, Liu B, Wen J, Li Y. Cell-type specific inference from bulk RNA-sequencing data by integrating single cell reference profiles via EPIC-unmix. bioRxiv 2024:2024.05.23.595514. [PMID: 38826297 PMCID: PMC11142188 DOI: 10.1101/2024.05.23.595514] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]
Abstract
Cell type specific (CTS) analysis is essential to reveal biological insights obscured in bulk tissue data. However, single-cell (sc) or single-nuclei (sn) resolution data are still cost-prohibitive for large-scale samples. Thus, computational methods to perform deconvolution from bulk tissue data are highly valuable. We here present EPIC-unmix, a novel two-step empirical Bayesian method integrating reference sc/sn RNA-seq data and bulk RNA-seq data from target samples to enhance the accuracy of CTS inference. We demonstrate through comprehensive simulations across three tissues that EPIC-unmix achieved 4.6%-109.8% higher accuracy compared to alternative methods. By applying EPIC-unmix to human bulk brain RNA-seq data from the ROSMAP and MSBB cohorts, we identified multiple genes differentially expressed between Alzheimer's disease (AD) cases and controls in a CTS manner, including 57.4% novel genes not identified using sc/snRNA-seq data of similar sample size, indicating the power of our in-silico approach. Among the 6%-69% overlapping genes, 83%-100% are in a consistent direction with those from sc/snRNA-seq data, supporting the reliability of our findings. CTS expression profiles inferred by EPIC-unmix similarly empower CTS eQTL analysis. Among the novel eQTLs, we highlight a microglia eQTL for the AD risk gene AP3B2, obscured in bulk and missed by sc/snRNA-seq based eQTL analysis. The variant resides in a microglia-specific cCRE, forming a chromatin loop with the AP3B2 promoter region in microglia. Taken together, we believe EPIC-unmix will be a valuable tool to enable more powerful CTS analysis.
Affiliation(s)
- Chenwei Tang
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Quan Sun
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Xinyue Zeng
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Xiaoyu Yang
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Fei Liu
- Department of Pharmacy and Pharmaceutical Sciences, Faculty of Science, National University of Singapore, Singapore
- Jinying Zhao
- Department of Epidemiology, College of Public Health & Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA; Center for Genetic Epidemiology and Bioinformatics, University of Florida, Gainesville, FL, USA
- Yin Shen
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Bixiang Liu
- Department of Pharmacy and Pharmaceutical Sciences, Faculty of Science, National University of Singapore, Singapore
- Department of Biomedical Informatics, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Jia Wen
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC
- Yun Li
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC
8
Osama S, Ali M, Ali AA, Shaban H. Gene selection and tumor identification based on a hybrid of the multi-filter embedded recursive mountain gazelle algorithm. Comput Biol Med 2023; 167:107674. [PMID: 37976816 DOI: 10.1016/j.compbiomed.2023.107674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Revised: 10/09/2023] [Accepted: 11/06/2023] [Indexed: 11/19/2023]
Abstract
Microarray gene expression data are useful for identifying gene expression patterns associated with cancer outcomes; however, their high dimensionality makes it difficult to extract meaningful information and accurately classify tumors. Hence, developing effective methods for reducing dimensionality while preserving relevant information is a crucial task. Hybrid-based gene selection methods are widely proposed in the gene expression analysis domain and can still be enhanced in terms of efficiency and reliability. This study proposes a new hybrid-based gene selection method, called multi-filter embedded mountain gazelle optimizer (MUL-MGO), which utilizes two filters and an embedded method to remove irrelevant genes, followed by selecting the most relevant genes using the recently developed MGO algorithm. To the best of our knowledge, this is the first work to exploit MGO as a gene or feature selection method. A new version of MGO, called recursive mountain gazelle optimizer (RMGO), is also developed; it applies the MGO algorithm recursively to avoid local optima, minimize the search space, and obtain a minimum gene count without decreasing the classifier's performance. The proposed RMGO is used to develop a new hybrid gene selection method employing the same kind of filters and embedded methods as MUL-MGO, but with the recursive version of the MGO algorithm. The resulting method is called multi-filter embedded recursive mountain gazelle optimizer (MUL-RMGO). Several classifiers are used for cancer classification. Accordingly, several experimental studies are performed on eight microarray gene expression datasets to demonstrate the proficiencies of the MUL-MGO and MUL-RMGO methods. The experimental findings indicate the efficiency and productivity of the suggested MUL-MGO and MUL-RMGO methods for gene selection. The methods outperform cutting-edge methods in the literature, with MUL-RMGO exceeding MUL-MGO in terms of accuracy and selected gene count.
Affiliation(s)
- Sarah Osama
- Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Moatez Ali
- Department of Internal Medicine, St. Barnabas Hospital, NY, USA.
- Abdelmgeid A Ali
- Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
- Hassan Shaban
- Computer Science Department, Faculty of Computers and Information, Minia University, Minia, Egypt.
9
Jiang J, van Ertvelde J, Ertaylan G, Peeters R, Jennen D, de Kok TM, Vinken M. Unraveling the mechanisms underlying drug-induced cholestatic liver injury: identifying key genes using machine learning techniques on human in vitro data sets. Arch Toxicol 2023; 97:2969-2981. [PMID: 37603094 PMCID: PMC10504391 DOI: 10.1007/s00204-023-03583-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/30/2023] [Accepted: 08/10/2023] [Indexed: 08/22/2023]
Abstract
Drug-induced intrahepatic cholestasis (DIC) is a main type of hepatic toxicity that is challenging to predict in early drug development stages. Preclinical animal studies often fail to detect DIC in humans. In vitro toxicogenomics assays using human liver cells have become a practical approach to predict human-relevant DIC. The present study was set up to identify transcriptomic signatures of DIC by applying machine learning algorithms to the Open TG-GATEs database. A total of nine DIC compounds and nine non-DIC compounds were selected, and supervised classification algorithms were applied to develop prediction models using differentially expressed features. Feature selection techniques identified 13 genes that achieved optimal prediction performance using logistic regression combined with a sequential backward selection method. The internal validation of the best-performing model showed accuracy of 0.958, sensitivity of 0.941, specificity of 0.978, and F1-score of 0.956. Applying the model to an external validation set resulted in an average prediction accuracy of 0.71. The identified genes were mechanistically linked to the adverse outcome pathway network of DIC, providing insights into cellular and molecular processes during response to chemical toxicity. Our findings provide valuable insights into toxicological responses and enhance the predictive accuracy of DIC prediction, thereby advancing the application of transcriptome profiling in designing new approach methodologies for hazard identification.
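The sequential backward selection mentioned in this abstract can be sketched generically as below. This is not the authors' implementation: the scoring function is a hypothetical stand-in for what would, in their setting, be the validated performance of a logistic regression model on the candidate gene subset, and the gene names are placeholders.

```python
from typing import Callable, List, Sequence


def sequential_backward_selection(features: Sequence[str],
                                  score: Callable[[List[str]], float],
                                  n_keep: int) -> List[str]:
    """Greedy backward elimination.

    Start from all features and repeatedly drop the single feature whose
    removal leaves the highest-scoring subset, until n_keep remain.
    """
    current = list(features)
    if not 1 <= n_keep <= len(current):
        raise ValueError("n_keep must be between 1 and the number of features")
    while len(current) > n_keep:
        # One candidate subset per feature that could be dropped.
        candidates = [[f for f in current if f != drop] for drop in current]
        current = max(candidates, key=score)
    return current


# Hypothetical scoring function: rewards two "informative" genes and
# applies a small penalty per gene kept. A real score would come from
# cross-validating a classifier on the subset.
INFORMATIVE = {"g1", "g3"}


def toy_score(subset: List[str]) -> float:
    return sum(1 for f in subset if f in INFORMATIVE) - 0.01 * len(subset)


kept = sequential_backward_selection(["g1", "g2", "g3", "g4"], toy_score, n_keep=2)
```

Because the search is greedy, it evaluates on the order of n² subsets rather than all 2ⁿ, at the cost of possibly missing the globally best subset.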
Affiliation(s)
- Jian Jiang
- Entity of In Vitro Toxicology and Dermato‑Cosmetology, Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090, Brussels, Belgium.
- Jonas van Ertvelde
- Entity of In Vitro Toxicology and Dermato‑Cosmetology, Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090, Brussels, Belgium
- Gökhan Ertaylan
- Vlaamse Instelling voor Technologisch Onderzoek (VITO) NV, Health, Boeretang 200, 2400, Mol, Belgium
- Ralf Peeters
- Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, The Netherlands
- Department of Advanced Computing Sciences, Maastricht University, Maastricht, The Netherlands
- Danyel Jennen
- Department of Toxicogenomics, GROW School for Oncology and Reproduction, Maastricht University, Maastricht, The Netherlands
- Theo M de Kok
- Maastricht Centre for Systems Biology (MaCSBio), Maastricht University, Maastricht, The Netherlands
- Department of Toxicogenomics, GROW School for Oncology and Reproduction, Maastricht University, Maastricht, The Netherlands
- Mathieu Vinken
- Entity of In Vitro Toxicology and Dermato‑Cosmetology, Department of Pharmaceutical and Pharmacological Sciences, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090, Brussels, Belgium.
10
Padwal MK, Basu S, Basu B. Application of Machine Learning in Predicting Hepatic Metastasis or Primary Site in Gastroenteropancreatic Neuroendocrine Tumors. Curr Oncol 2023; 30:9244-9261. [PMID: 37887568 PMCID: PMC10605255 DOI: 10.3390/curroncol30100668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Revised: 10/16/2023] [Accepted: 10/16/2023] [Indexed: 10/28/2023] Open
Abstract
Gastroenteropancreatic neuroendocrine tumors (GEP-NETs) account for 80% of gastroenteropancreatic neuroendocrine neoplasms (GEP-NENs). GEP-NETs are well-differentiated tumors, highly heterogeneous in biology and origin, and are often diagnosed at the metastatic stage. Diagnosis is commonly through clinical symptoms, histopathology, and PET-CT imaging, while molecular markers for metastasis and the primary site are unknown. Here, we report the identification of multi-gene signatures for hepatic metastasis and primary sites through analyses on RNA-SEQ datasets of pancreatic and small intestinal NETs tissue samples. Relevant gene features, identified from the normalized RNA-SEQ data using the mRMRe algorithm, were used to develop seven Machine Learning models (LDA, RF, CART, k-NN, SVM, XGBOOST, GBM). Two multi-gene random forest (RF) models classified primary and metastatic samples with 100% accuracy in training and test cohorts and >90% accuracy in an independent validation cohort. Similarly, three multi-gene RF models identified the pancreas or small intestine as the primary site with 100% accuracy in training and test cohorts, and >95% accuracy in an independent cohort. Multi-label models for concurrent prediction of hepatic metastasis and primary site returned >98.42% and >87.42% accuracies on training and test cohorts, respectively. A robust molecular signature to predict liver metastasis or the primary site for GEP-NETs is reported for the first time and could complement the clinical management of GEP-NETs.
Affiliation(s)
- Mahesh Kumar Padwal
- Molecular Biology Division, Bhabha Atomic Research Centre, Mumbai 400085, India
- Homi Bhabha National Institute, Mumbai 400094, India
- Sandip Basu
- Homi Bhabha National Institute, Mumbai 400094, India
- Radiation Medicine Centre, Bhabha Atomic Research Centre, Tata Memorial Hospital Annexe, Mumbai 400012, India
- Bhakti Basu
- Molecular Biology Division, Bhabha Atomic Research Centre, Mumbai 400085, India
- Homi Bhabha National Institute, Mumbai 400094, India
| |
Collapse
|
11
|
Mizrahi L, Choudhary A, Ofer P, Goldberg G, Milanesi E, Kelsoe JR, Gurwitz D, Alda M, Gage FH, Stern S. Immunoglobulin genes expressed in lymphoblastoid cell lines discern and predict lithium response in bipolar disorder patients. Mol Psychiatry 2023; 28:4280-4293. [PMID: 37488168 PMCID: PMC10827667 DOI: 10.1038/s41380-023-02183-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2022] [Revised: 07/03/2023] [Accepted: 07/06/2023] [Indexed: 07/26/2023]
Abstract
Bipolar disorder (BD) is a neuropsychiatric mood disorder manifested by recurrent episodes of mania and depression. More than half of BD patients are non-responsive to lithium, the first-line treatment drug, complicating BD clinical management. Given its unknown etiology, it is pertinent to understand the genetic signatures that lead to variability in lithium response. From the lymphoblastoid cell lines (LCLs) of 10 controls and 19 BD patients, we discovered a set of differentially expressed genes (DEGs), belonging mainly to the immunoglobulin gene family, that can be used as potential biomarkers to diagnose and treat BD. Importantly, we trained machine learning algorithms on our datasets that predicted the lithium response of BD subtypes with minimal errors, even when used on a different cohort of 24 BD patients acquired by a different laboratory. This proves the scalability of our methodology for predicting lithium response in BD and for a prompt and suitable decision on therapeutic interventions.
Affiliation(s)
- Liron Mizrahi
  - Sagol Department of Neurobiology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838, Israel
- Ashwani Choudhary
  - Sagol Department of Neurobiology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838, Israel
- Polina Ofer
  - Sagol Department of Neurobiology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838, Israel
- Gabriela Goldberg
  - Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA
- Elena Milanesi
  - Victor Babes National Institute of Pathology, Bucharest, 050096, Romania
- John R Kelsoe
  - Department of Psychiatry, University of California, San Diego, La Jolla, CA, 92093, USA
- David Gurwitz
  - Department of Human Molecular Genetics and Biochemistry, Faculty of Medicine, Tel Aviv University, Tel Aviv, 69978, Israel
- Martin Alda
  - Department of Psychiatry, Dalhousie University, Halifax, NS, B3H 2E2, Canada
- Fred H Gage
  - Laboratory of Genetics, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA
- Shani Stern
  - Sagol Department of Neurobiology, Faculty of Natural Sciences, University of Haifa, Haifa, 3498838, Israel
|
12
|
Nazari L, Aslan MF, Sabanci K, Ropelewska E. Integrated transcriptomic meta-analysis and comparative artificial intelligence models in maize under biotic stress. Sci Rep 2023; 13:15899. [PMID: 37741865 PMCID: PMC10517993 DOI: 10.1038/s41598-023-42984-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 09/17/2023] [Indexed: 09/25/2023] Open
Abstract
Biotic stress imposed by pathogens, including fungi, bacteria, and viruses, can cause heavy damage leading to yield reduction in maize. Therefore, the identification of resistance genes paves the way to the development of disease-resistant cultivars and is essential for reliable production in maize. Identifying different gene expression patterns can deepen our understanding of maize resistance to disease. This study applies machine learning and deep learning to classify genes expressed under normal and biotic stress conditions in maize. The machine learning algorithms used are Naive Bayes (NB), K-Nearest Neighbor (KNN), Ensemble, Support Vector Machine (SVM), and Decision Tree (DT). A Bidirectional Long Short Term Memory (BiLSTM) based network with Recurrent Neural Network (RNN) architecture is proposed for gene classification with deep learning. To increase the performance of these algorithms, features are selected from the raw gene features with the Relief feature selection algorithm. The results indicated that BiLSTM outperformed the other machine learning algorithms. Some top genes ((S)-beta-macrocarpene synthase, zealexin A1 synthase, polyphenol oxidase I, chloroplastic, pathogenesis-related protein 10, CHY1, chitinase chem 5, barwin, and uncharacterized LOC100273479) were shown to be differentially upregulated under biotic stress conditions.
Affiliation(s)
- Leyla Nazari
  - Crop and Horticultural Science Research Department, Fars Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Shiraz, Iran
- Muhammet Fatih Aslan
  - Electrical and Electronics Engineering, Karamanoglu Mehmetbey University, Karaman, Turkey
- Kadir Sabanci
  - Electrical and Electronics Engineering, Karamanoglu Mehmetbey University, Karaman, Turkey
- Ewa Ropelewska
  - Fruit and Vegetable Storage and Processing Department, The National Institute of Horticultural Research, Skierniewice, Poland
|
13
|
Mohseni N, Ghaniee Zarich M, Afshar S, Hosseini M. Identification of Novel Biomarkers for Response to Preoperative Chemoradiation in Locally Advanced Rectal Cancer with Genetic Algorithm-Based Gene Selection. J Gastrointest Cancer 2023; 54:937-950. [PMID: 36534304 DOI: 10.1007/s12029-022-00873-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/05/2022] [Indexed: 12/23/2022]
Abstract
BACKGROUND The conventional treatment for patients with locally advanced colorectal tumors is preoperative chemo-radiotherapy (PCRT) preceding surgery. This treatment strategy has some long-term side effects, and some patients do not respond to it. Therefore, an evaluation of biomarkers that may help predict patients' response to PCRT is essential. METHODS We used a genetic algorithm to search the space of possible feature combinations and choose subsets of genes that would yield convenient performance in differentiating PCRT responders from non-responders, using a logistic regression model as our classifier. RESULTS We developed two gene signatures; first, to achieve the maximum prediction accuracy, the algorithm yielded 39 genes, and then, aiming to reduce the number of features as much as possible (while maintaining acceptable performance), a 5-gene signature was chosen. The performance of the two gene signatures was (accuracy = 0.97 and 0.81, sensitivity = 0.96 and 0.83, and specificity = 86 and 0.77) using a logistic regression classifier. By analyzing the bias and variance decomposition of the model error, we further investigated the involved genes by discovering and validating another 28-gene signature, which possibly points towards two different sub-systems involved in the response of the patients to treatment. CONCLUSIONS Using a genetic algorithm as our gene selection method, we have identified two groups of genes that can differentiate PCRT responders from non-responders in patients of the studied dataset with considerable performance. IMPACT After passing standard requirements, our gene signatures may be applicable as a robust and effective PCRT response prediction tool for colorectal cancer patients in clinical settings and may also help future studies that investigate the involved pathways gain a clearer picture for the course of their research.
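As a reading aid for the reported figures, the sketch below shows how accuracy, sensitivity, and specificity are conventionally derived from a binary confusion matrix. The function name and the counts are hypothetical and are not taken from the study; the point is only that each metric is a proportion between 0 and 1.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity, and specificity from confusion-matrix counts.

    tp/fn count responders predicted correctly/incorrectly;
    tn/fp count non-responders predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


# Hypothetical counts, for illustration only (not data from the paper).
metrics = classification_metrics(tp=24, fp=1, tn=6, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With heavily imbalanced classes, accuracy can stay high while specificity drops, which is one reason abstracts of this kind report all three values.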
Affiliation(s)
- Nima Mohseni
  - Department of Biology, Faculty of Science, Lund University, Skåne, Sweden
- Saeid Afshar
  - Research Center for Molecular Medicine, Hamadan University of Medical Sciences, Hamadan, Iran
|
14
|
Li M, Zhu R, Li G, Yin S, Zeng L, Bai Z, Chen J, Jiang B, Li L, Wu Y. Point-of-care testing for cerebral edema types based on symmetric cancellation near-field coupling phase shift and support vector machine. Biomed Eng Online 2023; 22:80. [PMID: 37582824 PMCID: PMC10428563 DOI: 10.1186/s12938-023-01145-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2022] [Accepted: 08/07/2023] [Indexed: 08/17/2023] Open
Abstract
BACKGROUND Cerebral edema is an extremely common secondary condition after stroke. Point-of-care testing for cerebral edema types has important clinical significance for precise management to prevent poor prognosis. Nevertheless, there is no fully accepted bedside testing method for this purpose. METHODS A symmetric cancellation near-field coupling phase shift (NFCPS) monitoring system was established based on the symmetry of the left and right hemispheres and the fact that unilateral lesions do not affect the healthy hemisphere. To explore the feasibility of using this system to reflect the occurrence and development of cerebral edema, 13 rabbits, divided into an experimental group (n = 8) and a control group (n = 5), underwent 24-h continuous NFCPS monitoring. After time difference offset and feature band averaging processing, the changing trend of NFCPS at the stages dominated by cytotoxic edema (CE) and vasogenic edema (VE), respectively, was analyzed. Furthermore, the features under the different time windows were extracted. Then, a discriminative model of cerebral edema types based on support vector machines (SVM) was established, and the performance of multiple feature combinations was compared. RESULTS The NFCPS monitoring outcomes of the experimental group, which underwent focal ischemia modeling by thrombin injection, showed a trend of first decreasing and then increasing, reaching the lowest value of -35.05° at the 6th hour. Those of the control group did not display an obvious upward or downward trend and only fluctuated around the initial value, with an average change of -0.12°. Furthermore, four features under the 1-h and 2-h time windows were extracted. Based on the discriminative model of cerebral edema types, the classification accuracy of the 1-h window is higher than 90% and the specificity is close to 1, which is almost the same as the performance of the 2-h window. CONCLUSION This study proves the feasibility of NFCPS technology combined with SVM to distinguish cerebral edema types in a short time, which promises to become a new solution for immediate and precise management of dehydration therapy after ischemic stroke.
Affiliation(s)
- Mingyan Li
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
  - College of Artificial Intelligence, Chongqing University of Technology, Chongqing, 401135 China
- Rui Zhu
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
- Gen Li
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
  - Department of Neurosurgery, Southwest Hospital, Army Medical University, Chongqing, 400038 China
- Shengtong Yin
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
- Lingxi Zeng
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
- Zelin Bai
  - College of Biomedical Engineering, Army Medical University, Chongqing, 400038 China
- Jingbo Chen
  - College of Biomedical Engineering, Army Medical University, Chongqing, 400038 China
- Bin Jiang
  - College of Artificial Intelligence, Chongqing University of Technology, Chongqing, 401135 China
- Lihong Li
  - College of Artificial Intelligence, Chongqing University of Technology, Chongqing, 401135 China
- Yu Wu
  - School of Pharmacy and Bioengineering, Chongqing University of Technology, Chongqing, 400054 China
|
15
|
Zinati Z, Nazari L. Deciphering the molecular basis of abiotic stress response in cucumber (Cucumis sativus L.) using RNA-Seq meta-analysis, systems biology, and machine learning approaches. Sci Rep 2023; 13:12942. [PMID: 37558755 PMCID: PMC10412635 DOI: 10.1038/s41598-023-40189-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2023] [Accepted: 08/06/2023] [Indexed: 08/11/2023] Open
Abstract
Abiotic stress in cucumber (Cucumis sativus L.) may trigger distinct transcriptome responses, resulting in significant yield loss. More insight into the molecular underpinnings of the stress response can be gained by combining RNA-Seq meta-analysis with systems biology and machine learning. This can help pinpoint possible targets for engineering abiotic tolerance by revealing functional modules and key genes essential for the stress response. Therefore, to investigate the regulatory mechanism and key genes, a combination of these approaches was utilized in cucumber subjected to various abiotic stresses. Three significant abiotic stress-related modules were identified by gene co-expression network analysis (WGCNA). Three hub genes (RPL18, δ-COP, and EXLA2), ten transcription factors (TFs), one transcription regulator, and 12 protein kinases (PKs) were introduced as key genes. The results suggest that the identified PKs probably govern the coordination of cellular responses to abiotic stress in cucumber. Moreover, the C2H2 TF family may play a significant role in cucumber response to abiotic stress. Several C2H2 TF target stress-related genes were identified through co-expression and promoter analyses. Evaluation of the key identified genes using Random Forest, with an area under the curve of ROC (AUC) of 0.974 and an accuracy rate of 88.5%, demonstrates their prominent contributions in the cucumber response to abiotic stresses. These findings provide novel insights into the regulatory mechanism underlying abiotic stress response in cucumber and pave the way for cucumber genetic engineering toward improving tolerance ability under abiotic stress.
Affiliation(s)
- Zahra Zinati
  - Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Shiraz, Iran
- Leyla Nazari
  - Crop and Horticultural Science Research Department, Fars Agricultural and Natural Resources Research and Education Center, Agricultural Research, Education and Extension Organization (AREEO), Shiraz, Iran
|
16
|
Mallik S, Seth S, Si A, Bhadra T, Zhao Z. Optimal ranking and directional signature classification using the integral strategy of multi-objective optimization-based association rule mining of multi-omics data. Front Bioinform 2023; 3:1182176. [PMID: 37576714 PMCID: PMC10415913 DOI: 10.3389/fbinf.2023.1182176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/19/2023] [Indexed: 08/15/2023] Open
Abstract
Introduction: Association rule mining (ARM) is a powerful tool for exploring the informative relationships among multiple items (genes) in any dataset. The main problem of ARM is that it generates many rules with different rule-informative values, which makes it hard for the user to choose the effective rules. In addition, few works have addressed the integration of multiple biological datasets and variable cutoff values in ARM. Methods: To solve these problems, we developed a novel framework, MOOVARM (multi-objective optimized variable cutoff-based association rule mining), for multi-omics profiles. Results: We identified the positive ideal solution (PIS), which maximized the profit and minimized the loss, and the negative ideal solution (NIS), which minimized the profit and maximized the loss, for all gene sets (item sets) belonging to each extracted rule. Thereafter, we computed the distance (d+) from PIS and the distance (d-) from NIS for each gene set or product. These two distances played an important role in determining the optimized associations among various pairs of genes in the multi-omics dataset. We then globally estimated the relative closeness to PIS for ranking the gene sets. When the relative closeness score of a rule is greater than or equal to the pre-defined threshold value, the rule can be considered a final resultant rule. Moreover, MOOVARM evaluated the relative score of the rule based on the status of all genes instead of individual genes. Conclusions: MOOVARM produced the final rank of the extracted (multi-objective optimized) rules of correlated genes, which had better disease classification than the state-of-the-art algorithms on gene signature identification.
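The ranking step described above (distance to PIS, distance to NIS, relative closeness, threshold) follows the general shape of a TOPSIS-style ranking. The sketch below illustrates only that general idea; it is not the MOOVARM implementation. The rule names, the two criteria (one "profit", one "loss"), the omission of criterion normalization and weighting, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def relative_closeness(scores, benefit):
    """Rank items by relative closeness to the positive ideal solution.

    scores:  dict mapping item name -> list of criterion values
    benefit: list of bools, True if a criterion should be maximized (profit),
             False if it should be minimized (loss)
    """
    columns = list(zip(*scores.values()))
    # Positive ideal: best value per criterion; negative ideal: worst value.
    pis = [max(c) if b else min(c) for c, b in zip(columns, benefit)]
    nis = [min(c) if b else max(c) for c, b in zip(columns, benefit)]
    closeness = {}
    for name, row in scores.items():
        d_plus = math.dist(row, pis)    # Euclidean distance from PIS
        d_minus = math.dist(row, nis)   # Euclidean distance from NIS
        denom = d_plus + d_minus
        closeness[name] = d_minus / denom if denom else 0.0
    return closeness


# Hypothetical rules scored on (profit, loss); values are illustrative only.
rules = {
    "geneA->geneB": [0.9, 0.1],
    "geneC->geneD": [0.2, 0.8],
    "geneE->geneF": [0.6, 0.4],
}
closeness = relative_closeness(rules, benefit=[True, False])

threshold = 0.5  # assumed cutoff, not taken from the paper
selected = sorted(
    (name for name, c in closeness.items() if c >= threshold),
    key=lambda name: closeness[name],
    reverse=True,
)
print(selected)
```

A closeness of 1 means the rule coincides with the positive ideal and 0 means it coincides with the negative ideal, so a single threshold on this score both ranks and filters the rules.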
Affiliation(s)
- Saurav Mallik
  - Environmental Health, Harvard T. H. Chan School of Public Health, Boston, MA, United States
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
- Soumita Seth
  - Department of Computer Science and Engineering, Brainware University, Kolkata, India
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Amalendu Si
  - School of Information Technology, Maulana Abul Kalam Azad University of Technology, Haringhata, India
- Tapas Bhadra
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, United States
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, United States
|
17
|
Olatunji I, Cui F. Multimodal AI for prediction of distant metastasis in carcinoma patients. Front Bioinform 2023; 3:1131021. [PMID: 37228671 PMCID: PMC10203594 DOI: 10.3389/fbinf.2023.1131021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2022] [Accepted: 04/24/2023] [Indexed: 05/27/2023] Open
Abstract
Metastasis of cancer is directly related to death in almost all cases; however, much is yet to be understood about this process. Despite advancements in the available radiological investigation techniques, not all cases of Distant Metastasis (DM) are diagnosed at initial clinical presentation. Also, there are currently no standard biomarkers of metastasis. Early, accurate diagnosis of DM is nevertheless crucial for clinical decision making and for planning appropriate management strategies. Previous works have achieved little success in attempts to predict DM from clinical, genomic, radiology, or histopathology data alone. In this work we attempt a multimodal approach to predict the presence of DM in cancer patients by combining gene expression data, clinical data and histopathology images. We tested a novel combination of the Random Forest (RF) algorithm with an optimization technique for gene selection, and investigated whether gene expression patterns in the primary tissues of three cancer types with DM (Bladder Carcinoma, Pancreatic Adenocarcinoma, and Head and Neck Squamous Carcinoma) are similar or different. Gene expression biomarkers of DM identified by our proposed method outperformed Differentially Expressed Genes (DEGs) identified by the DESeq2 software package in the task of predicting presence or absence of DM. Genes involved in DM tend to be cancer type specific rather than general across all cancers. Our results also indicate that multimodal data is more predictive of metastasis than any of the three unimodal data types tested, and genomic data provides the highest contribution by a wide margin. The results re-emphasize the importance of the availability of sufficient image data when a weakly supervised training technique is used. Code is made available at: https://github.com/rit-cui-lab/Multimodal-AI-for-Prediction-of-Distant-Metastasis-in-Carcinoma-Patients.
|
18
|
Li W, Chi Y, Yu K, Xie W. A two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization. BMC Bioinformatics 2023; 24:130. [PMID: 37016297 PMCID: PMC10072044 DOI: 10.1186/s12859-023-05247-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 03/21/2023] [Indexed: 04/06/2023] Open
Abstract
BACKGROUND In the field of genomics and personalized medicine, finding biomarkers directly related to the diagnosis of specific diseases from high-throughput gene microarray data is a key issue. Feature selection technology can discover biomarkers with disease classification information. RESULTS We use support vector machines as classifiers and use the five-fold cross-validation average classification accuracy, recall, precision and F1 score as evaluation metrics to evaluate the identified biomarkers. Experimental results show classification accuracy above 0.93, recall above 0.92, precision above 0.91, and F1 score above 0.94 on eight microarray datasets. METHOD This paper proposes a two-stage hybrid biomarker selection method based on ensemble filter and binary differential evolution incorporating binary African vultures optimization (EF-BDBA), which can effectively reduce the dimension of microarray data and obtain optimal biomarkers. In the first stage, we propose an ensemble filter feature selection method. The method combines an improved fast correlation-based filter algorithm with Fisher score, so that obviously redundant and irrelevant features can be filtered out to initially reduce the dimensionality of the microarray data. In the second stage, the optimal feature subset is selected using an improved binary differential evolution incorporating an improved binary African vultures optimization algorithm. The African vultures optimization algorithm has excellent global optimization ability. It has not been systematically applied to feature selection problems, especially for gene microarray data. We combine it with a differential evolution algorithm to improve population diversity. CONCLUSION Compared with traditional feature selection methods and advanced hybrid methods, the proposed method achieves higher classification accuracy and identifies excellent biomarkers while retaining fewer features. The experimental results demonstrate the effectiveness and advancement of our proposed algorithmic model.
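The first-stage filter mentions the Fisher score. As a minimal sketch of that one ingredient only (not the EF-BDBA method, and without the correlation-based filter or the evolutionary search), the code below scores each feature by the ratio of between-class to within-class variance and ranks features by that score. The function name and the toy expression matrix are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Compute a Fisher score per feature.

    X: list of samples, each a list of feature values
    y: list of class labels, one per sample
    Score = sum_k n_k * (mean_k - mean)^2 / sum_k n_k * var_k
    """
    classes = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0
        within = 0.0
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        # A feature with zero within-class variance separates classes perfectly.
        scores.append(between / within if within > 0 else float("inf"))
    return scores


# Toy data (illustrative only): feature 0 separates the classes, feature 1 barely does.
X = [[1.0, 5.0], [1.2, 3.0], [3.0, 4.0], [3.2, 6.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

A filter stage of this kind would typically keep only the top-ranked features before a more expensive wrapper search runs on the reduced set.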
Affiliation(s)
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image (MIIC), Northeastern University, Ministry of Education, Shenyang, China
- Yuhuan Chi
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Kun Yu
  - School of Biomedical and Information Engineering, Northeastern University, Shenyang, China
- Weidong Xie
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
|
19
|
Omar M, Dinalankara W, Mulder L, Coady T, Zanettini C, Imada EL, Younes L, Geman D, Marchionni L. Using biological constraints to improve prediction in precision oncology. iScience 2023; 26:106108. [PMID: 36852282 PMCID: PMC9958363 DOI: 10.1016/j.isci.2023.106108] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/04/2022] [Revised: 12/20/2022] [Accepted: 01/28/2023] [Indexed: 02/05/2023] Open
Abstract
Many gene signatures have been developed by applying machine learning (ML) to omics profiles; however, their clinical utility is often hindered by limited interpretability and unstable performance. Here, we show the importance of embedding prior biological knowledge in the decision rules yielded by ML approaches to build robust classifiers. We tested this by applying different ML algorithms to gene expression data to predict three difficult cancer phenotypes: bladder cancer progression to muscle-invasive disease, response to neoadjuvant chemotherapy in triple-negative breast cancer, and prostate cancer metastatic progression. We developed two sets of classifiers: mechanistic, by restricting the training to features capturing specific biological mechanisms; and agnostic, in which the training did not use any a priori biological information. Mechanistic models had a similar or better testing performance than their agnostic counterparts, with enhanced interpretability. Our findings support the use of biological constraints to develop robust gene signatures with high translational potential.
Affiliation(s)
- Mohamed Omar
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Wikum Dinalankara
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Lotte Mulder
  - Technical University Delft, 2628 CD Delft, the Netherlands
- Tendai Coady
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Claudio Zanettini
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Eddie Luidy Imada
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
- Laurent Younes
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Donald Geman
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218, USA
- Luigi Marchionni
  - Department of Pathology and Laboratory Medicine, Weill Cornell Medicine, New York, NY 10065, USA
|
20
|
Fu Y, Si A, Wei X, Lin X, Ma Y, Qiu H, Guo Z, Pan Y, Zhang Y, Kong X, Li S, Shi Y, Wu H. Combining a machine-learning derived 4-lncRNA signature with AFP and TNM stages in predicting early recurrence of hepatocellular carcinoma. BMC Genomics 2023; 24:89. [PMID: 36849926 PMCID: PMC9972730 DOI: 10.1186/s12864-023-09194-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/09/2023] [Accepted: 02/17/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND Nearly 70% of hepatocellular carcinoma (HCC) recurrence is early recurrence within 2 years after surgery. Long non-coding RNAs (lncRNAs) are intensively involved in HCC progression and serve as biomarkers for HCC prognosis. The aim of this study is to construct a lncRNA-based signature for predicting HCC early recurrence. METHODS RNA expression data and associated clinical information were accessed from The Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) database. Recurrence-associated differentially expressed lncRNAs (DELncs) were determined by three DEG methods and two survival analysis methods. DELncs involved in the signature were selected by three machine learning methods and multivariate Cox analysis. Additionally, the signature was validated in a cohort of HCC patients from an external source. To gain insight into the biological functions of this signature, gene set enrichment analyses, immune infiltration analyses, and immune and drug therapy prediction analyses were conducted. RESULTS A 4-lncRNA signature consisting of AC108463.1, AF131217.1, CMB9-22P13.1, and TMCC1-AS1 was constructed. Patients in the high-risk group showed a significantly higher early recurrence rate than those in the low-risk group. Combining the signature with AFP and TNM further improved the predictive performance for early HCC recurrence. Several molecular pathways and gene sets associated with HCC pathogenesis are enriched in the high-risk group. Antitumor immune cells, such as activated B cells, type 1 T helper cells, natural killer cells and effector memory CD8 T cells, are enriched in patients with low-risk HCCs. HCC patients in the low- and high-risk groups had differential sensitivities to various antitumor drugs. Finally, the predictive performance of this signature was validated in an external cohort of patients with HCC. CONCLUSION Combined with TNM and AFP, the 4-lncRNA signature presents excellent predictability of HCC early recurrence.
Collapse
Affiliation(s)
- Yi Fu
- grid.507037.60000 0004 1764 1277Shanghai Key Laboratory of Molecular Imaging, Zhoupu Hospital, Shanghai University of Medicine and Health Sciences, Shanghai, China ,grid.507037.60000 0004 1764 1277Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China ,grid.507037.60000 0004 1764 1277School of Medical Instruments, Shanghai University of Medicine and Health Sciences, Shanghai, China
| | - Anfeng Si
- grid.41156.370000 0001 2314 964XDepartment of Surgical Oncology, Jinling Hospital, Medical School of Nanjing University, Nanjing, China
| | - Xindong Wei
- Central Laboratory, Department of Liver Diseases, Shuguang Hospital, Shanghai University of Chinese Traditional Medicine, Shanghai, China (grid.412585.f; 0000 0004 0604 8558)
| | - Xinjie Lin
- Shanghai Key Laboratory of Molecular Imaging, Zhoupu Hospital, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
- Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
| | - Yujie Ma
- Shanghai Key Laboratory of Molecular Imaging, Zhoupu Hospital, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
- Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
| | - Huimin Qiu
- Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China (grid.267139.8; 0000 0000 9188 055X)
| | - Zhinan Guo
- Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China (grid.507037.6; 0000 0004 1764 1277)
- School of Kinesiology, Shanghai University of Sport, Shanghai, China (grid.412543.5; 0000 0001 0033 4148)
| | - Yong Pan
- Department of Infectious Disease, Zhoushan Hospital, Wenzhou Medical University, Zhoushan, China (grid.268099.c; 0000 0001 0348 3990)
| | - Yiru Zhang
- Department of Infectious Disease, Zhoushan Hospital, Wenzhou Medical University, Zhoushan, China (grid.268099.c; 0000 0001 0348 3990)
| | - Xiaoni Kong
- Central Laboratory, Department of Liver Diseases, Shuguang Hospital, Shanghai University of Chinese Traditional Medicine, Shanghai, China (grid.412585.f; 0000 0004 0604 8558)
| | - Shibo Li
- Department of Infectious Disease, Zhoushan Hospital, Wenzhou Medical University, Zhoushan, China
| | - Yanjun Shi
- Abdominal Transplantation Center, General Surgery, School of Medicine, Ruijin Hospital, Shanghai Jiao Tong University, Shanghai, China
| | - Hailong Wu
- Shanghai Key Laboratory of Molecular Imaging, Zhoupu Hospital, Shanghai University of Medicine and Health Sciences, Shanghai, China
- Collaborative Innovation Center for Biomedicines, Shanghai University of Medicine and Health Sciences, Shanghai, China
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
- School of Kinesiology, Shanghai University of Sport, Shanghai, China
| |
|
21
|
Identification of TRPC6 as a Novel Diagnostic Biomarker of PM-Induced Chronic Obstructive Pulmonary Disease Using Machine Learning Models. Genes (Basel) 2023; 14:284. [PMID: 36833211 PMCID: PMC9957274 DOI: 10.3390/genes14020284] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/06/2022] [Revised: 01/18/2023] [Accepted: 01/19/2023] [Indexed: 01/26/2023] Open
Abstract
Chronic obstructive pulmonary disease (COPD) was the third most prevalent cause of mortality worldwide in 2010; it results from a progressive and fatal deterioration of lung function caused by cigarette smoking and particulate matter (PM). It is therefore important to identify molecular biomarkers that can diagnose the COPD phenotype and guide therapy planning. To identify potential novel biomarkers of COPD, we first obtained the gene expression dataset GSE151052, which covers COPD and normal lung tissue, from the NCBI Gene Expression Omnibus (GEO). A total of 250 differentially expressed genes (DEGs) were investigated and analyzed using GEO2R, gene ontology (GO) functional annotation, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway identification. The GEO2R analysis revealed that TRPC6 was the sixth most highly expressed gene in patients with COPD. The GO analysis indicated that the upregulated DEGs were mainly concentrated in the plasma membrane, transcription, and DNA binding. The KEGG pathway analysis indicated that the upregulated DEGs were mainly involved in pathways related to cancer and axon guidance. TRPC6, one of the most abundant genes among the top 10 differentially expressed total RNAs (fold change ≥ 1.5) between the COPD and normal groups, was selected as a novel COPD biomarker based on the results of the GEO dataset and analysis using machine learning models. The upregulation of TRPC6 was verified by quantitative reverse transcription polymerase chain reaction in PM-stimulated RAW264.7 cells, which mimicked COPD conditions, compared to untreated RAW264.7 cells. In conclusion, our study suggests that TRPC6 can be regarded as a potential novel biomarker for COPD pathogenesis.
|
22
|
Li D, Liang H, Qin P, Wang J. A self-training subspace clustering algorithm based on adaptive confidence for gene expression data. Front Genet 2023; 14:1132370. [PMID: 37025450 PMCID: PMC10070828 DOI: 10.3389/fgene.2023.1132370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2022] [Accepted: 03/07/2023] [Indexed: 04/08/2023] Open
Abstract
Gene clustering is an important technique for identifying co-expressed gene groups from gene expression data, and it provides a powerful tool for investigating functional relationships of genes in biological processes. Self-training is an important semi-supervised learning method and has exhibited good performance on the gene clustering problem. However, the self-training process inevitably suffers from mislabeling, the accumulation of which degrades semi-supervised learning performance on gene expression data. To solve this problem, this paper proposes a self-training subspace clustering algorithm based on adaptive confidence for gene expression data (SSCAC), which combines the low-rank representation of gene expression data and adaptive adjustment of label confidence to better guide the partition of unlabeled data. The superiority of the proposed SSCAC algorithm is mainly reflected in the following aspects. 1) In order to improve the discriminative property of gene expression data, the low-rank representation with distance penalty is used to mine the potential subspace structure of the data. 2) Considering the problem of mislabeling in self-training, a semi-supervised clustering objective function with label confidence is proposed, and a self-training subspace clustering framework is constructed on this basis. 3) In order to mitigate the negative impact of mislabeled data, an adaptive adjustment strategy based on the gravitational search algorithm is proposed for label confidence. Compared with a variety of state-of-the-art unsupervised and semi-supervised learning algorithms, the SSCAC algorithm has demonstrated its superiority through extensive experiments on two benchmark gene expression datasets.
Affiliation(s)
- Dan Li
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Hongnan Liang
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Pan Qin
- Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning, China
- *Correspondence: Pan Qin; Jia Wang
| | - Jia Wang
- Department of Breast Surgery, The Second Hospital of Dalian Medical University, Dalian, Liaoning, China
- *Correspondence: Pan Qin; Jia Wang
| |
|
23
|
Parkinson E, Liberatore F, Watkins WJ, Andrews R, Edkins S, Hibbert J, Strunk T, Currie A, Ghazal P. Gene filtering strategies for machine learning guided biomarker discovery using neonatal sepsis RNA-seq data. Front Genet 2023; 14:1158352. [PMID: 37113992 PMCID: PMC10126415 DOI: 10.3389/fgene.2023.1158352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 03/29/2023] [Indexed: 04/29/2023] Open
Abstract
Machine learning (ML) algorithms are powerful tools that are increasingly being used for sepsis biomarker discovery in RNA-Seq data. RNA-Seq datasets contain multiple sources and types of noise (operator, technical and non-systematic) that may bias ML classification. Normalisation and independent gene filtering approaches described in RNA-Seq workflows account for some of this variability and are typically only targeted at differential expression analysis rather than ML applications. Pre-processing normalisation steps significantly reduce the number of variables in the data and thereby increase the power of statistical testing, but can potentially discard valuable and insightful classification features. A systematic assessment of applying transcript level filtering on the robustness and stability of ML based RNA-seq classification remains to be fully explored. In this report we examine the impact of filtering out low count transcripts and those with influential outlier read counts on downstream ML analysis for sepsis biomarker discovery using elastic net regularised logistic regression, L1-regularised support vector machines and random forests. We demonstrate that applying a systematic objective strategy for removal of uninformative and potentially biasing biomarkers representing up to 60% of transcripts in different sample size datasets, including two illustrative neonatal sepsis cohorts, leads to substantial improvements in classification performance, higher stability of the resulting gene signatures, and better agreement with previously reported sepsis biomarkers. We also demonstrate that the performance uplift from gene filtering depends on the ML classifier chosen, with L1-regularised support vector machines showing the greatest performance improvements with our experimental data.
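The low-count filtering step described in this abstract can be illustrated with a minimal sketch. The function name `filter_low_count`, the thresholds `min_count` and `min_fraction`, and the toy counts are all assumptions made for illustration; the abstract does not state the authors' actual filtering criteria, and their handling of influential outliers is not shown here.

```python
def filter_low_count(counts, min_count=10, min_fraction=0.5):
    """Keep transcripts with at least `min_count` reads in at least
    `min_fraction` of samples (both thresholds are hypothetical).

    counts: dict mapping transcript id -> list of read counts, one per sample.
    """
    kept = {}
    for transcript, values in counts.items():
        if not values:
            continue  # no samples, nothing to evaluate
        n_expressed = sum(1 for v in values if v >= min_count)
        if n_expressed / len(values) >= min_fraction:
            kept[transcript] = values
    return kept


# Toy data: three transcripts measured in four samples.
counts = {
    "geneA": [120, 95, 130, 88],
    "geneB": [0, 2, 1, 0],
    "geneC": [15, 3, 22, 40],
}
filtered = filter_low_count(counts)
print(sorted(filtered))
```

In a real workflow the retained transcripts would then be passed to the downstream classifier; the point of the sketch is only that filtering happens before model fitting.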
Affiliation(s)
- Edward Parkinson
- Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- *Correspondence: Edward Parkinson
| | - Federico Liberatore
- Department of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
| | - W. John Watkins
- Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
| | - Robert Andrews
- Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
| | - Sarah Edkins
- Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
| | - Julie Hibbert
- Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
- Medical School, University of Western Australia, Perth, WA, Australia
- Centre for Molecular Medicine and Innovative Therapeutics, Murdoch University, Perth, WA, Australia
| | - Tobias Strunk
- Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
- Medical School, University of Western Australia, Perth, WA, Australia
- Neonatal Directorate, Child and Adolescent Health Service, Perth, WA, Australia
| | - Andrew Currie
- Wesfarmers Centre of Vaccines and Infectious Diseases, Telethon Kids Institute, Perth, WA, Australia
- Centre for Molecular Medicine and Innovative Therapeutics, Murdoch University, Perth, WA, Australia
| | - Peter Ghazal
- Project Sepsis, Systems Immunity Research Institute, Cardiff University, Cardiff, United Kingdom
| |
|
24
|
Singh T, Malik G, Someshwar S, Le HTT, Polavarapu R, Chavali LN, Melethadathil N, Sundararajan VS, Valadi J, Kavi Kishor PB, Suravajhala P. Machine Learning Heuristics on Gingivobuccal Cancer Gene Datasets Reveals Key Candidate Attributes for Prognosis. Genes (Basel) 2022; 13:2379. [PMID: 36553647 PMCID: PMC9777687 DOI: 10.3390/genes13122379] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 11/28/2022] [Accepted: 12/07/2022] [Indexed: 12/23/2022] Open
Abstract
Delayed cancer detection is one of the common causes of poor prognosis in the case of many cancers, including cancers of the oral cavity. Despite the improvement and development of new and efficient gene therapy treatments, very little has been carried out to algorithmically assess the impedance of these carcinomas. In this work, using attributes of NCBI's oral cancer datasets, viz. (i) name, (ii) gene(s), (iii) protein change, (iv) condition(s), clinical significance (last reviewed), we sought to train on the instances emerging from them. Further, we attempt to annotate viable attributes in oral cancer gene datasets for the identification of gingivobuccal cancer (GBC). We further apply supervised and unsupervised machine learning methods to the gene datasets, revealing key candidate attributes for GBC prognosis. Our work highlights the importance of automated identification of key genes responsible for GBC that could perhaps be easily replicated in other forms of oral cancer detection.
Affiliation(s)
| | - Girik Malik
- Bioclues.org, Hyderabad 500072, India
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | | | - Hien Thi Thu Le
- Molecular Signaling Lab, Faculty of Medicine & Health Technology, Tampere University, 33100 Tampere, Finland
| | - Rathnagiri Polavarapu
- Amity Institute of Microbial Technology, Amity University, SP-1 Kant Kalwar, NH11C, RIICO Industrial Area, Rajasthan 303002, India
| | | | | | | | - Jayaraman Valadi
- Bioclues.org, Hyderabad 500072, India
- Department of Computer Science, FLAME University, Pune 412115, India
| | - P. B. Kavi Kishor
- MNR Foundation for Research & Innovation, MNR Medical College and Hospital, Fasalwadi, Sangareddy, Hyderabad 502294, India
| | - Prashanth Suravajhala
- Bioclues.org, Hyderabad 500072, India
- Amrita School of Biotechnology, Amrita Vishwa Vidyapeetham, Clappana 690525, India
- Correspondence:
| |
|
25
|
Qiu F, Zheng P, Heidari AA, Liang G, Chen H, Karim FK, Elmannai H, Lin H. Mutational Slime Mould Algorithm for Gene Selection. Biomedicines 2022; 10:2052. [PMID: 36009599 PMCID: PMC9406076 DOI: 10.3390/biomedicines10082052] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/22/2022] [Revised: 08/14/2022] [Accepted: 08/16/2022] [Indexed: 02/02/2023] Open
Abstract
A large volume of high-dimensional genetic data has been produced in modern medicine and biology. Data-driven decision-making is particularly crucial to clinical practice and relevant procedures. However, high-dimensional data in these fields increase the processing complexity and scale. Identifying representative genes and reducing the data's dimensions is often challenging. The purpose of gene selection is to eliminate irrelevant or redundant features to reduce the computational cost and improve classification accuracy. The wrapper gene selection model is based on a feature set, which can reduce the number of features and improve classification accuracy. This paper proposes a wrapper gene selection method based on the slime mould algorithm (SMA) to solve this problem. SMA is a new algorithm with a lot of application space in the feature selection field. This paper improves the original SMA by combining the Cauchy mutation mechanism with the crossover mutation strategy based on differential evolution (DE). Then, a transfer function converts the continuous optimizer into a binary version to solve the gene selection problem. First, the continuous version of the method, ISMA, is tested on 33 classical continuous optimization problems. Then, the discrete version, BISMA, is studied thoroughly by comparing it with other gene selection methods on 14 gene expression datasets. Experimental results show that the continuous version of the algorithm achieves an optimal balance between local exploitation and global search capabilities, and the discrete version of the algorithm has the highest accuracy when selecting the least number of genes.
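The idea of a transfer function that turns a continuous optimizer position into a binary gene mask can be sketched as follows. This is an illustration only: the abstract does not say which transfer function BISMA uses, so the S-shaped (sigmoid) form, the function names, and the example position vector are all assumptions.

```python
import math
import random


def sigmoid_transfer(x):
    """S-shaped transfer function mapping a real value to a probability.
    The sigmoid form is an assumption, not taken from the paper."""
    return 1.0 / (1.0 + math.exp(-x))


def binarize(position, rng):
    """Convert a continuous position vector into a 0/1 gene-selection mask.

    Each gene is selected with probability given by the transfer function.
    """
    return [1 if rng.random() < sigmoid_transfer(x) else 0 for x in position]


rng = random.Random(0)                    # seeded for reproducibility
position = [-6.0, -0.5, 0.0, 2.0, 6.0]    # hypothetical continuous solution
mask = binarize(position, rng)
selected = [i for i, bit in enumerate(mask) if bit]  # indices of selected genes
print(mask, selected)
```

In a wrapper method, each such mask would be scored by training a classifier on the selected genes, and the score would be fed back to the optimizer.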
Affiliation(s)
- Feng Qiu
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Pan Zheng
- Information Systems, University of Canterbury, Christchurch 8014, New Zealand
| | - Ali Asghar Heidari
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Guoxi Liang
- Department of Information Technology, Wenzhou Polytechnic, Wenzhou 325035, China
| | - Huiling Chen
- Department of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
| | - Faten Khalid Karim
- Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
| | - Hela Elmannai
- Department of Information Technology, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
| | - Haiping Lin
- Department of Information Engineering, Hangzhou Vocational & Technical College, Hangzhou 310018, China
| |
|
26
|
A novel biomarker selection method combining graph neural network and gene relationships applied to microarray data. BMC Bioinformatics 2022; 23:303. [PMID: 35883022 PMCID: PMC9327232 DOI: 10.1186/s12859-022-04848-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2022] [Accepted: 07/15/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The discovery of critical biomarkers is significant for clinical diagnosis, drug research and development. Researchers usually obtain biomarkers from microarray data, which suffers from the curse of dimensionality. Feature selection in machine learning is usually used to solve this problem. However, most methods do not fully consider feature dependence, especially the real pathway relationship of genes. RESULTS Experimental results show that the proposed method is superior to classical algorithms and advanced methods in feature number and accuracy, and the selected features have more significance. METHOD This paper proposes a feature selection method based on a graph neural network. The proposed method uses the actual dependencies between features and the Pearson correlation coefficient to construct graph-structured data. The information dissemination and aggregation operations of the graph neural network are applied to fuse node information on the graph-structured data. The redundant features are clustered by the spectral clustering method. Then, a feature ranking aggregation model using eight feature evaluation methods is applied to each sub-cluster to select features. CONCLUSION The proposed method can effectively remove redundant features. The algorithm's output has high stability and classification accuracy, and it can potentially identify candidate biomarkers.
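The graph construction step mentioned under METHOD can be sketched as below. This covers only the Pearson-correlation part: the pathway-based dependencies, the graph neural network, and the spectral clustering are not shown. The edge rule (absolute correlation at or above `threshold`), the function names, and the toy expression values are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # constant profile: treat correlation as zero
    return cov / (norm_x * norm_y)


def correlation_graph(expression, threshold=0.8):
    """Return edges (gene_a, gene_b, r) for gene pairs with |r| >= threshold.
    The threshold value is hypothetical."""
    genes = sorted(expression)
    edges = []
    for i, gene_a in enumerate(genes):
        for gene_b in genes[i + 1:]:
            r = pearson(expression[gene_a], expression[gene_b])
            if abs(r) >= threshold:
                edges.append((gene_a, gene_b, r))
    return edges


# Toy expression profiles across four samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, -1.0, 1.0, -1.0],
}
edges = correlation_graph(expression)
print([(a, b) for a, b, _ in edges])
```

Using the absolute value keeps strongly anti-correlated genes connected as well, which is one possible design choice rather than something the abstract specifies.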
|
27
|
Mahendran N, Vincent PMDR, Srinivasan K, Chang CY. Improving the Classification of Alzheimer's Disease Using Hybrid Gene Selection Pipeline and Deep Learning. Front Genet 2021; 12:784814. [PMID: 34868275 PMCID: PMC8632950 DOI: 10.3389/fgene.2021.784814] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/28/2021] [Accepted: 10/20/2021] [Indexed: 11/13/2022] Open
Abstract
Alzheimer’s is a progressive, irreversible, neurodegenerative brain disease. Even with prominent symptoms, it takes years to notice, decode, and reveal Alzheimer’s. However, advancements in technologies, such as imaging techniques, help in early diagnosis. Still, sometimes the results are inaccurate, which delays the treatment. Thus, recent research has focused on identifying the molecular biomarkers that differentiate the genotype and phenotype characteristics. However, the number of features generated from gene expression datasets is huge: 1,000 or even more than 10,000. To overcome such a curse of dimensionality, feature selection techniques are introduced. We designed a gene selection pipeline combining a filter, wrapper, and unsupervised method to select the relevant genes. We combined the minimum Redundancy and maximum Relevance (mRmR), Wrapper-based Particle Swarm Optimization (WPSO), and Auto encoder to select the relevant features. We used the GSE5281 Alzheimer’s dataset from the Gene Expression Omnibus. We implemented an Improved Deep Belief Network (IDBN) with simple stopping criteria after choosing the relevant genes. We used a Bayesian Optimization technique to tune the hyperparameters in the Improved Deep Belief Network. The tabulated results show that the proposed pipeline shows promising results.
Affiliation(s)
- Nivedhitha Mahendran
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - P M Durai Raj Vincent
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Kathiravan Srinivasan
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
| | - Chuan-Yu Chang
- Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Yunlin, Taiwan
| |
|
28
|
Mori Y, Yokota H, Hoshino I, Iwatate Y, Wakamatsu K, Uno T, Suyari H. Deep learning-based gene selection in comprehensive gene analysis in pancreatic cancer. Sci Rep 2021; 11:16521. [PMID: 34389782 PMCID: PMC8363643 DOI: 10.1038/s41598-021-95969-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/28/2021] [Accepted: 07/29/2021] [Indexed: 12/14/2022] Open
Abstract
The selection of genes that are important for obtaining gene expression data is challenging. Here, we developed a deep learning-based feature selection method suitable for gene selection. Our novel deep learning model includes an additional feature-selection layer. After model training, the units in this layer with high weights correspond to the genes that worked effectively in the processing of the networks. Cancer tissue samples and adjacent normal pancreatic tissue samples were collected from 13 patients with pancreatic ductal adenocarcinoma during surgery and subsequently frozen. After processing, gene expression data were extracted from the specimens using RNA sequencing. Task 1 for the model training was to discriminate between cancerous and normal pancreatic tissue in six patients. Task 2 was to discriminate between patients with pancreatic cancer (n = 13) who survived for more than one year after surgery. The most frequently selected genes were ACACB, ADAMTS6, NCAM1, and CADPS in Task 1, and CD1D, PLA2G16, DACH1, and SOWAHA in Task 2. According to The Cancer Genome Atlas dataset, these genes are all prognostic factors for pancreatic cancer. Thus, the feasibility of using our deep learning-based method for the selection of genes associated with pancreatic cancer development and prognosis was confirmed.
Affiliation(s)
- Yasukuni Mori
- Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba, 263-8522, Japan.
| | - Hajime Yokota
- Department of Diagnostic Radiology and Radiation Oncology, Graduate School of Medicine, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba-shi, Chiba, 260-8670, Japan
| | - Isamu Hoshino
- Division of Gastroenterological Surgery, Chiba Cancer Center, 666-2 Nitona-cho, Chuo-ku, Chiba-shi, Chiba, 260-8717, Japan
| | - Yosuke Iwatate
- Division of Hepato-Biliary-Pancreatic Surgery, Chiba Cancer Center, 666-2 Nitona-cho, Chuo-ku, Chiba-shi, Chiba, 260-8717, Japan
| | - Kohei Wakamatsu
- Media Data Tech Studio, CyberAgent, Inc., 13F Akihabara Daibiru, 1-18-13 Sotokanda, Chiyoda-ku, Tokyo, 101-0021, Japan
| | - Takashi Uno
- Department of Diagnostic Radiology and Radiation Oncology, Graduate School of Medicine, Chiba University, 1-8-1 Inohana, Chuo-ku, Chiba-shi, Chiba, 260-8670, Japan
| | - Hiroki Suyari
- Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba-shi, Chiba, 263-8522, Japan
| |
|
29
|
Mahmoudian M, Venäläinen MS, Klén R, Elo LL. Stable Iterative Variable Selection. Bioinformatics 2021; 37:4810-4817. [PMID: 34270690 PMCID: PMC8665768 DOI: 10.1093/bioinformatics/btab501] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 12/30/2020] [Revised: 05/20/2021] [Accepted: 07/14/2021] [Indexed: 11/13/2022] Open
Abstract
Motivation: The emergence of datasets with tens of thousands of features, such as high-throughput omics biomedical data, highlights the importance of reducing the feature space into a distilled subset that can truly capture the signal for research and industry by aiding in finding more effective biomarkers for the question at hand. A good feature set also facilitates building robust predictive models with improved interpretability and convergence of the applied method due to the smaller feature space. Results: Here, we present a robust feature selection method named Stable Iterative Variable Selection (SIVS) and assess its performance over both omics and clinical data types. As a performance assessment metric, we compared the number and goodness of the features selected using SIVS to those selected by Least Absolute Shrinkage and Selection Operator regression. The results suggested that the feature space selected by SIVS was, on average, 41% smaller, without having a negative effect on the model performance. A similar result was observed for comparison with Boruta and caret RFE. Availability and implementation: The method is implemented as an R package under GNU General Public License v3.0 and is accessible via the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=sivs. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mehrad Mahmoudian
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.,Department of Future Technologies, University of Turku, Turku, Finland
| | - Mikko S Venäläinen
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
| | - Riku Klén
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland
| | - Laura L Elo
- Turku Bioscience Centre, University of Turku and Åbo Akademi University, Turku, Finland.,Institute of Biomedicine, University of Turku, Turku, Finland
| |
|
30
|
Zhang Z, van Dijk F, de Klein N, van Gijn ME, Franke LH, Sinke RJ, Swertz MA, van der Velde KJ. Feasibility of predicting allele specific expression from DNA sequencing using machine learning. Sci Rep 2021; 11:10606. [PMID: 34012022 PMCID: PMC8134421 DOI: 10.1038/s41598-021-89904-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/21/2021] [Accepted: 05/04/2021] [Indexed: 11/09/2022] Open
Abstract
Allele specific expression (ASE) concerns divergent expression quantity of alternative alleles and is measured by RNA sequencing. Multiple studies show that ASE plays a role in hereditary diseases by modulating penetrance or phenotype severity. However, genome diagnostics is based on DNA sequencing and therefore neglects gene expression regulation such as ASE. To take advantage of ASE in the absence of RNA sequencing, it must be predicted using only DNA variation. We have constructed ASE models from BIOS (n = 3432) and GTEx (n = 369) that predict ASE using DNA features. These models are highly reproducible and comprise many different feature types, highlighting the complex regulation that underlies ASE. We applied the BIOS-trained model to population variants in three genes in which ASE plays a clinically relevant role: BRCA2, RET and NF1. This resulted in predicted ASE effects for 27 variants, of which 10 were known pathogenic variants. We demonstrated that ASE can be predicted from DNA features using machine learning. Future efforts may improve sensitivity and translate these models into a new type of genome diagnostic tool that prioritizes candidate pathogenic variants or regulators thereof for follow-up validation by RNA sequencing. All code and machine learning models used are available at GitHub and Zenodo.
Affiliation(s)
- Zhenhua Zhang
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Freerk van Dijk
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Prinses Maxima Center for Child Oncology, Heidelberglaan 25, 3584 CS, Utrecht, The Netherlands
| | - Niek de Klein
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Mariëlle E van Gijn
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Lude H Franke
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Richard J Sinke
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - Morris A Swertz
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands
| | - K Joeri van der Velde
- Genomics Coordination Center, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands.
- Department of Genetics, University of Groningen and University Medical Center Groningen, Antonius Deusinglaan 1, 9713 AV, Groningen, The Netherlands.
| |
|