1
Sahoo K, Sundararajan V. Methods in DNA methylation array dataset analysis: A review. Comput Struct Biotechnol J 2024; 23:2304-2325. [PMID: 38845821 PMCID: PMC11153885 DOI: 10.1016/j.csbj.2024.05.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 04/25/2024] [Accepted: 05/08/2024] [Indexed: 06/09/2024]
Abstract
Understanding the intricate relationships between gene expression levels and epigenetic modifications in a genome is crucial to comprehending the pathogenic mechanisms of many diseases. With the advancement of DNA Methylome Profiling techniques, the emphasis on identifying Differentially Methylated Regions/Genes (DMRs/DMGs) has become crucial for biomarker discovery, offering new insights into the etiology of illnesses. This review surveys the current state of computational tools/algorithms for the analysis of microarray-based DNA methylation profiling datasets, focusing on key concepts underlying the extraction of diagnostic/prognostic CpG sites. It addresses methodological frameworks, algorithms, and pipelines employed by various authors, serving as a roadmap to address challenges and understand changing trends in the methodologies for analyzing array-based DNA methylation profiling datasets derived from diseased genomes. Additionally, it highlights the importance of integrating gene expression and methylation datasets for accurate biomarker identification, explores prognostic prediction models, and discusses molecular subtyping for disease classification. The review also emphasizes the contributions of machine learning, neural networks, and data mining to enhance diagnostic workflow development, thereby improving accuracy, precision, and robustness.
Affiliation(s)
- Vino Sundararajan
- Correspondence to: Department of Bio Sciences, School of Bio Sciences and Technology, Vellore Institute of Technology, Vellore 632 014, Tamil Nadu, India.
2
Augustine J, Jereesh AS. Identification of gene-level methylation for disease prediction. Interdiscip Sci 2023; 15:678-695. [PMID: 37603212 DOI: 10.1007/s12539-023-00584-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2023] [Revised: 07/30/2023] [Accepted: 08/01/2023] [Indexed: 08/22/2023]
Abstract
DNA methylation is an epigenetic alteration that plays a fundamental part in governing gene regulatory processes. The DNA methylation mechanism affixes methyl groups to distinct cytosine residues, influencing chromatin architecture. Multiple studies have demonstrated that DNA methylation's regulatory effect on genes is linked to the onset and progression of several disorders. Researchers have recently uncovered thousands of phenotype-related methylation sites through epigenome-wide association studies (EWAS). However, combining the methylation levels of several sites within a gene and determining the gene-level DNA methylation remains challenging. In this study, we proposed the supervised UMAP Assisted Gene-level Methylation method (sUAGM) for disease prediction based on supervised UMAP (Uniform Manifold Approximation and Projection), a manifold learning-based method for reducing dimensionality. The gene-level methylation values generated by the proposed method are evaluated by employing various feature selection and classification algorithms on three distinct DNA methylation datasets derived from blood samples. Performance has been assessed using classification accuracy, F1-score, Matthews Correlation Coefficient (MCC), Kappa, Classification Success Index (CSI), and Jaccard Index. The Support Vector Machine with the linear kernel (SVML) classifier with Recursive Feature Elimination (RFE) performs best across all three datasets. In a comparative analysis, our method outperformed existing gene-level and site-level approaches by achieving 100% accuracy and F1-score with fewer genes. The functional analysis of the top 28 genes selected from the Parkinson's disease dataset revealed a significant association with the disease.
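Several of the evaluation measures named in this abstract have simple closed forms for a binary classifier. The sketch below is not the authors' code; it only illustrates, with the standard textbook definitions, how accuracy, F1-score, MCC, Cohen's kappa, and the Jaccard index follow from the four confusion-matrix counts. The function names and the toy label vectors are invented for illustration, and the Classification Success Index is left out because the abstract does not state which definition was used.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def binary_metrics(y_true, y_pred, positive=1):
    """Compute common binary metrics from the confusion-matrix counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    n = tp + fp + fn + tn
    if n == 0:
        raise ValueError("at least one sample is required")

    accuracy = (tp + tn) / n
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance != 1 else 0.0

    jac_den = tp + fp + fn
    jaccard = tp / jac_den if jac_den else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc,
            "kappa": kappa, "jaccard": jaccard}


# Toy labels (1 = disease, 0 = control); purely illustrative values.
example = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                         [1, 1, 0, 0, 0, 1, 1, 0])
print(example)
```

Reporting MCC and kappa next to accuracy is useful on imbalanced case/control datasets, where accuracy alone can look high for a classifier that mostly predicts the majority class.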
Affiliation(s)
- Jisha Augustine
- Bioinformatics Lab, Department of Computer Science, Cochin University of Science and Technology, Cochin, Kerala, 682022, India.
- A S Jereesh
- Bioinformatics Lab, Department of Computer Science, Cochin University of Science and Technology, Cochin, Kerala, 682022, India
3
Deep-Learning Algorithm and Concomitant Biomarker Identification for NSCLC Prediction Using Multi-Omics Data Integration. Biomolecules 2022; 12:biom12121839. [PMID: 36551266 PMCID: PMC9775093 DOI: 10.3390/biom12121839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2022] [Revised: 12/05/2022] [Accepted: 12/05/2022] [Indexed: 12/14/2022]
Abstract
Early diagnosis of lung cancer remains a critical need for raising the survival rate, which is currently low (in the mid-30% range). Despite this, multi-omics data have rarely been applied to non-small-cell lung cancer (NSCLC) diagnosis. We developed a multi-omics data-affinitive artificial intelligence algorithm based on the graph convolutional network that integrates mRNA expression, DNA methylation, and DNA sequencing data. This NSCLC prediction model achieved a 93.7% macro F1-score, indicating that both false positives and false negatives were low, which is desirable for accurate classification. Gene ontology enrichment and pathway analysis of features revealed that two major subtypes of NSCLC, lung adenocarcinoma and lung squamous cell carcinoma, have both specific and common GO biological processes. Numerous biomarkers (i.e., microRNA, long non-coding RNA, differentially methylated regions) were newly identified, whereas some biomarkers were consistent with previous findings in NSCLC (e.g., SPRR1B). Thus, using multi-omics data integration, we developed a promising cancer prediction algorithm.
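The macro F1-score reported above averages the per-class F1 values with equal weight, so a weak minority class lowers the score as much as a weak majority class would. The following is a minimal sketch of that computation using the standard definition; it is not taken from the paper, and the class names (`LUAD`, `LUSC`, `normal`) and the toy predictions are assumptions made for illustration only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard macro averaging)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    labels = sorted(set(y_true) | set(y_pred))
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes F1 = 0.
        per_class[label] = 2 * tp / denom if denom else 0.0

    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical three-class example; the labels and predictions are invented.
truth = ["LUAD", "LUAD", "LUSC", "LUSC", "normal", "normal"]
preds = ["LUAD", "LUSC", "LUSC", "LUSC", "normal", "normal"]
score, per_class_f1 = macro_f1(truth, preds)
print(round(score, 4), per_class_f1)
```

In this toy case one misclassified LUAD sample lowers both the LUAD F1 (a false negative) and the LUSC F1 (a false positive), which is why a high macro F1 implies few errors of either kind in every class.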
4
Qiu WR, Qi BB, Lin WZ, Zhang SH, Yu WK, Huang SF. Predicting the Lung Adenocarcinoma and Its Biomarkers by Integrating Gene Expression and DNA Methylation Data. Front Genet 2022; 13:926927. [PMID: 35846148 PMCID: PMC9280023 DOI: 10.3389/fgene.2022.926927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2022] [Accepted: 06/13/2022] [Indexed: 11/17/2022]
Abstract
The early symptoms of lung adenocarcinoma are inapparent, and clinical diagnosis relies primarily on X-ray examination and pathological section examination; with the development of bioinformatics technology, the discovery of biomarkers points to another direction for diagnosis. However, diagnosing lung adenocarcinoma from omics data with high-dimension and low-sample-size (HDLSS) features, or from biomarkers produced using only a single type of omics data, is neither accurate nor trustworthy. To address these problems, feature selection methods from biological analysis are used to reduce the dimension of gene expression data (GSE19188) and DNA methylation data (GSE139032, GSE49996). In addition, the Cartesian product method is used to expand the sample set and integrate the gene expression and DNA methylation data. The classifier is built using a deep neural network and is evaluated with K-fold cross-validation. Moreover, gene ontology analysis and literature retrieval are used to analyze the biological relevance of the selected genes, and the TCGA database is used for survival analysis of these potential genes through Kaplan-Meier estimates to explore the detailed molecular mechanism of lung adenocarcinoma. Survival analysis shows that COL5A2 and SERPINB5 are significant for identifying lung adenocarcinoma and are considered biomarkers of the disease.
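The Cartesian product step mentioned above can be pictured as pairing every expression sample with every methylation sample and concatenating their feature vectors, which both enlarges the sample set and merges the two data types. The sketch below is one plausible reading, not the authors' implementation: it assumes that pairs are formed only between samples that share a class label, and the function name, labels, and feature values are all hypothetical.

```python
from itertools import product


def cartesian_integrate(expr_samples, meth_samples):
    """Pair expression and methylation samples that share a label.

    Each input is a list of (label, feature_list) tuples. The output holds
    one (label, concatenated_features) row per same-label pair.
    Restricting pairs to matching labels is an assumption of this sketch.
    """
    combined = []
    for (e_label, e_feat), (m_label, m_feat) in product(expr_samples, meth_samples):
        if e_label == m_label:
            combined.append((e_label, list(e_feat) + list(m_feat)))
    return combined


# Invented toy data: 2 expression features and 3 methylation features.
expression = [
    ("tumor", [5.1, 2.0]),
    ("tumor", [4.8, 2.2]),
    ("normal", [1.0, 0.9]),
]
methylation = [
    ("tumor", [0.80, 0.70, 0.90]),
    ("tumor", [0.75, 0.65, 0.85]),
    ("tumor", [0.90, 0.60, 0.80]),
    ("normal", [0.20, 0.10, 0.30]),
    ("normal", [0.25, 0.15, 0.20]),
]

integrated = cartesian_integrate(expression, methylation)
print(len(integrated), integrated[0])
```

With 2 x 3 tumor pairs and 1 x 2 normal pairs, the toy input grows from 8 single-omics samples to 8 integrated rows of 5 features each. Rows built from the same original sample are not independent, so in a setup like this it would seem safer to split the original samples into folds before pairing rather than after.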
Affiliation(s)
- Wang-Ren Qiu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
- *Correspondence: Wang-Ren Qiu; Shun-Fa Huang
- Bei-Bei Qi
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
- Wei-Zhong Lin
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
- Shou-Hua Zhang
- Department of General Surgery, Jiangxi Provincial Children’s Hospital, Nanchang, China
- Wang-Ke Yu
- Computer Department, Jing-De-Zhen Ceramic Institute, Jingdezhen, China
- Shun-Fa Huang
- School of Information Engineering, Jingdezhen University, Jingdezhen, China
- *Correspondence: Wang-Ren Qiu; Shun-Fa Huang
5
Chen Q, Wang Y, Liu Y, Xi B. ESRRG, ATP4A, and ATP4B as Diagnostic Biomarkers for Gastric Cancer: A Bioinformatic Analysis Based on Machine Learning. Front Physiol 2022; 13:905523. [PMID: 35812327 PMCID: PMC9262247 DOI: 10.3389/fphys.2022.905523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2022] [Accepted: 05/10/2022] [Indexed: 11/13/2022]
Abstract
Based on multiple bioinformatics methods and machine learning techniques, this study was designed to explore potential hub genes of gastric cancer with diagnostic value. The novel biomarkers were detected through multiple databases of gastric cancer–related genes. The NCBI Gene Expression Omnibus (GEO) database was used to obtain gene expression files. Three hub genes (ESRRG, ATP4A, and ATP4B) were detected through a combination of weighted gene co-expression network analysis (WGCNA), gene–gene interaction network analysis, and a supervised feature selection method. GEPIA2 was used to verify the differences in the expression levels of the hub genes between normal and cancer tissues using RNA-seq data from the Genotype-Tissue Expression (GTEx) and The Cancer Genome Atlas (TCGA) databases. The objectivity of the potential hub genes was also verified by immunohistochemistry in the Human Protein Atlas (HPA) database and by a transcription factor–hub gene regulatory network. Machine learning (ML) methods, including data pre-processing, model selection and cross-validation, and performance evaluation, were examined on the hub-gene expression profiles in five Gene Expression Omnibus datasets and verified on a GEO external validation (EV) dataset. Six supervised learning models (support vector machine, random forest, k-nearest neighbors, neural network, decision tree, and eXtreme Gradient Boosting) and one semi-supervised learning model (label spreading) were established to evaluate the diagnostic value of the biomarkers. Among the six supervised models, the support vector machine (SVM) algorithm was the most effective according to the calculated performance metrics, including area under the curve (AUC) scores of 0.93 and 0.99 on the test and external validation datasets, respectively. Furthermore, the semi-supervised model could also successfully learn and predict sample types, achieving a 0.986 AUC score on the EV dataset, even when only 10% of the samples in the five GEO datasets were labeled. In conclusion, three hub genes (ATP4A, ATP4B, and ESRRG) closely related to gastric cancer were mined, and an ML diagnostic model of gastric cancer was built on them.
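The AUC values quoted above have a direct probabilistic reading: the AUC is the chance that a randomly chosen cancer sample receives a higher classifier score than a randomly chosen normal sample, with ties counted as one half. A minimal sketch of that pairwise (Mann-Whitney) formulation follows; it is not the study's code, and the function name, scores, and labels are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: iterable of 1 (positive, e.g. tumor) or 0 (negative, e.g. normal).
    scores: classifier scores, where a higher score means "more likely positive".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present to compute AUC")

    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # correctly ranked pair
            elif pos_score == neg_score:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy scores, not taken from the study.
toy_labels = [1, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 4))
```

Here five of the six tumor/normal pairs are ranked correctly, so the toy AUC is 5/6. The quadratic loop is fine for small validation sets; for large ones a rank-based implementation would be the usual choice.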
Affiliation(s)
- Qiu Chen
- Medical College, Yangzhou University, Yangzhou, China
- Yu Wang
- College of Physics Science and Technology, Yangzhou University, Yangzhou, China
- Yongjun Liu
- College of Physics Science and Technology, Yangzhou University, Yangzhou, China
- Bin Xi
- College of Physics Science and Technology, Yangzhou University, Yangzhou, China
- *Correspondence: Bin Xi
6
Arslan E, Schulz J, Rai K. Machine Learning in Epigenomics: Insights into Cancer Biology and Medicine. Biochim Biophys Acta Rev Cancer 2021; 1876:188588. [PMID: 34245839 PMCID: PMC8595561 DOI: 10.1016/j.bbcan.2021.188588] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 03/09/2021] [Revised: 05/29/2021] [Accepted: 07/02/2021] [Indexed: 02/01/2023]
Abstract
The recent deluge of genome-wide technologies for mapping the epigenome, and the resulting data from cancer samples, has provided the opportunity to gain insights into the roles of epigenetic processes in cancer. However, the complexity, high dimensionality, sparsity, and noise associated with these data pose challenges for extensive integrative analyses. Machine Learning (ML) algorithms are particularly suited for epigenomic data analyses due to their flexibility and ability to learn underlying hidden structures. We will discuss four overlapping but distinct major categories under ML: dimensionality reduction, unsupervised methods, supervised methods, and deep learning (DL). We review the preferred use cases of these algorithms in analyses of cancer epigenomics data, in the hope of providing an overview of how ML approaches can be used to explore fundamental questions about the roles of the epigenome in cancer biology and medicine.
Affiliation(s)
- Emre Arslan
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
- Jonathan Schulz
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
- Kunal Rai
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America.
7
Klein S, Duda DG. Machine Learning for Future Subtyping of the Tumor Microenvironment of Gastro-Esophageal Adenocarcinomas. Cancers (Basel) 2021; 13:4919. [PMID: 34638408 PMCID: PMC8507866 DOI: 10.3390/cancers13194919] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/22/2021] [Revised: 09/27/2021] [Accepted: 09/28/2021] [Indexed: 12/11/2022]
Abstract
Tumor progression involves an intricate interplay between malignant cells and their surrounding tumor microenvironment (TME) at specific sites. The TME is dynamic and is composed of stromal, parenchymal, and immune cells, which mediate cancer progression and therapy resistance. Evidence from preclinical and clinical studies revealed that TME targeting and reprogramming can be a promising approach to achieve anti-tumor effects in several cancers, including gastro-esophageal adenocarcinoma (GEA). Thus, it is of great interest to use modern technology to understand the relevant components of programming the TME. Here, we discuss the approach of machine learning, which has recently gained increasing interest because of its ability to measure tumor parameters at the cellular level, reveal global features of relevance, and generate prognostic models. In this review, we discuss the relevant stromal composition of the TME in GEAs and how these components could be integrated. We also review the current progress in the application of machine learning in different medical disciplines that are relevant for the management and study of GEA.
Affiliation(s)
- Sebastian Klein
- Gerhard-Domagk-Institute for Pathology, University Hospital Münster, 48149 Münster, Germany
- Institute for Pathology, Faculty of Medicine, University Hospital Cologne, University of Cologne, 50931 Cologne, Germany
- Dan G. Duda
- Edwin L. Steele Laboratories for Tumor Biology, Department of Radiation Oncology, Massachusetts General Hospital, Harvard Medical School, Boston, MA 02478, USA
8
Lee CS, Brandt JD, Lee AY. Big Data and Artificial Intelligence in Ophthalmology: Where Are We Now? OPHTHALMOLOGY SCIENCE 2021; 1:100036. [PMID: 36249294 PMCID: PMC9560652 DOI: 10.1016/j.xops.2021.100036] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/25/2022]
Affiliation(s)
- Cecilia S. Lee
- Correspondence: Cecilia S. Lee, MD, MS, University of Washington, Box 359607, 325 Ninth Avenue, Seattle, WA 98104.