1. Zhang H, Lin C, Chen Y, Shen X, Wang R, Chen Y, Lyu J. Enhancing Molecular Network-Based Cancer Driver Gene Prediction Using Machine Learning Approaches: Current Challenges and Opportunities. J Cell Mol Med 2025; 29:e70351. [PMID: 39804102 PMCID: PMC11726689 DOI: 10.1111/jcmm.70351] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2024] [Revised: 12/24/2024] [Accepted: 01/02/2025] [Indexed: 01/16/2025]
Abstract
Cancer is a complex disease driven by mutations in the genes that play critical roles in cellular processes. The identification of cancer driver genes is crucial for understanding tumorigenesis, developing targeted therapies and identifying rational drug targets. Experimental identification and validation of cancer driver genes are time-consuming and costly. Studies have demonstrated that interactions among genes are associated with similar phenotypes. Therefore, identifying cancer driver genes using molecular network-based approaches is necessary. Molecular network-based random walk-based approaches, which integrate mutation data with protein-protein interaction networks, have been widely employed in predicting cancer driver genes and demonstrated robust predictive potential. However, recent advancements in deep learning, particularly graph-based models, have provided novel opportunities for enhancing the prediction of cancer driver genes. This review aimed to comprehensively explore how machine learning methodologies, particularly network propagation, graph neural networks, autoencoders, graph embeddings, and attention mechanisms, improve the scalability and interpretability of molecular network-based cancer gene prediction.
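The random-walk-based network propagation mentioned in this abstract can be illustrated with a short sketch. This is a generic random walk with restart on a toy network, not the method of any specific tool covered by the review; the toy adjacency matrix, the seed vector, and the restart probability `restart=0.3` are assumptions chosen for illustration only.

```python
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Propagate seed scores over a network until the scores stop changing."""
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # avoid division by zero for isolated nodes
    W = adj / col_sums                     # column-normalised transition matrix
    p0 = np.asarray(seeds, dtype=float)
    p0 = p0 / p0.sum()                     # restart distribution over seed genes
    p = p0.copy()
    for _ in range(max_iter):
        # One step: walk to a neighbour with prob. (1 - restart), else jump back to a seed
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Hypothetical toy network: a path 0-1-2-3, with gene 0 as the only seed
# (standing in for a gene with mutation evidence).
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
scores = random_walk_with_restart(adj, seeds=[1, 0, 0, 0])
ranking = np.argsort(-scores)              # genes ordered from highest to lowest score
print(scores, ranking)
```

In this toy case the score decays with network distance from the seed, which is the intuition behind ranking candidate genes by proximity to known or mutated genes.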
Affiliation(s)
- Hao Zhang
  - Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, Zhejiang, China
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, China
- Chaohuan Lin
  - Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, Zhejiang, China
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, China
- Ying'ao Chen
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, China
- Ruizhe Wang
  - Wenzhou Longwan High School, Wenzhou, Zhejiang, China
- Yiqi Chen
  - Wenzhou Longwan High School, Wenzhou, Zhejiang, China
- Jie Lyu
  - Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, Zhejiang, China
  - Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, China

2. Saarinen H, Goldsmith M, Wang RS, Loscalzo J, Maniscalco S. Disease gene prioritization with quantum walks. Bioinformatics 2024; 40:btae513. [PMID: 39171848 PMCID: PMC11361815 DOI: 10.1093/bioinformatics/btae513] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 06/23/2024] [Accepted: 08/16/2024] [Indexed: 08/23/2024]
Abstract
MOTIVATION: Disease gene prioritization methods assign scores to genes or proteins according to their likely relevance for a given disease based on a provided set of seed genes. This scoring can be used to find new biologically relevant genes or proteins for many diseases. Although methods based on classical random walks have proven to yield competitive results, quantum walk methods have not been explored to this end.
RESULTS: We propose a new algorithm for disease gene prioritization based on continuous-time quantum walks using the adjacency matrix of a protein-protein interaction (PPI) network. We demonstrate the success of our proposed quantum walk method by comparing it to several well-known gene prioritization methods on three disease sets, across seven different PPI networks. In order to compare these methods, we use cross-validation and examine the mean reciprocal ranks of recall and average precision values. We further validate our method by performing an enrichment analysis of the predicted genes for coronary artery disease.
AVAILABILITY AND IMPLEMENTATION: The data and code for the methods can be accessed at https://github.com/markgolds/qdgp.
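As a rough illustration of the idea of a continuous-time quantum walk driven by an adjacency matrix, the sketch below evolves a state that starts on the seed genes and scores each node by its measurement probability. It is a textbook construction under stated assumptions, not the authors' algorithm: using the adjacency matrix directly as the Hamiltonian, starting from a uniform superposition over seeds, the fixed time `t=1.0`, and the toy network are all illustrative choices. The linked repository is the reference for the actual method.

```python
import numpy as np

def quantum_walk_scores(adj, seed_idx, t=1.0):
    """Score nodes by |<j| exp(-i A t) |psi0>|^2 for a symmetric adjacency matrix A."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    # Initial state: uniform superposition over the seed nodes (assumption)
    psi0 = np.zeros(n, dtype=complex)
    psi0[list(seed_idx)] = 1.0 / np.sqrt(len(seed_idx))
    # A is real symmetric, so A = V diag(lambda) V^T and exp(-iAt) = V diag(exp(-i lambda t)) V^T
    evals, evecs = np.linalg.eigh(adj)
    psi_t = evecs @ (np.exp(-1j * evals * t) * (evecs.T @ psi0))
    return np.abs(psi_t) ** 2              # measurement probability at each node

# Hypothetical toy PPI network: a path 0-1-2-3 with node 0 as the only seed gene
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
qw_scores = quantum_walk_scores(adj, seed_idx=[0], t=1.0)
print(qw_scores)
```

Because the evolution is unitary, the scores always sum to one; unlike a classical random walk, they oscillate with `t` instead of converging, so the choice of evolution time matters.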
Affiliation(s)
- Harto Saarinen
  - Algorithmiq Ltd, FI-00160 Helsinki, Finland
  - Department of Mathematics and Statistics, Complex Systems Research Group, University of Turku, FI-20014, Turku, Finland
- Mark Goldsmith
  - Algorithmiq Ltd, FI-00160 Helsinki, Finland
  - Department of Mathematics and Statistics, Complex Systems Research Group, University of Turku, FI-20014, Turku, Finland
- Rui-Sheng Wang
  - Department of Medicine, Brigham and Women’s Hospital, Boston, MA 02115, United States
- Joseph Loscalzo
  - Department of Medicine, Brigham and Women’s Hospital, Boston, MA 02115, United States

3. Ratajczak F, Joblin M, Hildebrandt M, Ringsquandl M, Falter-Braun P, Heinig M. Speos: an ensemble graph representation learning framework to predict core gene candidates for complex diseases. Nat Commun 2023; 14:7206. [PMID: 37938585 PMCID: PMC10632370 DOI: 10.1038/s41467-023-42975-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Accepted: 10/27/2023] [Indexed: 11/09/2023]
Abstract
Understanding phenotype-to-genotype relationships is a grand challenge of 21st century biology with translational implications. The recently proposed "omnigenic" model postulates that effects of genetic variation on traits are mediated by core-genes and -proteins whose activities mechanistically influence the phenotype, whereas peripheral genes encode a regulatory network that indirectly affects phenotypes via core gene products. Here, we develop a positive-unlabeled graph representation-learning ensemble-approach based on a nested cross-validation to predict core-like genes for diverse diseases using Mendelian disorder genes for training. Employing mouse knockout phenotypes for external validations, we demonstrate that core-like genes display several key properties of core genes: Mouse knockouts of genes corresponding to our most confident predictions give rise to relevant mouse phenotypes at rates on par with the Mendelian disorder genes, and all candidates exhibit core gene properties like transcriptional deregulation in disease and loss-of-function intolerance. Moreover, as predicted for core genes, our candidates are enriched for drug targets and druggable proteins. In contrast to Mendelian disorder genes the new core-like genes are enriched for druggable yet untargeted gene products, which are therefore attractive targets for drug development. Interpretation of the underlying deep learning model suggests plausible explanations for our core gene predictions in form of molecular mechanisms and physical interactions. Our results demonstrate the potential of graph representation learning for the interpretation of biological complexity and pave the way for studying core gene properties and future drug development.
Affiliation(s)
- Florin Ratajczak
  - Institute of Network Biology (INET), Molecular Targets and Therapeutics Center (MTTC), Helmholtz Munich, Neuherberg, Germany
- Pascal Falter-Braun
  - Institute of Network Biology (INET), Molecular Targets and Therapeutics Center (MTTC), Helmholtz Munich, Neuherberg, Germany
  - Microbe-Host Interactions, Faculty of Biology, Ludwig-Maximilians-Universität München, Planegg-Martinsried, Germany
- Matthias Heinig
  - Institute of Computational Biology (ICB), Helmholtz Munich, Neuherberg, Germany
  - Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - German Centre for Cardiovascular Research (DZHK), Munich Heart Association, Partner Site Munich, Berlin, Germany

4. Hoang VT, Jeon HJ, You ES, Yoon Y, Jung S, Lee OJ. Graph Representation Learning and Its Applications: A Survey. Sensors (Basel, Switzerland) 2023; 23:4168. [PMID: 37112507 PMCID: PMC10144941 DOI: 10.3390/s23084168] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/08/2023] [Revised: 04/16/2023] [Accepted: 04/17/2023] [Indexed: 06/19/2023]
Abstract
Graphs are data structures that effectively represent relational data in the real world. Graph representation learning is a significant task since it could facilitate various downstream tasks, such as node classification, link prediction, etc. Graph representation learning aims to map graph entities to low-dimensional vectors while preserving graph structure and entity relationships. Over the decades, many models have been proposed for graph representation learning. This paper aims to show a comprehensive picture of graph representation learning models, including traditional and state-of-the-art models on various graphs in different geometric spaces. First, we begin with five types of graph embedding models: graph kernels, matrix factorization models, shallow models, deep-learning models, and non-Euclidean models. In addition, we also discuss graph transformer models and Gaussian embedding models. Second, we present practical applications of graph embedding models, from constructing graphs for specific domains to applying models to solve tasks. Finally, we discuss challenges for existing models and future research directions in detail. As a result, this paper provides a structured overview of the diversity of graph embedding models.
Affiliation(s)
- Van Thuy Hoang
  - Department of Artificial Intelligence, The Catholic University of Korea, 43, Jibong-ro, Bucheon-si 14662, Gyeonggi-do, Republic of Korea
- Hyeon-Ju Jeon
  - Data Assimilation Group, Korea Institute of Atmospheric Prediction Systems (KIAPS), 35, Boramae-ro 5-gil, Dongjak-gu, Seoul 07071, Republic of Korea
- Eun-Soon You
  - Department of Artificial Intelligence, The Catholic University of Korea, 43, Jibong-ro, Bucheon-si 14662, Gyeonggi-do, Republic of Korea
- Yoewon Yoon
  - Department of Social Welfare, Dongguk University, 30, Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
- Sungyeop Jung
  - Semiconductor Devices and Circuits Laboratory, Advanced Institute of Convergence Technology (AICT), Seoul National University, 145, Gwanggyo-ro, Yeongtong-gu, Suwon-si 16229, Gyeonggi-do, Republic of Korea
- O-Joun Lee
  - Department of Artificial Intelligence, The Catholic University of Korea, 43, Jibong-ro, Bucheon-si 14662, Gyeonggi-do, Republic of Korea

5. Wang L, Pan Z, Liu W, Wang J, Ji L, Shi D. A dual-attention based coupling network for diabetes classification with heterogeneous data. J Biomed Inform 2023; 139:104300. [PMID: 36736446 DOI: 10.1016/j.jbi.2023.104300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2022] [Revised: 12/02/2022] [Accepted: 01/26/2023] [Indexed: 02/05/2023]
Abstract
Diabetes Mellitus (DM) is a group of metabolic disorders characterized by hyperglycaemia in the absence of treatment. Classification of DM is essential as it corresponds to the respective diagnosis and treatment. In this paper, we propose a new coupling network with hierarchical dual-attention that utilizes heterogeneous data, including Flash Glucose Monitoring (FGM) data and biomarkers in electronic medical records. The long short-term memory-based FGM sub-network extracts the time-dependent features of dynamic FGM sequences, while the biomarkers sub-network learns the features of static biomarkers. The convolutional block attention module (CBAM) for dispersing the feature weights of the spatial and channel dimensions is implemented into the FGM sub-network to endure the variability of FGM and allows us to extract high-level discriminative features more accurately. To better adjust the importance weights of the characteristics of the two sub-networks, self-attention is introduced to integrate the characteristics of heterogeneous data. Based on the dataset provided by Peking University People's Hospital, the proposed method is evaluated through factorial experiments of multi-source heterogeneous data, ablation studies of various attention strategies, time consumption evaluation and quantitative evaluation. The benchmark tests reveal the proposed network achieves a type 1 and 2 diabetes classification accuracy of 95.835% and the comprehensive performance metrics, including Matthews correlation coefficient, F1-score and G-mean, are 91.333%, 94.939% and 94.937% respectively. In the factorial experiments, the proposed method reaches the maximum area under the receiver operating characteristic curve of 0.9428, which indicates the effectiveness of the coupling between the nominated sub-networks. The coupling network with a dual-attention strategy performs better than the one without or only with a single-attention strategy in the ablation study as well. 
In addition, the model is also tested on another data set, and the accuracy of the test reaches 94.286%, reflecting that the model is robust when it is transferred to untrained diabetes data. The experimental results show that the proposed method is feasible in the classification of diabetes types. The code is available at https://github.com/bitDalei/Diabetes-Classification-with-Heterogeneous-Data.
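The evaluation metrics named in this abstract (accuracy, Matthews correlation coefficient, F1-score and G-mean) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's results.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes each class has at least one true and one predicted example,
    so that none of the ratios below divides by zero.
    """
    sensitivity = tp / (tp + fn)                       # recall on the positive class
    specificity = tn / (tn + fp)                       # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)      # geometric mean of the two recalls
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"accuracy": accuracy, "f1": f1, "g_mean": g_mean, "mcc": mcc}

# Hypothetical counts for illustration only
metrics = binary_metrics(tp=50, fp=10, tn=30, fn=10)
print(metrics)
```

MCC and G-mean are commonly reported alongside accuracy because they remain informative when the two classes are imbalanced.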
Affiliation(s)
- Lei Wang
  - Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China
- Zhenglin Pan
  - Department of Endocrinology and Metabolism, Peking University People's Hospital, Beijing, China
- Wei Liu
  - Department of Endocrinology and Metabolism, Peking University People's Hospital, Beijing, China
- Junzheng Wang
  - MIIT Key Laboratory of Servo Motion Systems Drive and Control, School of Automation, Beijing Institute of Technology, Beijing, China
- Linong Ji
  - Department of Endocrinology and Metabolism, Peking University People's Hospital, Beijing, China
- Dawei Shi
  - Institute of Engineering Medicine, Beijing Institute of Technology, Beijing, China; MIIT Key Laboratory of Servo Motion Systems Drive and Control, School of Automation, Beijing Institute of Technology, Beijing, China

6. Manconi A, Gnocchi M, Milanesi L, Marullo O, Armano G. Framing Apache Spark in life sciences. Heliyon 2023; 9:e13368. [PMID: 36852030 PMCID: PMC9958288 DOI: 10.1016/j.heliyon.2023.e13368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 01/19/2023] [Accepted: 01/29/2023] [Indexed: 02/11/2023]
Abstract
Advances in high-throughput and digital technologies have required the adoption of big data for handling complex tasks in life sciences. However, the drift to big data led researchers to face technical and infrastructural challenges for storing, sharing, and analysing them. In fact, this kind of tasks requires distributed computing systems and algorithms able to ensure efficient processing. Cutting edge distributed programming frameworks allow to implement flexible algorithms able to adapt the computation to the data over on-premise HPC clusters or cloud architectures. In this context, Apache Spark is a very powerful HPC engine for large-scale data processing on clusters. Also thanks to specialised libraries for working with structured and relational data, it allows to support machine learning, graph-based computation, and stream processing. This review article is aimed at helping life sciences researchers to ascertain the features of Apache Spark and to assess whether it can be successfully used in their research activities.
Affiliation(s)
- Andrea Manconi
  - Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
- Matteo Gnocchi
  - Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
- Luciano Milanesi
  - Institute of Biomedical Technologies - National Research Council of Italy, Segrate (Mi), Italy
- Osvaldo Marullo
  - Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy
- Giuliano Armano
  - Department of Mathematics and Computer science - University of Cagliari, Cagliari, Italy

7. Wang H, Wang X, Liu W, Xie X, Peng S. deepDGA: Biomedical Heterogeneous Network-based Deep Learning Framework for Disease-Gene Association Predictions. 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2022:601-606. [DOI: 10.1109/bibm55620.2022.9995651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/03/2025]
Affiliation(s)
- Hong Wang
  - Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
- Xiaoqi Wang
  - Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
- Wenjuan Liu
  - Hunan University, College of Computer Science and Electronic Engineering, Changsha, China
- Xiaolan Xie
  - Guilin University of Technology, College of Information Science and Engineering, Guilin, China
- Shaoliang Peng
  - Hunan University, College of Computer Science and Electronic Engineering, Changsha, China

8. Predicting Parkinson disease related genes based on PyFeat and gradient boosted decision tree. Sci Rep 2022; 12:10004. [PMID: 35705654 PMCID: PMC9200794 DOI: 10.1038/s41598-022-14127-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2022] [Accepted: 06/01/2022] [Indexed: 11/10/2022]
Abstract
Identifying genes related to Parkinson’s disease (PD) is an active research topic in biomedical analysis, which plays a critical role in diagnosis and treatment. Recently, many studies have proposed different techniques for predicting disease-related genes. However, a few of these techniques are designed or developed for PD gene prediction. Most of these PD techniques are developed to identify only protein genes and discard long noncoding (lncRNA) genes, which play an essential role in biological processes and the transformation and development of diseases. This paper proposes a novel prediction system to identify protein and lncRNA genes related to PD that can aid in an early diagnosis. First, we preprocessed the genes into DNA FASTA sequences from the University of California Santa Cruz (UCSC) genome browser and removed the redundancies. Second, we extracted some significant features of DNA FASTA sequences using the PyFeat method with the AdaBoost as feature selection. These selected features achieved promising results compared with extracted features from some state-of-the-art feature extraction techniques. Finally, the features were fed to the gradient-boosted decision tree (GBDT) to diagnose different tested cases. Seven performance metrics were used to evaluate the performance of the proposed system. The proposed system achieved an average accuracy of 78.6%, the area under the curve equals 84.5%, the area under precision-recall (AUPR) equals 85.3%, F1-score equals 78.3%, Matthews correlation coefficient (MCC) equals 0.575, sensitivity (SEN) equals 77.1%, and specificity (SPC) equals 80.2%. The experiments demonstrate promising results compared with other systems. The predicted top-rank protein and lncRNA genes are verified based on a literature review.

9. Joodaki M, Dowlatshahi MB, Joodaki NZ. An ensemble feature selection algorithm based on PageRank centrality and fuzzy logic. Knowl Based Syst 2021. [DOI: 10.1016/j.knosys.2021.107538] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022]