1
|
Rehman A, Mujahid M, Saba T, Jeon G. Optimised stacked machine learning algorithms for genomics and genetics disorder detection in the healthcare industry. Funct Integr Genomics 2024; 24:23. [PMID: 38305949 DOI: 10.1007/s10142-024-01289-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2023] [Revised: 12/22/2023] [Accepted: 01/02/2024] [Indexed: 02/03/2024]
Abstract
With recent advances in precision medicine and healthcare computing, there is an enormous demand for developing machine learning algorithms in genomics to enhance the rapid analysis of disease disorders. Technological advancement in genomics and imaging provides clinicians with enormous amounts of data, but prediction is still mostly subjective, resulting in problematic medical treatment. Machine learning is being employed in several domains of the healthcare sector, encompassing clinical research, early disease identification, and medicinal innovation with a historical perspective. The main objective of this study is to detect patients who, based on several medical standards, are more susceptible to having a genetic disorder. A genetic disease prediction algorithm was employed, leveraging the patient's health history to evaluate the probability of diagnosing a genetic disorder. We developed a computationally efficient machine learning approach to predict the overall lifespan of patients with a genomics disorder and to classify and predict patients with a genetic disease. The SVM, RF, and ETC are stacked using two-layer meta-estimators to develop the proposed model. The first layer comprises all the baseline models employed to predict the outcomes based on the dataset. The second layer comprises a component known as a meta-classifier. Results from the experiment indicate that the model achieved an accuracy of 90.45% and a recall score of 90.19%. The area under the curve (AUC) for mitochondrial diseases is 98.1%; for multifactorial diseases, it is 97.5%; and for single-gene inheritance, it is 98.8%. The proposed approach presents a novel method for predicting patient prognosis in a manner that is unbiased, accurate, and comprehensive. The proposed approach outperforms human professionals using the current clinical standard for genetic disease classification in terms of identification accuracy. The implementation of stacked will significantly improve the field of biomedical research by improving the anticipation of genetic diseases.
Collapse
Affiliation(s)
- Amjad Rehman
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
| | - Muhammad Mujahid
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
| | - Tanzila Saba
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia
| | - Gwanggil Jeon
- Artificial Intelligence & Data Analytics Lab, CCIS, Prince Sultan University, Riyadh, 11586, Saudi Arabia.
- Department of Embedded Systems Engineering, Incheon National University, Incheon, 610101, Korea.
| |
Collapse
|
2
|
Song E. Persistent homology analysis of type 2 diabetes genome-wide association studies in protein-protein interaction networks. Front Genet 2023; 14:1270185. [PMID: 37823029 PMCID: PMC10562725 DOI: 10.3389/fgene.2023.1270185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2023] [Accepted: 09/12/2023] [Indexed: 10/13/2023] Open
Abstract
Genome-wide association studies (GWAS) involving increasing sample sizes have identified hundreds of genetic variants associated with complex diseases, such as type 2 diabetes (T2D); however, it is unclear how GWAS hits form unique topological structures in protein-protein interaction (PPI) networks. Using persistent homology, this study explores the evolution and persistence of the topological features of T2D GWAS hits in the PPI network with increasing p-value thresholds. We define an n-dimensional persistent disease module as a higher-order generalization of the largest connected component (LCC). The 0-dimensional persistent T2D disease module is the LCC of the T2D GWAS hits, which is significantly detected in the PPI network (196 nodes and 235 edges, P< 0.05). In the 1-dimensional homology group analysis, all 18 1-dimensional holes (loops) of the T2D GWAS hits persist over all p-value thresholds. The 1-dimensional persistent T2D disease module comprising these 18 persistent 1-dimensional holes is significantly larger than that expected by chance (59 nodes and 83 edges, P< 0.001), indicating a significant topological structure in the PPI network. Our computational topology framework potentially possesses broad applicability to other complex phenotypes in identifying topological features that play an important role in disease pathobiology.
Collapse
Affiliation(s)
- Euijun Song
- Yonsei University College of Medicine, Seoul, Republic of Korea
- Present: Independent Researcher, Gyeonggi, Republic of Korea
| |
Collapse
|
3
|
Ghasemi M, Rahgozar M, Kavousi K. Complex Disease Genes Identification Using a Heterogeneous Network Embedding Approach. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:875-882. [PMID: 35594221 DOI: 10.1109/tcbb.2022.3175598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Finding the causal relation between a gene and a disease using experimental approaches is a time-consuming and expensive task. However, computational approaches are cost-efficient methods for identifying candidate genes. This article proposes a new heterogeneous biological network embedding approach, named NetEM, to identify disease-associated genes. To evaluate NetEM, we examine six complex diseases, including peroxisomal disorders, sarcoma, grave's disease, lysosomal storage diseases, blood coagulation disorders, and cardiomyopathy hypertrophic. Our experiments indicate that NetEM outperforms three well-known state-of-the-art algorithms: Cardigan, DIAMOnD and GeneWanderer, in identifying disease genes. We examine TCGA data of Invasive Lobular Breast Cancer and CPTAC data of human glioblastoma as other case studies to evaluate NetEM using real data. This evaluation also indicates the validity of the method. The source codes of NetEM and data are available in the supplementary of this article.
Collapse
|
4
|
Yang K, Yang Y, Fan S, Xia J, Zheng Q, Dong X, Liu J, Liu Q, Lei L, Zhang Y, Li B, Gao Z, Zhang R, Liu B, Wang Z, Zhou X. DRONet: effectiveness-driven drug repositioning framework using network embedding and ranking learning. Brief Bioinform 2023; 24:6958501. [PMID: 36562715 DOI: 10.1093/bib/bbac518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2022] [Revised: 10/11/2022] [Accepted: 10/31/2022] [Indexed: 12/24/2022] Open
Abstract
As one of the most vital methods in drug development, drug repositioning emphasizes further analysis and research of approved drugs based on the existing large amount of clinical and experimental data to identify new indications of drugs. However, the existing drug repositioning methods didn't achieve enough prediction performance, and these methods do not consider the effectiveness information of drugs, which make it difficult to obtain reliable and valuable results. In this study, we proposed a drug repositioning framework termed DRONet, which make full use of effectiveness comparative relationships (ECR) among drugs as prior information by combining network embedding and ranking learning. We utilized network embedding methods to learn the deep features of drugs from a heterogeneous drug-disease network, and constructed a high-quality drug-indication data set including effectiveness-based drug contrast relationships. The embedding features and ECR of drugs are combined effectively through a designed ranking learning model to prioritize candidate drugs. Comprehensive experiments show that DRONet has higher prediction accuracy (improving 87.4% on Hit@1 and 37.9% on mean reciprocal rank) than state of the art. The case analysis also demonstrates high reliability of predicted results, which has potential to guide clinical drug development.
Collapse
Affiliation(s)
- Kuo Yang
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| | | | - Shuyue Fan
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| | - Jianan Xia
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| | - Qiguang Zheng
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| | - Xin Dong
- Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| | - Jun Liu
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, China
| | - Qiong Liu
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, China
| | - Lei Lei
- Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, China
| | - Yingying Zhang
- Dongzhimen Hospital, Beijing University of Chinese Medicine, China
| | - Bing Li
- Institute of Chinese Materia Medica, China Academy of Chinese Medical Sciences, China
| | - Zhuye Gao
- Xiyuan Hospital, China Academy of Chinese Medical Sciences, National Clinical Research Center for Chinese Medicine Cardiology, China
| | - Runshun Zhang
- Guanganmen Hospital, China Academy of Chinese Medical Sciences, China
| | - Baoyan Liu
- Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, China
| | - Zhong Wang
- Institute of Basic Research in Clinical Medicine, China Academy of Chinese Medical Sciences, China
| | - Xuezhong Zhou
- Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, China
| |
Collapse
|
5
|
Knowledge-Based Recurrent Neural Network for TCM Cerebral Palsy Diagnosis. EVIDENCE-BASED COMPLEMENTARY AND ALTERNATIVE MEDICINE 2022; 2022:7708376. [PMID: 36276852 PMCID: PMC9581687 DOI: 10.1155/2022/7708376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/22/2022] [Accepted: 08/30/2022] [Indexed: 11/07/2022]
Abstract
Cerebral palsy is one of the most prevalent neurological disorders and the most frequent cause of disability. Identifying the syndrome by patients' symptoms is the key to traditional Chinese medicine (TCM) cerebral palsy treatment. Artificial intelligence (AI) is advancing quickly in several sectors, including TCM. AI will considerably enhance the dependability and precision of diagnoses, expanding effective treatment methods' usage. Thus, for cerebral palsy, it is necessary to build a decision-making model to aid in the syndrome diagnosis process. While the recurrent neural network (RNN) model has the potential to capture the correlation between symptoms and syndromes from electronic medical records (EMRs), it lacks TCM knowledge. To make the model benefit from both TCM knowledge and EMRs, unlike the ordinary training routine, we begin by constructing a knowledge-based RNN (KBRNN) based on the cerebral palsy knowledge graph for domain knowledge. More specifically, we design an evolution algorithm for extracting knowledge in the cerebral palsy knowledge graph. Then, we embed the knowledge into tensors and inject them into the RNN. In addition, the KBRNN can benefit from the labeled EMRs. We use EMRs to fine-tune the KBRNN, which improves prediction accuracy. Our study shows that knowledge injection can effectively improve the model effect. The KBRNN can achieve 79.31% diagnostic accuracy with only knowledge injection. Moreover, the KBRNN can be further trained by the EMRs. The results show that the accuracy of fully trained KBRNN is 83.12%.
Collapse
|
6
|
Yang M, Huang ZA, Gu W, Han K, Pan W, Yang X, Zhu Z. Prediction of biomarker-disease associations based on graph attention network and text representation. Brief Bioinform 2022; 23:6651308. [PMID: 35901464 DOI: 10.1093/bib/bbac298] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Revised: 06/28/2022] [Accepted: 06/30/2022] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION The associations between biomarkers and human diseases play a key role in understanding complex pathology and developing targeted therapies. Wet lab experiments for biomarker discovery are costly, laborious and time-consuming. Computational prediction methods can be used to greatly expedite the identification of candidate biomarkers. RESULTS Here, we present a novel computational model named GTGenie for predicting the biomarker-disease associations based on graph and text features. In GTGenie, a graph attention network is utilized to characterize diverse similarities of biomarkers and diseases from heterogeneous information resources. Meanwhile, a pretrained BERT-based model is applied to learn the text-based representation of biomarker-disease relation from biomedical literature. The captured graph and text features are then integrated in a bimodal fusion network to model the hybrid entity representation. Finally, inductive matrix completion is adopted to infer the missing entries for reconstructing relation matrix, with which the unknown biomarker-disease associations are predicted. Experimental results on HMDD, HMDAD and LncRNADisease data sets showed that GTGenie can obtain competitive prediction performance with other state-of-the-art methods. AVAILABILITY The source code of GTGenie and the test data are available at: https://github.com/Wolverinerine/GTGenie.
Collapse
Affiliation(s)
- Minghao Yang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China
| | - Zhi-An Huang
- Center for Computer Science and Information Technology, City University of Hong Kong Dongguan Research Institute, Dongguan, China
| | - Wenhao Gu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China.,GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
| | - Kun Han
- GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
| | - Wenying Pan
- GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
| | - Xiao Yang
- GeneGenieDx Corp, 160 E Tasman Dr, San Jose, CA 95134
| | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518000, China
| |
Collapse
|
7
|
Perry Fordson H, Xing X, Guo K, Xu X. Emotion Recognition With Knowledge Graph Based on Electrodermal Activity. Front Neurosci 2022; 16:911767. [PMID: 35757534 PMCID: PMC9220300 DOI: 10.3389/fnins.2022.911767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2022] [Accepted: 04/25/2022] [Indexed: 11/13/2022] Open
Abstract
Electrodermal activity (EDA) sensor is emerging non-invasive equipment in affect detection research, which is used to measure electrical activities of the skin. Knowledge graphs are an effective way to learn representation from data. However, few studies analyzed the effect of knowledge-related graph features with physiological signals when subjects are in non-similar mental states. In this paper, we propose a model using deep learning techniques to classify the emotional responses of individuals acquired from physiological datasets. We aim to improve the execution of emotion recognition based on EDA signals. The proposed framework is based on observed gender and age information as embedding feature vectors. We also extract time and frequency EDA features in line with cognitive studies. We then introduce a sophisticated weighted feature fusion method that combines knowledge embedding feature vectors and statistical feature (SF) vectors for emotional state classification. We finally utilize deep neural networks to optimize our approach. Results obtained indicated that the correct combination of Gender-Age Relation Graph (GARG) and SF vectors improve the performance of the valence-arousal emotion recognition system by 4 and 5% on PAFEW and 3 and 2% on DEAP datasets.
Collapse
Affiliation(s)
- Hayford Perry Fordson
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
| | - Xiaofen Xing
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
| | - Kailing Guo
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
| | - Xiangmin Xu
- School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China.,School of Future Technology and the School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
| |
Collapse
|
8
|
Hou S, Zhang P, Yang K, Wang L, Ma C, Li Y, Li S. Decoding multilevel relationships with the human tissue-cell-molecule network. Brief Bioinform 2022; 23:6585388. [PMID: 35551347 DOI: 10.1093/bib/bbac170] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2022] [Revised: 04/09/2022] [Accepted: 04/16/2022] [Indexed: 02/01/2023] Open
Abstract
Understanding the biological functions of molecules in specific human tissues or cell types is crucial for gaining insights into human physiology and disease. To address this issue, it is essential to systematically uncover associations among multilevel elements consisting of disease phenotypes, tissues, cell types and molecules, which could pose a challenge because of their heterogeneity and incompleteness. To address this challenge, we describe a new methodological framework, called Graph Local InfoMax (GLIM), based on a human multilevel network (HMLN) that we established by introducing multiple tissues and cell types on top of molecular networks. GLIM can systematically mine the potential relationships between multilevel elements by embedding the features of the HMLN through contrastive learning. Our simulation results demonstrated that GLIM consistently outperforms other state-of-the-art algorithms in disease gene prediction. Moreover, GLIM was also successfully used to infer cell markers and rewire intercellular and molecular interactions in the context of specific tissues or diseases. As a typical case, the tissue-cell-molecule network underlying gastritis and gastric cancer was first uncovered by GLIM, providing systematic insights into the mechanism underlying the occurrence and development of gastric cancer. Overall, our constructed methodological framework has the potential to systematically uncover complex disease mechanisms and mine high-quality relationships among phenotypical, tissue, cellular and molecular elements.
Collapse
Affiliation(s)
- Siyu Hou
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Peng Zhang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Kuo Yang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China.,School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
| | - Lan Wang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Changzheng Ma
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Yanda Li
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| | - Shao Li
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, 100084 Beijing, China
| |
Collapse
|
9
|
Zhou K, Zhang S, Wang Y, Cohen KB, Kim JD, Luo Q, Yao X, Zhou X, Xia J. High-quality gene/disease embedding in a multi-relational heterogeneous graph after a joint matrix/tensor decomposition. J Biomed Inform 2022; 126:103973. [PMID: 34995810 DOI: 10.1016/j.jbi.2021.103973] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 10/20/2021] [Accepted: 12/04/2021] [Indexed: 10/19/2022]
Abstract
MOTIVATION Node embedding of biological entity network has been widely investigated for the downstream application scenarios. To embed full semantics of gene and disease, a multi-relational heterogeneous graph is considered in a scenario where uni-relation between gene/disease and other heterogeneous entities are abundant while multi-relation between gene and disease is relatively sparse. After introducing this novel graph format, it is illuminative to design a specific data integration algorithm to fully capture the graph information and bring embeddings with high quality. RESULTS First, a typical multi-relational triple dataset was introduced, which carried significant association between gene and disease. Second, we curated all human genes and diseases in seven mainstream datasets and constructed a large-scale gene-disease network, which compromising 163,024 nodes and 25,265,607 edges, and relates to 27,165 genes, 2,665 diseases, 15,067 chemicals, 108,023 mutations, 2,363 pathways, and 7.732 phenotypes. Third, we proposed a Joint Decomposition of Heterogeneous Matrix and Tensor (JDHMT) model, which integrated all heterogeneous data resources and obtained embedding for each gene or disease. Forth, a visualized intrinsic evaluation was performed, which investigated the embeddings in terms of interpretable data clustering. Furthermore, an extrinsic evaluation was performed in the form of linking prediction. Both intrinsic and extrinsic evaluation results showed that JDHMT model outperformed other eleven state-of-the-art (SOTA) methods which are under relation-learning, proximity-preserving or message-passing paradigms. Finally, the constructed gene-disease network, embedding results and codes were made available. DATA AND CODES AVAILABILITY The constructed massive gene-disease network is available at: https://hzaubionlp.com/heterogeneous-biological-network/. The codes are available at: https://github.com/bionlp-hzau/JDHMT.
Collapse
Affiliation(s)
- Kaiyin Zhou
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Sheng Zhang
- College of Science, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Yuxing Wang
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Kevin Bretonnel Cohen
- School of Medicine, University of Colorado Denver, Anschutz Medical Campus, Aurora, CO, USA
| | - Jin-Dong Kim
- Database Center for Life Science (DBCLS), Research Organization of Information and Systems (ROIS), Kashiwa, Tokyo, Japan
| | - Qi Luo
- College of Science, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Xinzhi Yao
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Xingyu Zhou
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China
| | - Jingbo Xia
- Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, Hubei, China.
| |
Collapse
|
10
|
Yang K, Lu K, Wu Y, Yu J, Liu B, Zhao Y, Chen J, Zhou X. A network-based machine-learning framework to identify both functional modules and disease genes. Hum Genet 2021; 140:897-913. [PMID: 33409574 DOI: 10.1007/s00439-020-02253-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2020] [Accepted: 12/22/2020] [Indexed: 01/20/2023]
Abstract
Disease gene identification is a critical step towards uncovering the molecular mechanisms of diseases and systematically investigating complex disease phenotypes. Despite considerable efforts to develop powerful computing methods, candidate gene identification remains a severe challenge owing to the connectivity of an incomplete interactome network, which hampers the discovery of true novel candidate genes. We developed a network-based machine-learning framework to identify both functional modules and disease candidate genes. In this framework, we designed a semi-supervised non-negative matrix factorization model to obtain the functional modules related to the diseases and genes. Of note, we proposed a disease gene-prioritizing method called MapGene that integrates the correlations from both functional modules and network closeness. Our framework identified a set of functional modules with highly functional homogeneity and close gene interactions. Experiments on a large-scale benchmark dataset showed that MapGene performs significantly better than the state-of-the-art algorithms. Further analysis demonstrates MapGene can effectively relieve the impact of the incompleteness of interactome networks and obtain highly reliable rankings of candidate genes. In addition, disease cases on Parkinson's disease and diabetes mellitus confirmed the generalization of MapGene for novel candidate gene identification. This work proposed, for the first time, an integrated computing framework to predict both functional modules and disease candidate genes. The methodology and results support that our framework has the potential to help discover underlying functional modules and reliable candidate genes in human disease.
Collapse
Affiliation(s)
- Kuo Yang
- School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China.,Institute for TCM-X, MOE Key Laboratory of Bioinformatics / Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing, 10084, China
| | - Kezhi Lu
- School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China.,imec-DistriNet, KU Leuven, Leuven, 3001, Belgium
| | - Yang Wu
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
| | - Jian Yu
- Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
| | - Baoyan Liu
- Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
| | - Yi Zhao
- Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
| | - Jianxin Chen
- Beijing University of Chinese Medicine, Beijing, 100029, China
| | - Xuezhong Zhou
- School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China. .,Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China.
| |
Collapse
|