1
Qiu X, Wang H, Tan X, Fang Z. G-K BertDTA: A graph representation learning and semantic embedding-based framework for drug-target affinity prediction. Comput Biol Med 2024; 173:108376. [PMID: 38552281 DOI: 10.1016/j.compbiomed.2024.108376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Revised: 03/21/2024] [Accepted: 03/24/2024] [Indexed: 04/17/2024]
Abstract
Developing new drugs is costly, time-consuming, and risky. Drug-target affinity (DTA), indicating the binding capability between drugs and target proteins, is a crucial indicator for drug development. Accurately predicting interaction strength between new drug-target pairs by analyzing previous experiments aids in screening potential drug molecules, repurposing them, and developing safe and effective medicines. Existing computational models for DTA prediction rely on strings or single-graph neural networks, lacking consideration of protein structure and molecular semantic information, leading to limited accuracy. Our experiments demonstrate that string-based methods may overlook protein conformations, causing a high root mean square error (RMSE) of 3.584 in affinity due to a lack of spatial context. Single-graph networks also underperform on topology features, with a 6% lower confidence interval (CI) for activity classification. Absent semantic information also limits generalization across diverse compounds, resulting in an 18% increase in RMSE and a 5% increase in misclassifications in our quantitative study, restricting potential drug discovery. To address these limitations, we propose G-K BertDTA, a novel framework for accurate DTA prediction incorporating protein features, molecular semantic features, and molecular structural information. In the proposed model, we represent drugs as graphs, with a graph isomorphism network (GIN) employed to learn the molecular topological information. For the extraction of protein structural features, we utilize a DenseNet architecture. A knowledge-based BERT semantic model is incorporated to obtain rich pre-trained semantic embeddings, thereby enhancing the feature information. We extensively evaluated our proposed approach on the publicly available benchmark datasets (i.e., KIBA and Davis), and experimental results demonstrate the promising performance of our method, which consistently outperforms previous state-of-the-art approaches. Code is available at https://github.com/AmbitYuki/G-K-BertDTA.
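The RMSE figures quoted in this abstract measure the typical gap between measured and predicted affinities. A minimal sketch of that calculation follows; the function name `rmse` and the sample values are hypothetical illustrations and are not taken from the paper, its code repository, or the KIBA and Davis datasets.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted affinities."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical affinity values, chosen only to make the arithmetic easy to follow.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [5.0, 6.0, 7.0, 10.0]

# Squared errors are 0, 0, 0 and 4, so the mean is 1 and the root is 1.
print(rmse(measured, predicted))  # prints 1.0
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is why it is a common choice for regression-style affinity prediction.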
Affiliation(s)
- Xihe Qiu
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Haoyu Wang
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Xiaoyu Tan
  - INF Technology (Shanghai) Co., Ltd., Shanghai, China
- Zhijun Fang
  - School of Computer Science and Technology, Donghua University, Shanghai, China
2
Štancl P, Karlić R. Machine learning for pan-cancer classification based on RNA sequencing data. Front Mol Biosci 2023; 10:1285795. [PMID: 38028533 PMCID: PMC10667476 DOI: 10.3389/fmolb.2023.1285795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023] [Open Access]
Abstract
Despite recent improvements in cancer diagnostics, 2%-5% of all malignancies are still cancers of unknown primary (CUP), for which the tissue-of-origin (TOO) cannot be determined at the time of presentation. Since the primary site of cancer leads to the choice of optimal treatment, CUP patients pose a significant clinical challenge with limited treatment options. Data produced by large-scale cancer genomics initiatives, which aim to determine the genomic, epigenomic, and transcriptomic characteristics of a large number of individual patients of multiple cancer types, have led to the introduction of various methods that use machine learning to predict the TOO of cancer patients. In this review, we assess the reproducibility, interpretability, and robustness of results obtained by 20 recent studies that utilize different machine learning methods for TOO prediction based on RNA sequencing data, including their reported performance on independent data sets and identification of important features. Our review investigates the strengths and weaknesses of different methods, checks the correspondence of their results, and identifies potential issues with datasets used for model training and testing, assessing their potential usefulness in a clinical setting and suggesting future improvements.
Affiliation(s)
- Rosa Karlić
  - Bioinformatics Group, Division of Molecular Biology, Department of Biology, Faculty of Science, University of Zagreb, Zagreb, Croatia
3
Ashraf MT, Hamid I, Nawaz Q, Ali H. Hybrid Approach using Extreme Gradient Boosting (XGBoost) and Evolutionary Algorithm for Cancer Classification. 2023 INTERNATIONAL MULTI-DISCIPLINARY CONFERENCE IN EMERGING RESEARCH TRENDS (IMCERT) 2023. [DOI: 10.1109/imcert57083.2023.10075236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Affiliation(s)
- Isma Hamid
  - National Textile University, Department of Computer Science, Faisalabad, Pakistan
- Qamar Nawaz
  - University of Agriculture, Department of Computer Science, Faisalabad, Pakistan
- Hamid Ali
  - National Textile University, Department of Computer Science, Faisalabad, Pakistan
4
Wang J, Chu H, Pan Y. Prediction of renal damage in children with IgA vasculitis based on machine learning. Medicine (Baltimore) 2022; 101:e31135. [PMID: 36281102 PMCID: PMC9592501 DOI: 10.1097/md.0000000000031135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] [Open Access]
Abstract
This article aims to explore the value of machine learning algorithms in predicting the risk of renal damage in children with IgA vasculitis by constructing a predictive model and analyzing the related risk factors of IgA vasculitis nephritis in children. Case data of 288 hospitalized children with IgA vasculitis from November 2018 to October 2021 were collected. The data included 42 indicators such as demographic characteristics, clinical symptoms and laboratory tests. Univariate feature selection was used for feature extraction, and logistic regression, support vector machine (SVM), decision tree and random forest (RF) algorithms were used separately for classification prediction. Lastly, the performance of the four algorithms was compared using accuracy rate, recall rate and AUC. The accuracy rate, recall rate and AUC of the established RF model were 0.83, 0.86 and 0.91 respectively, which were higher than the 0.74, 0.80 and 0.89 of the logistic regression model, the 0.70, 0.80 and 0.89 of the SVM model, and the 0.74, 0.80 and 0.81 of the decision tree model. The top 10 important features provided by the RF model are: Persistent purpura ≥4 weeks, Cr, Clinic time, ALB, WBC, TC, Relapse, TG, Recurrent purpura and EB-DNA. The model based on the RF algorithm performs better in predicting renal damage in children with IgA vasculitis, as indicated by better classification accuracy, better classification effect and better generalization performance.
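The three metrics used to compare the classifiers in this abstract (accuracy rate, recall rate and AUC) can be illustrated with a short sketch. The labels and scores below are hypothetical and unrelated to the study's 288 patients; AUC is computed here by pairwise comparison of positive and negative scores, which is one standard way to obtain it and not necessarily how the authors computed it.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = renal damage) and model scores, for illustration only.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.6]
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(labels, predictions))  # prints 0.75
print(recall(labels, predictions))    # prints 0.75
print(auc(labels, scores))            # prints 0.9375
```

Accuracy and recall depend on the 0.5 decision threshold assumed here, whereas AUC is computed from the raw scores and does not depend on any threshold.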
Affiliation(s)
- Jinjuan Wang
  - Shandong University of Traditional Chinese Medicine, Shandong, PR China
- Huimin Chu
  - Shandong University of Traditional Chinese Medicine, Shandong, PR China
- Yueli Pan
  - Shandong University of Traditional Chinese Medicine, Shandong, PR China
  - Correspondence: Yueli Pan, Affiliated Hospital of Shandong University of Traditional Chinese Medicine, Jinan, Shandong 250014, PR China
5
Analyzing RNA-Seq Gene Expression Data Using Deep Learning Approaches for Cancer Classification. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12041850] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022]
Abstract
Ribonucleic acid Sequencing (RNA-Seq) analysis is particularly useful for obtaining insights into differentially expressed genes. However, it is challenging because of its high-dimensional data. Such analysis is a tool with which to find underlying patterns in data, e.g., for cancer specific biomarkers. In the past, analyses were performed on RNA-Seq data pertaining to the same cancer class as positive and negative samples, i.e., without samples of other cancer types. To perform multiple cancer type classification and to find differentially expressed genes, data for multiple cancer types need to be analyzed. Several repositories offer RNA-Seq data for various cancer types. In this paper, data from the Mendeley data repository for five cancer types are analyzed. As a first step, RNA-Seq values are converted to 2D images using normalization and zero padding. In the next step, relevant features are extracted and selected using Deep Learning (DL). In the last phase, classification is performed, and eight DL algorithms are used. Results and discussion are based on four different splitting strategies and k-fold cross validation for each DL classifier. Furthermore, a comparative analysis is performed with state of the art techniques discussed in literature. The results demonstrated that classifiers performed best at 70–30 split, and that Convolutional Neural Network (CNN) achieved the best overall results. Hence, CNN is the best DL model for classification among the eight studied DL models, and is easy to implement and simple to understand.
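The first step described in this abstract, converting RNA-Seq values to 2D images using normalization and zero padding, might look roughly like the sketch below. The abstract does not state the normalization scheme or the image dimensions, so min-max scaling and the smallest square that fits the vector are assumptions, and `expression_to_image` is a hypothetical name.

```python
import math


def expression_to_image(values):
    """Scale one sample's expression values to [0, 1] and pack them into a square grid."""
    if not values:
        raise ValueError("values must be non-empty")
    lo, hi = min(values), max(values)
    span = hi - lo
    # Min-max normalization (assumed); a constant vector maps to all zeros.
    scaled = [(v - lo) / span if span else 0.0 for v in values]
    # Smallest side length whose square holds every value.
    side = math.isqrt(len(scaled) - 1) + 1
    # Zero padding fills the unused cells at the end of the grid.
    padded = scaled + [0.0] * (side * side - len(scaled))
    return [padded[r * side:(r + 1) * side] for r in range(side)]


# Hypothetical expression values for five genes in one sample.
image = expression_to_image([2.0, 4.0, 6.0, 8.0, 10.0])
for row in image:
    print(row)
# [0.0, 0.25, 0.5]
# [0.75, 1.0, 0.0]
# [0.0, 0.0, 0.0]
```

Packing values row by row gives neighboring pixels no biological meaning unless the genes are ordered deliberately, so the gene ordering is a design choice that would need to be fixed across all samples.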
6
Xie G, Li J, Gu G, Sun Y, Lin Z, Zhu Y, Wang W. BGMSDDA: a bipartite graph diffusion algorithm with multiple similarity integration for drug-disease association prediction. Mol Omics 2021; 17:997-1011. [PMID: 34610633 DOI: 10.1039/d1mo00237f] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 11/21/2022]
Abstract
Drug repositioning, a method that relies on the information from the original drug-disease association matrix, aims to identify new indications for existing drugs and is expected to greatly reduce the cost and time of drug development. However, most current drug repositioning methods use the original drug-disease association matrix directly, without preconditioning. Because relatively few associations between drugs and diseases have been determined from actual observations, the original drug-disease association matrix used in the prediction is sparse, which affects the performance of the prediction method. A method for mining similar features of drugs and diseases is still lacking. To solve these problems, we developed a bipartite graph diffusion algorithm with multiple similarity integration for drug-disease association prediction (BGMSDDA). First, the weighted K nearest known neighbors (WKNKN) algorithm was used to reconstruct the drug-disease association matrix. Second, an effective method was designed to extract similar characteristics of drugs and diseases by integrating linear neighborhood similarity and Gaussian kernel similarity. Finally, bipartite graph diffusion was used to infer undiscovered drug-disease associations. In 10-fold cross-validation experiments, BGMSDDA showed excellent performance on two datasets, with AUC values of 0.939 (Fdataset) and 0.954 (Cdataset), and AUPR values of 0.466 (Fdataset) and 0.565 (Cdataset). Furthermore, to evaluate the accuracy of the results of BGMSDDA, we conducted case studies on three medically used drugs selected from Fdataset and Cdataset and validated the predicted associated diseases of each drug against several databases. Based on the results obtained, BGMSDDA was demonstrated to be useful for predicting drug-disease associations.
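One of the two similarity measures named in this abstract, Gaussian kernel similarity, can be sketched from association profiles alone. The version below uses the Gaussian interaction profile kernel that is common in this literature, with the bandwidth normalized by the mean squared profile norm; the abstract does not give the authors' exact formulation, so the bandwidth rule, the `gamma_prime` parameter and the toy matrix are all assumptions.

```python
import math


def gaussian_kernel_similarity(profiles, gamma_prime=1.0):
    """Pairwise Gaussian kernel similarity between rows of a binary association matrix."""
    squared_norms = [sum(x * x for x in row) for row in profiles]
    mean_norm = sum(squared_norms) / len(squared_norms)
    if mean_norm == 0:
        raise ValueError("at least one profile must contain a known association")
    # Assumed bandwidth rule: scale gamma by the average squared profile norm.
    gamma = gamma_prime / mean_norm
    n = len(profiles)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dist2 = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * dist2)
    return sim


# Hypothetical drug-by-disease matrix: rows are drugs, columns are diseases.
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
similarity = gaussian_kernel_similarity(associations)
for row in similarity:
    print([round(v, 3) for v in row])
```

Drugs with identical association profiles receive a similarity of 1, and the similarity decays toward 0 as the profiles diverge. Applying the same function to the transposed matrix would give disease-disease similarity.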
Affiliation(s)
- Guobo Xie
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Jianming Li
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Guosheng Gu
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Yuping Sun
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Zhiyi Lin
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Yinting Zhu
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
- Weiming Wang
  - School of Computer Science, Guangdong University of Technology, Guangzhou, China
7
Eshun RB, Kamrul Islam AKM, Bikdash MU. Identification of Significantly Expressed Gene Mutations for Automated Classification of Benign and Malignant Prostate Cancer. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2021; 2021:2437-2443. [PMID: 34891773 DOI: 10.1109/embc46164.2021.9630460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Among males, prostate cancer (Pca) is the cancer type with the highest prevalence and the second leading cause of cancer deaths. The current screening methods for prostate cancer, such as prostate-specific antigen (PSA) testing and digital rectal exam (DRE), lack effectiveness. Machine learning models have been used to predict Pca progression, Gleason score, and laterality. In this research paper, we have employed machine learning techniques such as a Bayesian approach, support vector machines (SVM), decision trees, logistic regression, K-nearest neighbors, random forest and AdaBoost for distinguishing malignant prostate cancers from benign ones. Moreover, different feature extraction strategies are proposed to improve the detection performance and identify potential genomic biomarkers. The results show that the Lasso feature set yielded high performance from the models, with SVM achieving a classification accuracy of 97%. The Lasso and SVM combination reported many significant biomarker genes and gene mutations, including but not restricted to CA2320112, CA2328529, and CA2436168.
8
Kourou K, Exarchos KP, Papaloukas C, Sakaloglou P, Exarchos T, Fotiadis DI. Applied machine learning in cancer research: A systematic review for patient diagnosis, classification and prognosis. Comput Struct Biotechnol J 2021; 19:5546-5555. [PMID: 34712399 PMCID: PMC8523813 DOI: 10.1016/j.csbj.2021.10.006] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 07/20/2021] [Revised: 10/04/2021] [Accepted: 10/04/2021] [Indexed: 02/08/2023] [Open Access]
Abstract
Artificial Intelligence (AI) has recently altered the landscape of cancer research and medical oncology using traditional Machine Learning (ML) algorithms and cutting-edge Deep Learning (DL) architectures. In this review article we focus on the ML aspect of AI applications in cancer research and present the most indicative studies with respect to the ML algorithms and data used. The PubMed and dblp databases were considered to obtain the most relevant research works of the last five years. Based on a comparison of the proposed studies and their research clinical outcomes concerning the medical ML application in cancer research, three main clinical scenarios were identified. We give an overview of the well-known DL and Reinforcement Learning (RL) methodologies, as well as their application in clinical practice, and we briefly discuss Systems Biology in cancer research. We also provide a thorough examination of the clinical scenarios with respect to disease diagnosis, patient classification and cancer prognosis and survival. The most relevant studies identified in the preceding year are presented along with their primary findings. Furthermore, we examine the effective implementation and the main points that need to be addressed in the direction of robustness, explainability and transparency of predictive models. Finally, we summarize the most recent advances in the field of AI/ML applications in cancer research and medical oncology, as well as some of the challenges and open issues that need to be addressed before data-driven models can be implemented in healthcare systems to assist physicians in their daily practice.
Affiliation(s)
- Konstantina Kourou
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
  - Foundation for Research and Technology-Hellas, Institute of Molecular Biology and Biotechnology, Dept. of Biomedical Research, Ioannina GR45110, Greece
- Costas Papaloukas
  - Dept. of Biological Applications and Technology, University of Ioannina, Ioannina, Greece
- Prodromos Sakaloglou
  - Dept. of Precision and Molecular Medicine, Unit of Liquid Biopsy in Oncology, Ioannina University Hospital, Ioannina, Greece
  - Laboratory of Medical Genetics in Clinical Practice, School of Health Sciences, Faculty of Medicine, University of Ioannina, Ioannina, Greece
- Dimitrios I. Fotiadis
  - Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering, University of Ioannina, Ioannina, Greece
  - Foundation for Research and Technology-Hellas, Institute of Molecular Biology and Biotechnology, Dept. of Biomedical Research, Ioannina GR45110, Greece
9
Liu J, Zhang J, Huang H, Wang Y, Zhang Z, Ma Y, He X. A Machine Learning Model to Predict Intravenous Immunoglobulin-Resistant Kawasaki Disease Patients: A Retrospective Study Based on the Chongqing Population. Front Pediatr 2021; 9:756095. [PMID: 34820343 PMCID: PMC8606736 DOI: 10.3389/fped.2021.756095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2021] [Accepted: 10/18/2021] [Indexed: 11/13/2022] [Open Access]
Abstract
Objective: We explored the risk factors for intravenous immunoglobulin (IVIG) resistance in children with Kawasaki disease (KD) and constructed a prediction model based on machine learning algorithms. Methods: A retrospective study including 1,398 KD patients hospitalized in 7 affiliated hospitals of Chongqing Medical University from January 2015 to August 2020 was conducted. All patients were divided into IVIG-responsive and IVIG-resistant groups, which were randomly divided into training and validation sets. The independent risk factors were determined using logistic regression analysis. Logistic regression nomograms, support vector machine (SVM), XGBoost and LightGBM prediction models were constructed and compared with the previous models. Results: In total, 1,240 out of 1,398 patients were IVIG responders, while 158 were resistant to IVIG. According to the results of logistic regression analysis of the training set, four independent risk factors were identified, including total bilirubin (TBIL) (OR = 1.115, 95% CI 1.067-1.165), procalcitonin (PCT) (OR = 1.511, 95% CI 1.270-1.798), alanine aminotransferase (ALT) (OR = 1.013, 95% CI 1.008-1.018) and platelet count (PLT) (OR = 0.998, 95% CI 0.996-1). Logistic regression nomogram, SVM, XGBoost, and LightGBM prediction models were constructed based on the above independent risk factors. The sensitivity was 0.617, 0.681, 0.638, and 0.702, the specificity was 0.712, 0.841, 0.967, and 0.903, and the area under curve (AUC) was 0.731, 0.814, 0.804, and 0.874, respectively. Among the prediction models, the LightGBM model displayed the best ability for comprehensive prediction, with an AUC of 0.874, which surpassed the previous classic models of Egami (AUC = 0.581), Kobayashi (AUC = 0.524), Sano (AUC = 0.519), Fu (AUC = 0.578), and Formosa (AUC = 0.575). Conclusion: The machine learning LightGBM prediction model for IVIG-resistant KD patients was superior to previous models. Our findings may help to accomplish early identification of the risk of IVIG resistance and improve patient outcomes.
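The odds ratios reported in this abstract relate to logistic regression coefficients through OR = exp(β). The sketch below converts the four reported ORs back to coefficients and shows how the odds scale when a predictor changes by several units. The OR values are copied from the abstract; the measurement units are not stated there, so the "10-unit" example is purely illustrative and `odds_multiplier` is a hypothetical helper.

```python
import math

# Odds ratios for the four independent risk factors, as reported in the abstract.
odds_ratios = {"TBIL": 1.115, "PCT": 1.511, "ALT": 1.013, "PLT": 0.998}

# A logistic regression coefficient is the natural log of its odds ratio.
coefficients = {name: math.log(value) for name, value in odds_ratios.items()}


def odds_multiplier(name, delta):
    """Factor by which the odds of IVIG resistance change for a `delta`-unit increase."""
    return math.exp(coefficients[name] * delta)


for name, beta in coefficients.items():
    print(f"{name}: beta = {beta:+.4f}")

# An OR close to 1 per unit can still matter over a wide range of values.
print(f"ALT, 10-unit increase: odds x {odds_multiplier('ALT', 10):.3f}")
print(f"PLT, 100-unit increase: odds x {odds_multiplier('PLT', 100):.3f}")
```

An OR above 1 corresponds to a positive coefficient (higher values raise the odds of resistance) and an OR below 1 to a negative one, which is why PLT is the only factor here with a negative coefficient.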
Affiliation(s)
- Jie Liu
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
- Jian Zhang
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
- Haodong Huang
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
- Yunting Wang
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
- Zuyue Zhang
  - Medical Data Science Academy, Chongqing Medical University, Chongqing, China
- Yunfeng Ma
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
- Xiangqian He
  - School of Medical Informatics, Chongqing Medical University, Chongqing, China
10
Liu J, Xia C, Wang G. Multi-Omics Analysis in Initiation and Progression of Meningiomas: From Pathogenesis to Diagnosis. Front Oncol 2020; 10:1491. [PMID: 32983987 PMCID: PMC7484374 DOI: 10.3389/fonc.2020.01491] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/14/2020] [Accepted: 07/13/2020] [Indexed: 12/31/2022] [Open Access]
Abstract
Meningiomas are common intracranial tumors that can be cured by surgical resection in most cases. However, the most disconcerting are high-grade meningiomas, which frequently recur despite initially successful treatment, eventually conferring a poor prognosis. Therefore, the early diagnosis and classification of meningioma are necessary for subsequent intervention and an improved prognosis. A growing body of evidence demonstrates the potential of multi-omics studies (including genomics, transcriptomics, epigenomics, proteomics) for meningioma diagnosis and for establishing mechanistic links to potential pathological mechanisms. This thesis addresses a neglected aspect of recent advances in the field of meningiomas at multiple omics levels, highlighting that the integration of multi-omics can reveal the mechanism of meningiomas, which provides a timely and necessary scientific basis for the treatment of meningiomas.
Affiliation(s)
- Jiachen Liu
  - Clinical Medicine, Xiangya Medical College of Central South University, Changsha, China
- Congcong Xia
  - Clinical Medicine, Xiangya Medical College of Central South University, Changsha, China
- Gaiqing Wang
  - Department of Neurology, Sanya Central Hospital (The Third People's Hospital of Hainan Province), Sanya, China