1
Wang Z, Wei Z. PT-KGNN: A framework for pre-training biomedical knowledge graphs with graph neural networks. Comput Biol Med 2024; 178:108768. [PMID: 38936076 DOI: 10.1016/j.compbiomed.2024.108768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 05/23/2024] [Accepted: 06/15/2024] [Indexed: 06/29/2024]
Abstract
Biomedical knowledge graphs (KGs) serve as comprehensive data repositories that contain rich information about nodes and edges, providing modeling capabilities for complex relationships among biological entities. Many approaches either learn node features through traditional machine learning methods, or leverage graph neural networks (GNNs) to directly learn features of target nodes in the biomedical KGs and utilize them for downstream tasks. Motivated by the pre-training technique in natural language processing (NLP), we propose a framework named PT-KGNN (Pre-Training the biomedical KG with GNNs) to learn embeddings of nodes in a broader context by applying GNNs on the biomedical KG. We design several experiments to evaluate the effectiveness of our proposed framework and the impact of the scale of KGs. The results on downstream tasks consistently improve as the scale of the biomedical KG used for pre-training increases. Pre-training on large-scale biomedical KGs significantly enhances the drug-drug interaction (DDI) and drug-disease association (DDA) prediction performance on the independent dataset. The embeddings derived from a larger biomedical KG have demonstrated superior performance compared to those obtained from a smaller KG. By applying pre-training techniques on biomedical KGs, rich semantic and structural information can be learned, leading to enhanced performance on downstream tasks. It is evident that pre-training techniques hold tremendous potential and wide-ranging applications in bioinformatics.
Affiliation(s)
- Zhenxing Wang
- School of Data Science, Fudan University, 220 Handan Rd., Shanghai, 200433, China.
- Zhongyu Wei
- School of Data Science, Fudan University, 220 Handan Rd., Shanghai, 200433, China.
2
Wu H, Liu J, Zhang R, Lu Y, Cui G, Cui Z, Ding Y. A review of deep learning methods for ligand based drug virtual screening. FUNDAMENTAL RESEARCH 2024; 4:715-737. [PMID: 39156568 PMCID: PMC11330120 DOI: 10.1016/j.fmre.2024.02.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2023] [Revised: 01/10/2024] [Accepted: 02/18/2024] [Indexed: 08/20/2024] Open
Abstract
Drug discovery is costly and time consuming, and modern drug discovery endeavors are progressively reliant on computational methodologies, aiming to mitigate temporal and financial expenditures associated with the process. In particular, the time required for vaccine and drug discovery is prolonged during emergency situations such as the coronavirus disease 2019 (COVID-19) pandemic. Recently, the performance of deep learning methods in drug virtual screening has been particularly prominent. How to summarize the existing deep learning methods in drug virtual screening, select different models for different drug screening problems, exploit the advantages of deep learning models, and further improve the capability of deep learning in drug virtual screening has become a concern for researchers. This review first introduces the basic concepts of drug virtual screening, common datasets, and data representation methods. Then, a large number of common deep learning methods for drug virtual screening are compared and analyzed. In addition, a dataset of different sizes is constructed independently to evaluate the performance of each deep learning model for the difficult problem of large-scale ligand virtual screening. Finally, the existing challenges and future directions in the field of virtual screening are presented.
Affiliation(s)
- Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Junkai Liu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Runhua Zhang
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Yaoyao Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Guozeng Cui
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Zhiming Cui
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
3
Wang Y, Song J, Dai Q, Duan X. Hierarchical Negative Sampling Based Graph Contrastive Learning Approach for Drug-Disease Association Prediction. IEEE J Biomed Health Inform 2024; 28:3146-3157. [PMID: 38294927 DOI: 10.1109/jbhi.2024.3360437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2024]
Abstract
Predicting potential drug-disease associations (RDAs) plays a pivotal role in elucidating therapeutic strategies for diseases and facilitating drug repositioning, making it of paramount importance. However, existing methods are constrained and rely heavily on limited domain-specific knowledge, impeding their ability to effectively predict candidate associations between drugs and diseases. Moreover, the simplistic definition of unknown information pertaining to drug-disease relationships as negative samples presents inherent limitations. To overcome these challenges, we introduce a novel hierarchical negative sampling-based graph contrastive model, termed HSGCLRDA, which aims to forecast latent associations between drugs and diseases. In this study, HSGCLRDA integrates the association information as well as similarity between drugs, diseases and proteins. Meanwhile, the model constructs a drug-disease-protein heterogeneous network. Subsequently, employing a hierarchical structural sampling technique, we establish reliable negative drug-disease samples utilizing PageRank algorithms. Utilizing meta-path aggregation within the heterogeneous network, we derive low-dimensional representations for drugs and diseases, thereby constructing global and local feature graphs that capture their interactions comprehensively. To obtain representation information, we adopt a self-supervised graph contrastive approach that leverages graph convolutional networks (GCNs) and second-order GCNs to extract feature graph information. Furthermore, we integrate a contrastive cost function derived from the cross-entropy cost function, facilitating holistic model optimization. Experimental results obtained from benchmark datasets not only showcase the superior performance of HSGCLRDA compared to various baseline methods in predicting RDAs but also emphasize its practical utility in identifying novel potential diseases associated with existing drugs through meticulous case studies.
4
Tian C, Wang L, Cui Z, Wu H. GTAMP-DTA: Graph transformer combined with attention mechanism for drug-target binding affinity prediction. Comput Biol Chem 2024; 108:107982. [PMID: 38039800 DOI: 10.1016/j.compbiolchem.2023.107982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Revised: 10/21/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023]
Abstract
Drug-target affinity (DTA) prediction is critical to the success of drug development. While numerous machine learning methods have been developed for this task, there remains a need to further enhance the accuracy and reliability of predictions. Considerable bias in drug-target binding prediction may result from missing structural or other information. In addition, current methods focus only on simulating individual non-covalent interactions between drugs and proteins, thereby neglecting the intricate interplay among different drugs and their interactions with proteins. GTAMP-DTA combines special attention mechanisms, assigning each atom or amino acid an attention vector. Interactions between drug forms and protein forms are considered to capture information about their interactions. A fusion transformer is used to learn protein characterization from raw amino acid sequences, which is then merged with molecular map features extracted from SMILES. A self-supervised pre-trained embedding that uses pre-trained transformers to encode drug and protein attributes is introduced to address the lack of labeled data. Experimental results demonstrate that our model outperforms state-of-the-art methods on both the Davis and KIBA datasets. Additionally, the model's performance is evaluated using three distinct pooling layers (max-pooling, mean-pooling, sum-pooling) along with variations of the attention mechanism. GTAMP-DTA shows significant performance improvements compared to other methods.
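The three pooling choices named in this abstract can be illustrated with a minimal sketch. It assumes each atom (or residue) is already encoded as a fixed-length feature vector; the function name `pool` and the toy values are hypothetical and are not taken from the GTAMP-DTA code.

```python
def pool(node_features, mode):
    """Collapse a list of per-node feature vectors into one vector.

    node_features: list of equal-length lists of floats (one per atom/residue).
    mode: "max", "mean", or "sum" (the three readouts named in the abstract).
    """
    columns = list(zip(*node_features))  # one tuple per feature dimension
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Toy molecule with three atoms and two features per atom (illustrative values).
atoms = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
for mode in ("max", "mean", "sum"):
    print(mode, pool(atoms, mode))
```

Sum-pooling grows with molecule size while mean-pooling does not, which is one plausible reason to compare the variants.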
Affiliation(s)
- Chuangchuang Tian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Luping Wang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China; Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou 215009, China.
5
Wu H, Liu J, Jiang T, Zou Q, Qi S, Cui Z, Tiwari P, Ding Y. AttentionMGT-DTA: A multi-modal drug-target affinity prediction using graph transformer and attention mechanism. Neural Netw 2024; 169:623-636. [PMID: 37976593 DOI: 10.1016/j.neunet.2023.11.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 11/19/2023]
Abstract
The accurate prediction of drug-target affinity (DTA) is a crucial step in drug discovery and design. Traditional experiments are very expensive and time-consuming. Recently, deep learning methods have achieved notable performance improvements in DTA prediction. However, one challenge for deep learning-based models is appropriate and accurate representations of drugs and targets, especially the lack of effective exploration of target representations. Another challenge is how to comprehensively capture the interaction information between different instances, which is also important for predicting DTA. In this study, we propose AttentionMGT-DTA, a multi-modal attention-based model for DTA prediction. AttentionMGT-DTA represents drugs and targets by a molecular graph and binding pocket graph, respectively. Two attention mechanisms are adopted to integrate and interact information between different protein modalities and drug-target pairs. The experimental results showed that our proposed model outperformed state-of-the-art baselines on two benchmark datasets. In addition, AttentionMGT-DTA also had high interpretability by modeling the interaction strength between drug atoms and protein residues. Our code is available at https://github.com/JK-Liu7/AttentionMGT-DTA.
Affiliation(s)
- Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Junkai Liu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China; Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
- Tengsheng Jiang
- Gusu School, Nanjing Medical University, Suzhou, 215009, China.
- Quan Zou
- Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
- Shujie Qi
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Zhiming Cui
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
- Prayag Tiwari
- School of Information Technology, Halmstad University, Sweden.
- Yijie Ding
- Yangtze Delta Region Institute(Quzhou), University of Electronic Science and Technology of China, Quzhou, 324003, China.
6
Li D, Xiao Z, Sun H, Jiang X, Zhao W, Shen X. Prediction of Drug-Disease Associations Based on Multi-Kernel Deep Learning Method in Heterogeneous Graph Embedding. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:120-128. [PMID: 38051617 DOI: 10.1109/tcbb.2023.3339189] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
Computational drug repositioning can identify potential associations between drugs and diseases. This technology has been shown to be effective in accelerating drug development and reducing experimental costs. Although there has been plenty of research for this task, existing methods are deficient in utilizing complex relationships among biological entities, which may not be conducive to subsequent simulation of drug treatment processes. In this article, we propose a heterogeneous graph embedding method called HMLKGAT to infer novel potential drugs for diseases. More specifically, we first construct a heterogeneous information network by combining drug-disease, drug-protein and disease-protein biological networks. Then, a multi-layer graph attention model is utilized to capture the complex associations in the network to derive representations for drugs and diseases. Finally, to maintain the relationship of nodes in different feature spaces, we propose a multi-kernel learning method to transform and combine the representations. Experimental results demonstrate that HMLKGAT outperforms six state-of-the-art methods in drug-related disease prediction, and case studies of five classical drugs further demonstrate the effectiveness of HMLKGAT.
7
Wang Y, Zhang J, Jin J, Wei L. MolCAP: Molecular Chemical reActivity Pretraining and prompted-finetuning enhanced molecular representation learning. Comput Biol Med 2023; 167:107666. [PMID: 37956623 DOI: 10.1016/j.compbiomed.2023.107666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 09/19/2023] [Accepted: 10/31/2023] [Indexed: 11/15/2023]
Abstract
Molecular representation learning (MRL) is a fundamental task for drug discovery. However, previous deep-learning (DL) methods focus excessively on learning robust inner-molecular representations by mask-dominated pretraining frameworks, neglecting abundant chemical reactivity molecular relationships that have been demonstrated as the determining factor for various molecular property prediction tasks. Here, we present MolCAP to promote MRL, a graph-pretraining Transformer based on chemical reactivity (IMR) knowledge with prompted finetuning. Results show that MolCAP outperforms comparative methods based on traditional molecular pretraining frameworks on 13 publicly available molecular datasets across a diversity of biomedical tasks. Prompted by MolCAP, even basic graph neural networks are capable of achieving surprising performance that outperforms previous models, indicating the promising prospect of applying reactivity information to MRL. In addition, manually designed molecular templates have the potential to uncover dataset bias. All in all, we expect MolCAP to provide more chemically meaningful insights for the entire process of drug discovery.
Affiliation(s)
- Yu Wang
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
- Jingjie Zhang
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
- Junru Jin
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China
- Leyi Wei
- School of Software, Shandong University, Jinan, 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, 250101, China.
8
Ding Y, Zhou H, Zou Q, Yuan L. Identification of drug-side effect association via correntropy-loss based matrix factorization with neural tangent kernel. Methods 2023; 219:73-81. [PMID: 37783242 DOI: 10.1016/j.ymeth.2023.09.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 09/18/2023] [Accepted: 09/20/2023] [Indexed: 10/04/2023] Open
Abstract
Adverse drug reactions include side effects, allergic reactions, and secondary infections. Severe adverse reactions can cause cancer, deformity, or mutation. The monitoring of drug side effects is an important support for post-marketing safety supervision of drugs, and an important basis for revising drug instructions. Its purpose is to detect and control drug safety risks in a timely manner. Traditional methods are time-consuming. To accelerate the discovery of side effects, we propose a machine learning based method, called correntropy-loss based matrix factorization with neural tangent kernel (CLMF-NTK), to predict drug side effects. Our method and other computational methods are tested on three benchmark datasets, and the results show that our method achieves the best predictive performance.
Affiliation(s)
- Yijie Ding
- Key Laboratory of Computational Science and Application of Hainan Province, Hainan Normal University, Haikou 571158, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China; School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou 215009, China
- Hongmei Zhou
- Beidahuang Industry Group General Hospital, Harbin 150001, China
- Quan Zou
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China.
- Lei Yuan
- Department of Hepatobiliary Surgery, Quzhou People's Hospital, 100# Minjiang Main Road, Quzhou 324000, China.
9
Ye S, Zhao W, Shen X, Jiang X, He T. An effective multi-task learning framework for drug repurposing based on graph representation learning. Methods 2023; 218:48-56. [PMID: 37516260 DOI: 10.1016/j.ymeth.2023.07.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2023] [Revised: 07/04/2023] [Accepted: 07/20/2023] [Indexed: 07/31/2023] Open
Abstract
Drug repurposing, which typically applies the procedure of drug-disease associations (DDAs) prediction, is a feasible solution to drug discovery. Compared with traditional methods, drug repurposing can reduce the cost and time for drug development and advance the success rate of drug discovery. Although many methods for drug repurposing have been proposed and the obtained results are relatively acceptable, there is still some room for improving the predictive performance, since those methods fail to consider fully the issue of sparseness in known drug-disease associations. In this paper, we propose a novel multi-task learning framework based on graph representation learning to identify DDAs for drug repurposing. In our proposed framework, a heterogeneous information network is first constructed by combining multiple biological datasets. Then, a module consisting of multiple layers of graph convolutional networks is utilized to learn low-dimensional representations of nodes in the constructed heterogeneous information network. Finally, two types of auxiliary tasks are designed to help to train the target task of DDAs prediction in the multi-task learning framework. Comprehensive experiments are conducted on real data and the results demonstrate the effectiveness of the proposed method for drug repurposing.
Affiliation(s)
- Shengwei Ye
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei 430079, PR China; School of Computer, Central China Normal University, Wuhan, Hubei 430079, PR China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, Hubei 430079, PR China
- Weizhong Zhao
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei 430079, PR China; School of Computer, Central China Normal University, Wuhan, Hubei 430079, PR China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, Hubei 430079, PR China.
- Xianjun Shen
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei 430079, PR China; School of Computer, Central China Normal University, Wuhan, Hubei 430079, PR China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, Hubei 430079, PR China
- Xingpeng Jiang
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei 430079, PR China; School of Computer, Central China Normal University, Wuhan, Hubei 430079, PR China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, Hubei 430079, PR China
- Tingting He
- Hubei Provincial Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, Hubei 430079, PR China; School of Computer, Central China Normal University, Wuhan, Hubei 430079, PR China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, Hubei 430079, PR China
10
Ai C, Yang H, Ding Y, Tang J, Guo F. Low Rank Matrix Factorization Algorithm Based on Multi-Graph Regularization for Detecting Drug-Disease Association. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3033-3043. [PMID: 37159322 DOI: 10.1109/tcbb.2023.3274587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/11/2023]
Abstract
Detecting potential associations between drugs and diseases plays an indispensable role in drug development and has become a research hotspot in recent years. Compared with traditional methods, computational approaches have the advantages of fast speed and low cost, which greatly accelerate the prediction of drug-disease associations. In this study, we propose a novel similarity-based method of low-rank matrix decomposition based on multi-graph regularization. On the basis of low-rank matrix factorization with L2 regularization, the multi-graph regularization constraint is constructed by combining a variety of similarity matrices from drugs and diseases, respectively. In the experiments, we analyze the effect of combining different similarities and find that combining all the similarity information on the drug space is unnecessary; only a part of the similarity information is needed to achieve the desired performance. Our method is then compared with other existing models on three datasets (Fdataset, Cdataset and LRSSLdataset) and shows a clear advantage in terms of AUPR. In addition, a case study demonstrates the superior ability of our model to predict potential disease-related drugs. Finally, we compare our model with several methods on six real-world datasets, where it also performs well.
11
Liu Y, Guan S, Jiang T, Fu Q, Ma J, Cui Z, Ding Y, Wu H. DNA protein binding recognition based on lifelong learning. Comput Biol Med 2023; 164:107094. [PMID: 37459792 DOI: 10.1016/j.compbiomed.2023.107094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2023] [Revised: 05/09/2023] [Accepted: 05/27/2023] [Indexed: 09/09/2023]
Abstract
In recent years, research in the field of bioinformatics has focused on predicting the raw sequences of proteins, and some scholars consider DNA-binding protein prediction as a classification task. Many statistical and machine learning-based methods have been widely used in DNA-binding proteins research. The aforementioned methods are indeed more efficient than those based on manual classification, but there is still room for improvement in terms of prediction accuracy and speed. In this study, researchers used Average Blocks, Discrete Cosine Transform, Discrete Wavelet Transform, Global encoding, Normalized Moreau-Broto Autocorrelation and Pseudo position-specific scoring matrix to extract evolutionary features. A dynamic deep network based on lifelong learning architecture was then proposed in order to fuse six features and thus allow for more efficient classification of DNA-binding proteins. The multi-feature fusion allows for a more accurate description of the desired protein information than single features. This model offers a fresh perspective on the dichotomous classification problem in bioinformatics and broadens the application field of lifelong learning. The researchers ran trials on three datasets and contrasted them with other classification techniques to show the model's effectiveness in this study. The findings demonstrated that the model used in this research was superior to other approaches in terms of single-sample specificity (81.0%, 83.0%) and single-sample sensitivity (82.4%, 90.7%), and achieves high accuracy on the benchmark dataset (88.4%, 80.0%, and 76.6%).
Affiliation(s)
- Yongsan Liu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- ShiXuan Guan
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- TengSheng Jiang
- Gusu School, Nanjing Medical University, Suzhou, Jiangsu, China
- Qiming Fu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- Jieming Ma
- School of Intelligent Engineering, Xijiao Liverpool University, Suzhou, 215123, China
- Zhiming Cui
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China
- Yijie Ding
- Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, 215009, China.
12
Qian Y, Shang T, Guo F, Wang C, Cui Z, Ding Y, Wu H. Identification of DNA-binding protein based multiple kernel model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13149-13170. [PMID: 37501482 DOI: 10.3934/mbe.2023586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 07/29/2023]
Abstract
DNA-binding proteins (DBPs) play a critical role in the development of drugs for treating genetic diseases and in DNA biology research. It is essential to predict DNA-binding proteins more accurately and efficiently. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel function is utilized to construct pre-defined kernel metrics. Then, these metrics are combined linearly by weights calculated by LapLKA. Finally, the fused kernel is input to RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on a small dataset and two large datasets. Importantly, we built an online platform to represent our model, which is now freely accessible via http://8.130.69.121:8082/.
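The kernel-fusion step described in this abstract (one RBF kernel per feature set, then a weighted linear combination) can be sketched as follows. This is only an illustration under stated assumptions: the weights here are supplied by hand, whereas the paper computes them with LapLKA, which is not reproduced; `rbf_kernel`, `fuse_kernels`, and the toy feature views are hypothetical names and values.

```python
import math


def rbf_kernel(X, gamma):
    """Return the RBF kernel matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2)."""
    n = len(X)
    return [
        [
            math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            for j in range(n)
        ]
        for i in range(n)
    ]


def fuse_kernels(kernels, weights):
    """Linearly combine kernel matrices using weights normalized to sum to 1."""
    total = sum(weights)
    n = len(kernels[0])
    return [
        [sum(w / total * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Two toy feature views of the same two proteins (illustrative values only).
views = [
    [[0.0, 0.0], [1.0, 0.0]],  # view 1: two features per protein
    [[0.0], [2.0]],            # view 2: one feature per protein
]
kernels = [rbf_kernel(X, gamma=1.0) for X in views]
# Uniform weights are an assumption; LapLKA would learn these instead.
fused = fuse_kernels(kernels, weights=[1.0, 1.0])
```

Because a non-negative weighted sum of positive semidefinite kernels is itself positive semidefinite, the fused matrix can be passed to any kernel method in place of a single kernel.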
Affiliation(s)
- Yuqing Qian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Tingting Shang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
- Chunliang Wang
- The Second Affiliated Hospital of Soochow University, Suzhou, China
- Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
13
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of the important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. Computational methods based on machine learning have provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize the kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set. An efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
|
14
|
Fan R, Suo B, Ding Y. Identification of Vesicle Transport Proteins via Hypergraph Regularized K-Local Hyperplane Distance Nearest Neighbour Model. Front Genet 2022; 13:960388. [PMID: 35910197 PMCID: PMC9326258 DOI: 10.3389/fgene.2022.960388] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2022] [Accepted: 06/22/2022] [Indexed: 12/04/2022] Open
Abstract
The prediction of protein function is a common topic in the field of bioinformatics. In recent years, advances in machine learning have inspired a growing number of algorithms for predicting protein function. A large number of parameters and fairly complex neural networks are often used to improve the prediction performance, an approach that is time-consuming and costly. In this study, we leverage traditional features and machine learning classifiers to boost the performance of vesicle transport protein identification and make the prediction process faster. We adopt the pseudo position-specific scoring matrix (PsePSSM) feature and our proposed classifier, hypergraph regularized k-local hyperplane distance nearest neighbour (HG-HKNN), to classify vesicular transport proteins. We address dataset imbalance with random undersampling. The results show that our strategy achieves an area under the receiver operating characteristic curve (AUC) of 0.870 and a Matthews correlation coefficient (MCC) of 0.53 on the benchmark dataset, outperforming all state-of-the-art methods on the same dataset, and the other metrics of our model are also comparable to existing methods.
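Two ingredients named in this abstract, random undersampling and the Matthews correlation coefficient, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `random_undersample` and `mcc`, the 0/1 label encoding, and the fixed seed are assumptions made for the example.

```python
import math
import random


def random_undersample(labels, seed=0):
    """Return sorted indices that keep every minority-class sample and an
    equally sized random subset of the majority class (labels are 0 or 1)."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient computed from binary confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

The undersampling step returns indices rather than copies of the data, so the same selection can be applied to a feature matrix and its label vector.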
Affiliation(s)
- Rui Fan
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
- Bing Suo
  - Beidahuang Industry Group General Hospital, Harbin, China
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
|
15
|
Ai C, Yang H, Ding Y, Tang J, Guo F. A multi-layer multi-kernel neural network for determining associations between non-coding RNAs and diseases. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2022.04.068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/22/2022]
|
16
|
Ding Y, Tang J, Guo F, Zou Q. Identification of drug-target interactions via multiple kernel-based triple collaborative matrix factorization. Brief Bioinform 2022; 23:6520305. [PMID: 35134117 DOI: 10.1093/bib/bbab582] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 10/21/2021] [Revised: 12/02/2021] [Accepted: 12/19/2021] [Indexed: 12/15/2022] Open
Abstract
Targeted drugs have been applied to the treatment of cancer on a large scale, with therapeutic effects in some patients. Detecting drug-target interactions (DTIs) through biochemical experiments is time-consuming. Machine learning (ML) is now widely applied in large-scale drug screening, but few methods fuse multiple sources of information. We propose a multiple kernel-based triple collaborative matrix factorization (MK-TCMF) method to predict DTIs. Multiple kernel matrices (containing chemical, biological and clinical information) are integrated via a multi-kernel learning (MKL) algorithm. The original adjacency matrix of DTIs is decomposed into three matrices: the latent feature matrix of the drug space, the latent feature matrix of the target space, and the bi-projection matrix that joins the two feature spaces. To obtain better prediction performance, the MKL algorithm adjusts the weight of each kernel matrix according to the prediction error. The weights of drug side-effects and target sequence are the highest. Compared with other computational methods, our model performs better on four test data sets.
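The three-matrix decomposition described in this abstract can be pictured with a small Python sketch that rebuilds a drug-target score matrix from its factors. It shows only the shape of the factorization, not the MK-TCMF training procedure or its kernel terms. The names `A` (drug latent features), `Theta` (bi-projection matrix), `B` (target latent features), and all helper functions are assumptions introduced here.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def reconstruct(A, Theta, B):
    """Predicted scores: A (drugs x k) times Theta (k x l) times B^T (l x targets)."""
    return matmul(matmul(A, Theta), transpose(B))


def squared_error(Y, Y_hat):
    """Sum of squared entrywise differences (squared Frobenius norm of Y - Y_hat)."""
    return sum((y - h) ** 2 for ry, rh in zip(Y, Y_hat) for y, h in zip(ry, rh))


# Toy factors (hypothetical values): 3 drugs, 2 targets, 2 latent dimensions each.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Theta = [[0.0, 1.0], [1.0, 0.0]]
B = [[1.0, 0.0], [0.0, 1.0]]

Y_hat = reconstruct(A, Theta, B)  # 3 x 2 matrix of predicted interaction scores
```

In a fitted model the three factors would be learned by minimizing a reconstruction error such as `squared_error` against the observed adjacency matrix, typically together with regularization terms.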
Affiliation(s)
- Yijie Ding
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, P.R. China
- Jijun Tang
  - Department of Computational Science and Engineering, University of South Carolina, Columbia, U.S.
- Fei Guo
  - School of Computer Science and Engineering, Central South University, Changsha, P.R. China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, P.R. China
|
17
|
Jia Y, Huang S, Zhang T. KK-DBP: A Multi-Feature Fusion Method for DNA-Binding Protein Identification Based on Random Forest. Front Genet 2021; 12:811158. [PMID: 34912382 PMCID: PMC8667860 DOI: 10.3389/fgene.2021.811158] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/08/2021] [Accepted: 11/15/2021] [Indexed: 02/04/2023] Open
Abstract
A DNA-binding protein (DBP) is a protein with a special DNA-binding domain that is associated with many important molecular biological mechanisms. The rapid development of computational methods has made it possible to predict DBPs on a large scale; however, existing methods do not fully integrate DBP-related features, resulting in imprecise predictions. In this article, we develop a DNA-binding protein identification method called KK-DBP. To improve prediction accuracy, we propose a feature extraction method that fuses multiple PSSM features. The experimental results show a prediction accuracy of 81.22% on the independent test dataset PDB186, which is the highest of all existing methods.
Affiliation(s)
- Yuran Jia
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Tianjiao Zhang
  - College of Information and Computer Engineering, Northeast Forestry University, Harbin, China
|
18
|
Lin X. Genomic Variation Prediction: A Summary From Different Views. Front Cell Dev Biol 2021; 9:795883. [PMID: 34901036 PMCID: PMC8656232 DOI: 10.3389/fcell.2021.795883] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/15/2021] [Accepted: 11/11/2021] [Indexed: 12/02/2022] Open
Abstract
Structural variations in the genome are closely related to human health and the occurrence and development of various diseases. To understand the mechanisms of diseases, find pathogenic targets, and carry out personalized precision medicine, it is critical to detect such variations. The rapid development of high-throughput sequencing technologies has accelerated the accumulation of large amounts of genomic mutation data, including synonymous mutations. Identifying pathogenic synonymous mutations that play important roles in the occurrence and development of diseases from all the available mutation data is of great importance. In this paper, machine learning theories and methods are reviewed, efficient and accurate pathogenic synonymous mutation prediction methods are developed, and a standardized three-level variant analysis framework is constructed. In addition, multiple variation tolerance prediction models are studied and integrated, and new ideas for structural variation detection based on deep information mining are explored.
Affiliation(s)
- Xiuchun Lin
  - College of Information and Electrical Engineering, China Agricultural University, Beijing, China
|