1
|
Chen J, Yang X, Wu H. A Multibranch Neural Network for Drug-Target Affinity Prediction Using Similarity Information. ACS OMEGA 2024; 9:35978-35989. [PMID: 39184467 PMCID: PMC11339836 DOI: 10.1021/acsomega.4c05607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/15/2024] [Revised: 08/03/2024] [Accepted: 08/06/2024] [Indexed: 08/27/2024]
Abstract
Predicting drug-target affinity (DTA) is beneficial for accelerating drug discovery. In recent years, graph structure-based deep learning models have garnered significant attention in this field. However, these models typically handle drug or target protein in isolation and only extract the molecular structure information on the drug or protein itself. To address this limitation, existing network-based models represent drug-target interactions or affinities as a knowledge graph to capture the interaction information. In this study, we propose a novel solution. Specifically, we introduce drug similarity information and protein similarity information into the field of DTA prediction. Moreover, we propose a network framework that autonomously extracts similarity information, avoiding reliance on knowledge graphs. Based on this framework, we design a multibranch neural network called GASI-DTA. This network integrates similarity information, sequence information, and molecular structure information. Comprehensive experimental results conducted on two benchmark data sets and three cold-start scenarios demonstrate that our model outperforms state-of-the-art graph structure-based methods in nearly all metrics. Furthermore, it exhibits significant advantages over existing network-based models, outperforming the best of them in the majority of metrics. Our study's code and data are openly accessible at http://github.com/XiaoLin-Yang-S/GASI-DTA.
Collapse
Affiliation(s)
- Jing Chen
- School
of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
- Jiangsu
Provincial Engineering Laboratory of Pattern Recognition and Computing
Intelligence, Jiangnan University, Wuxi 214122, China
| | - Xiaolin Yang
- School
of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
| | - Haoyu Wu
- School
of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
| |
Collapse
|
2
|
Jisna VA, Ajay AP, Jayaraj PB. Using Attention-UNet Models to Predict Protein Contact Maps. J Comput Biol 2024; 31:691-702. [PMID: 38979621 DOI: 10.1089/cmb.2023.0102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/10/2024] Open
Abstract
Proteins are essential to life, and understanding their intrinsic roles requires determining their structure. The field of proteomics has opened up new opportunities by applying deep learning algorithms to large databases of solved protein structures. With the availability of large data sets and advanced machine learning methods, the prediction of protein residue interactions has greatly improved. Protein contact maps provide empirical evidence of the interacting residue pairs within a protein sequence. Template-free protein structure prediction systems rely heavily on this information. This article proposes UNet-CON, an attention-integrated UNet architecture, trained to predict residue-residue contacts in protein sequences. With the predicted contacts being more accurate than state-of-the-art methods on the PDB25 test set, the model paves the way for the development of more powerful deep learning algorithms for predicting protein residue interactions.
Collapse
Affiliation(s)
- V A Jisna
- Department of Computer Science and Engineering, Indian Institute of Information Technology Design and Manufacturing, Kurnool, India
| | | | - P B Jayaraj
- Department of Computer Science and Engineering, NIT Calicut, Calicut, India
| |
Collapse
|
3
|
Zhang H, Liu X, Cheng W, Wang T, Chen Y. Prediction of drug-target binding affinity based on deep learning models. Comput Biol Med 2024; 174:108435. [PMID: 38608327 DOI: 10.1016/j.compbiomed.2024.108435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 04/05/2024] [Accepted: 04/07/2024] [Indexed: 04/14/2024]
Abstract
The prediction of drug-target binding affinity (DTA) plays an important role in drug discovery. Computerized virtual screening techniques have been used for DTA prediction, greatly reducing the time and economic costs of drug discovery. However, these techniques have not succeeded in reversing the low success rate of new drug development. In recent years, the continuous development of deep learning (DL) technology has brought new opportunities for drug discovery through the DTA prediction. This shift has moved the prediction of DTA from traditional machine learning methods to DL. The DL frameworks used for DTA prediction include convolutional neural networks (CNN), graph convolutional neural networks (GCN), and recurrent neural networks (RNN), and reinforcement learning (RL), among others. This review article summarizes the available literature on DTA prediction using DL models, including DTA quantification metrics and datasets, and DL algorithms used for DTA prediction (including input representation of models, neural network frameworks, valuation indicators, and model interpretability). In addition, the opportunities, challenges, and prospects of the application of DL frameworks for DTA prediction in the field of drug discovery are discussed.
Collapse
Affiliation(s)
- Hao Zhang
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Xiaoqian Liu
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Wenya Cheng
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Tianshi Wang
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China
| | - Yuanyuan Chen
- College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
| |
Collapse
|
4
|
Gorantla R, Kubincová A, Weiße AY, Mey ASJS. From Proteins to Ligands: Decoding Deep Learning Methods for Binding Affinity Prediction. J Chem Inf Model 2024; 64:2496-2507. [PMID: 37983381 PMCID: PMC11005465 DOI: 10.1021/acs.jcim.3c01208] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Revised: 10/26/2023] [Accepted: 10/27/2023] [Indexed: 11/22/2023]
Abstract
Accurate in silico prediction of protein-ligand binding affinity is important in the early stages of drug discovery. Deep learning-based methods exist but have yet to overtake more conventional methods such as giga-docking largely due to their lack of generalizability. To improve generalizability, we need to understand what these models learn from input protein and ligand data. We systematically investigated a sequence-based deep learning framework to assess the impact of protein and ligand encodings on predicting binding affinities for commonly used kinase data sets. The role of proteins is studied using convolutional neural network-based encodings obtained from sequences and graph neural network-based encodings enriched with structural information from contact maps. Ligand-based encodings are generated from graph-neural networks. We test different ligand perturbations by randomizing node and edge properties. For proteins, we make use of 3 different protein contact generation methods (AlphaFold2, Pconsc4, and ESM-1b) and compare these with a random control. Our investigation shows that protein encodings do not substantially impact the binding predictions, with no statistically significant difference in binding affinity for KIBA in the investigated metrics (concordance index, Pearson's R Spearman's Rank, and RMSE). Significant differences are seen for ligand encodings with random ligands and random ligand node properties, suggesting a much bigger reliance on ligand data for the learning tasks. Using different ways to combine protein and ligand encodings did not show a significant change in performance.
Collapse
Affiliation(s)
- Rohan Gorantla
- School
of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, U.K.
- EaStCHEM
School of Chemistry, University of Edinburgh, Edinburgh, EH9 3FJ, U.K.
| | | | - Andrea Y. Weiße
- School
of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, U.K.
- School
of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3FF, U.K.
| | | |
Collapse
|
5
|
Sun Y, Li YY, Leung CK, Hu P. iNGNN-DTI: prediction of drug-target interaction with interpretable nested graph neural network and pretrained molecule models. Bioinformatics 2024; 40:btae135. [PMID: 38449285 PMCID: PMC10957515 DOI: 10.1093/bioinformatics/btae135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2023] [Revised: 12/31/2023] [Accepted: 03/05/2024] [Indexed: 03/08/2024] Open
Abstract
MOTIVATION Drug-target interaction (DTI) prediction aims to identify interactions between drugs and protein targets. Deep learning can automatically learn discriminative features from drug and protein target representations for DTI prediction, but challenges remain, making it an open question. Existing approaches encode drugs and targets into features using deep learning models, but they often lack explanations for underlying interactions. Moreover, limited labeled DTIs in the chemical space can hinder model generalization. RESULTS We propose an interpretable nested graph neural network for DTI prediction (iNGNN-DTI) using pre-trained molecule and protein models. The analysis is conducted on graph data representing drugs and targets by using a specific type of nested graph neural network, in which the target graphs are created based on 3D structures using Alphafold2. This architecture is highly expressive in capturing substructures of the graph data. We use a cross-attention module to capture interaction information between the substructures of drugs and targets. To improve feature representations, we integrate features learned by models that are pre-trained on large unlabeled small molecule and protein datasets, respectively. We evaluate our model on three benchmark datasets, and it shows a consistent improvement on all baseline models in all datasets. We also run an experiment with previously unseen drugs or targets in the test set, and our model outperforms all of the baselines. Furthermore, the iNGNN-DTI can provide more insights into the interaction by visualizing the weights learned by the cross-attention module. AVAILABILITY AND IMPLEMENTATION The source code of the algorithm is available at https://github.com/syan1992/iNGNN-DTI.
Collapse
Affiliation(s)
- Yan Sun
- Department of Biochemistry, Western University, London, ON, N6G 2V4, Canada
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
- Department of Computer Science, Western University, London, ON, N6G 2V4, Canada
| | - Yan Yi Li
- Division of Biostatistics, University of Toronto, Toronto, ON, M5T 3M7, Canada
| | - Carson K Leung
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
| | - Pingzhao Hu
- Department of Biochemistry, Western University, London, ON, N6G 2V4, Canada
- Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada
- Department of Computer Science, Western University, London, ON, N6G 2V4, Canada
- Division of Biostatistics, University of Toronto, Toronto, ON, M5T 3M7, Canada
- Department of Oncology, Western University, London, ON, N6G 2V4, Canada
- Department of Epidemiology and Biostatistics, Western University, London, ON, N6G 2V4, Canada
- The Children’s Health Research Institute, Lawson Health Research Institute, London, ON, N6A 4V2, Canada
| |
Collapse
|
6
|
Wei Q, Wang R, Jiang Y, Wei L, Sun Y, Geng J, Su R. ConPep: Prediction of peptide contact maps with pre-trained biological language model and multi-view feature extracting strategy. Comput Biol Med 2023; 167:107631. [PMID: 37948966 DOI: 10.1016/j.compbiomed.2023.107631] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 10/16/2023] [Accepted: 10/23/2023] [Indexed: 11/12/2023]
Abstract
The accurate prediction of peptide contact maps remains a challenging task due to the difficulty in obtaining the interactive information between residues on short sequences. To address this challenge, we propose ConPep, a deep learning framework designed for predicting the contact map of peptides based on sequences only. To sufficiently incorporate the sequential semantic information between residues in peptide sequences, we use a pre-trained biological language model and transfer prior knowledge from large scale databases. Additionally, to extract and integrate sequential local information and residue-based global correlations, our model incorporates Bidirectional Gated Recurrent Unit and attention mechanisms. They can obtain multi-view features and thus enhance the accuracy and robustness of our prediction. Comparative results on independent tests demonstrate that our proposed method significantly outperforms state-of-the-art methods even with short peptides. Notably, our method exhibits superior performance at the sequence level, suggesting the robust ability of our model compared with the multiple sequence alignment (MSA) analysis-based methods. We expect it can be meaningful research for facilitating the wide use of our method.
Collapse
Affiliation(s)
- Qingxin Wei
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Ruheng Wang
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Yi Jiang
- School of Software, Shandong University, Jinan, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan, China
| | - Leyi Wei
- School of Software, Shandong University, Jinan, China; Centre for Artificial Intelligence driven Drug Discovery, Faculty of Applied Science, Macao Polytechnic University, Macao SAR, China
| | - Yu Sun
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Jie Geng
- Department of Cardiology, Tianjin Chest Hospital, Tianjin, China.
| | - Ran Su
- College of Intelligence and Computing, Tianjin University, Tianjin, China.
| |
Collapse
|
7
|
Wang J, Xiao Y, Shang X, Peng J. Predicting drug-target binding affinity with cross-scale graph contrastive learning. Brief Bioinform 2023; 25:bbad516. [PMID: 38221904 PMCID: PMC10788681 DOI: 10.1093/bib/bbad516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Revised: 12/04/2023] [Accepted: 12/07/2023] [Indexed: 01/16/2024] Open
Abstract
Identifying the binding affinity between a drug and its target is essential in drug discovery and repurposing. Numerous computational approaches have been proposed for understanding these interactions. However, most existing methods only utilize either the molecular structure information of drugs and targets or the interaction information of drug-target bipartite networks. They may fail to combine the molecule-scale and network-scale features to obtain high-quality representations. In this study, we propose CSCo-DTA, a novel cross-scale graph contrastive learning approach for drug-target binding affinity prediction. The proposed model combines features learned from the molecular scale and the network scale to capture information from both local and global perspectives. We conducted experiments on two benchmark datasets, and the proposed model outperformed existing state-of-art methods. The ablation experiment demonstrated the significance and efficacy of multi-scale features and cross-scale contrastive learning modules in improving the prediction performance. Moreover, we applied the CSCo-DTA to predict the novel potential targets for Erlotinib and validated the predicted targets with the molecular docking analysis.
Collapse
Affiliation(s)
- Jingru Wang
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
| | - Yihang Xiao
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
| | - Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
| | - Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
- Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
- The National Engineering Laboratory for Integrated Aerospace-Ground-Ocean Big Data Application Technology, Xi’an, 710072, China
- Research and Development Institute of Northwestern Polytechnical University in Shenzhen, Shenzhen, 518000, China
| |
Collapse
|
8
|
Suviriyapaisal N, Wichadakul D. iEdgeDTA: integrated edge information and 1D graph convolutional neural networks for binding affinity prediction. RSC Adv 2023; 13:25218-25228. [PMID: 37636509 PMCID: PMC10448119 DOI: 10.1039/d3ra03796g] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2023] [Accepted: 08/14/2023] [Indexed: 08/29/2023] Open
Abstract
Artificial intelligence has become more prevalent in broad fields, including drug discovery, in which the process is costly and time-consuming when conducted through wet experiments. As a result, drug repurposing, which tries to utilize approved and low-risk drugs for a new purpose, becomes more attractive. However, screening candidates from many drugs for specific protein targets is still expensive and tedious. This study aims to leverage computational resources to aid drug discovery by utilizing drug-protein interaction data and estimating their interaction strength, so-called binding affinity. Our estimation approach addresses multiple challenges encountered in the field. First, we employed a graph-based deep learning technique to overcome the limitations of drug compounds represented in string format by incorporating background knowledge of node and edge information as separate multi-dimensional features. Second, we tackled the complexities associated with extracting the representation and structure of proteins by utilizing a pre-trained model for feature extraction. Also, we employed graph operations over the 1D representation of a protein sequence to overcome the fixed-length problem typically encountered in language model tasks. In addition, we conducted a comparative analysis with a baseline model that creates a protein graph from a contact map prediction model, giving valuable insights into the performance and effectiveness of our proposed method. We evaluated the performance of our model using the same benchmark datasets with a variety of matrices as other previous work, and the results show that our model achieved the best prediction results while requiring no contact map information compared to other graph-based methods.
Collapse
Affiliation(s)
- Natchanon Suviriyapaisal
- Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University Bangkok 10330 Thailand
| | - Duangdao Wichadakul
- Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University Bangkok 10330 Thailand
- Center of Excellence in Systems Biology, Faculty of Medicine, Chulalongkorn University Bangkok 10330 Thailand
| |
Collapse
|
9
|
Qian Y, Shang T, Guo F, Wang C, Cui Z, Ding Y, Wu H. Identification of DNA-binding protein based multiple kernel model. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13149-13170. [PMID: 37501482 DOI: 10.3934/mbe.2023586] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
DNA-binding proteins (DBPs) play a critical role in the development of drugs for treating genetic diseases and in DNA biology research. It is essential for predicting DNA-binding proteins more accurately and efficiently. In this paper, a Laplacian Local Kernel Alignment-based Restricted Kernel Machine (LapLKA-RKM) is proposed to predict DBPs. In detail, we first extract features from the protein sequence using six methods. Second, the Radial Basis Function (RBF) kernel function is utilized to construct pre-defined kernel metrics. Then, these metrics are combined linearly by weights calculated by LapLKA. Finally, the fused kernel is input to RKM for training and prediction. Independent tests and leave-one-out cross-validation were used to validate the performance of our method on a small dataset and two large datasets. Importantly, we built an online platform to represent our model, which is now freely accessible via http://8.130.69.121:8082/.
Collapse
Affiliation(s)
- Yuqing Qian
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tingting Shang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Chunliang Wang
- The Second Affiliated Hospital of Soochow University, Suzhou, China
| | - Zhiming Cui
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| |
Collapse
|
10
|
He H, Chen G, Chen CYC. NHGNN-DTA: a node-adaptive hybrid graph neural network for interpretable drug-target binding affinity prediction. Bioinformatics 2023; 39:btad355. [PMID: 37252835 PMCID: PMC10287904 DOI: 10.1093/bioinformatics/btad355] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Revised: 04/21/2023] [Accepted: 05/30/2023] [Indexed: 06/01/2023] Open
Abstract
MOTIVATION Large-scale prediction of drug-target affinity (DTA) plays an important role in drug discovery. In recent years, machine learning algorithms have made great progress in DTA prediction by utilizing sequence or structural information of both drugs and proteins. However, sequence-based algorithms ignore the structural information of molecules and proteins, while graph-based algorithms are insufficient in feature extraction and information interaction. RESULTS In this article, we propose NHGNN-DTA, a node-adaptive hybrid neural network for interpretable DTA prediction. It can adaptively acquire feature representations of drugs and proteins and allow information to interact at the graph level, effectively combining the advantages of both sequence-based and graph-based approaches. Experimental results have shown that NHGNN-DTA achieved new state-of-the-art performance. It achieved the mean squared error (MSE) of 0.196 on the Davis dataset (below 0.2 for the first time) and 0.124 on the KIBA dataset (3% improvement). Meanwhile, in the case of cold start scenario, NHGNN-DTA proved to be more robust and more effective with unseen inputs than baseline methods. Furthermore, the multi-head self-attention mechanism endows the model with interpretability, providing new exploratory insights for drug discovery. The case study on Omicron variants of SARS-CoV-2 illustrates the efficient utilization of drug repurposing in COVID-19. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/hehh77/NHGNN-DTA.
Collapse
Affiliation(s)
- Haohuai He
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, P.R. China
| | - Guanxing Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, P.R. China
| | - Calvin Yu-Chian Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, Guangdong 518107, P.R. China
- Department of Medical Research, China Medical University Hospital, Taichung 40447, Taiwan
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung 41354, Taiwan
| |
Collapse
|
11
|
Wang PH, Zhu YH, Yang X, Yu DJ. GCmapCrys: Integrating graph attention network with predicted contact map for multi-stage protein crystallization propensity prediction. Anal Biochem 2023; 663:115020. [PMID: 36521558 DOI: 10.1016/j.ab.2022.115020] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2022] [Revised: 12/05/2022] [Accepted: 12/10/2022] [Indexed: 12/14/2022]
Abstract
X-ray crystallography is the major approach for atomic-level protein structure determination. Since not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity is critical to guiding the experimental design and improving the success rate of X-ray crystallography experiments. In this work, we proposed a new deep learning pipeline, GCmapCrys, for multi-stage crystallization propensity prediction through integrating graph attention network with predicted protein contact map. Experimental results on 1548 proteins with known crystallization records demonstrated that GCmapCrys increased the value of Matthew's correlation coefficient by 37.0% in average compared to state-of-the-art protein crystallization propensity predictors. Detailed analyses show that the major advantages of GCmapCrys lie in the efficiency of the graph attention network with predicted contact map, which effectively associates the residue-interaction knowledge with crystallization pattern. Meanwhile, the designed four sequence-based features can be complementary to further enhance crystallization propensity proprediction.
Collapse
Affiliation(s)
- Peng-Hao Wang
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
| | - Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China
| | - Xibei Yang
- School of Computer, Jiangsu University of Science and Technology, Zhenjiang, 212100, PR China
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, PR China.
| |
Collapse
|
12
|
Ma D, Li S, Chen Z. Drug-target binding affinity prediction method based on a deep graph neural network. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:269-282. [PMID: 36650765 DOI: 10.3934/mbe.2023012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
The development of new drugs is a long and costly process, Computer-aided drug design reduces development costs while computationally shortening the new drug development cycle, in which DTA (Drug-Target binding Affinity) prediction is a key step to screen out potential drugs. With the development of deep learning, various types of deep learning models have achieved notable performance in a wide range of fields. Most current related studies focus on extracting the sequence features of molecules while ignoring the valuable structural information; they employ sequence data that represent only the elemental composition of molecules without considering the molecular structure maps that contain structural information. In this paper, we use graph neural networks to predict DTA based on corresponding graph data of drugs and proteins, and we achieve competitive performance on two benchmark datasets, Davis and KIBA. In particular, an MSE of 0.227 and CI of 0.895 were obtained on Davis, and an MSE of 0.127 and CI of 0.903 were obtained on KIBA.
Collapse
Affiliation(s)
- Dong Ma
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| | - Shuang Li
- Beidahuang Industry Group General Hospital, Harbin, China
| | - Zhihua Chen
- Institute of Computing Science and Technology, Guangzhou University, Guangzhou, China
| |
Collapse
|
13
|
Mufassirin MMM, Newton MAH, Sattar A. Artificial intelligence for template-free protein structure prediction: a comprehensive review. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10350-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022]
|
14
|
Hierarchical graph representation learning for the prediction of drug-target binding affinity. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.09.043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
|
15
|
Zhao Q, Yang M, Cheng Z, Li Y, Wang J. Biomedical Data and Deep Learning Computational Models for Predicting Compound-Protein Relations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2092-2110. [PMID: 33769935 DOI: 10.1109/tcbb.2021.3069040] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
The identification of compound-protein relations (CPRs), which includes compound-protein interactions (CPIs) and compound-protein affinities (CPAs), is critical to drug development. A common method for compound-protein relation identification is the use of in vitro screening experiments. However, the number of compounds and proteins is massive, and in vitro screening experiments are labor-intensive, expensive, and time-consuming with high failure rates. Researchers have developed a computational field called virtual screening (VS) to aid experimental drug development. These methods utilize experimentally validated biological interaction information to generate datasets and use the physicochemical and structural properties of compounds and target proteins as input information to train computational prediction models. At present, deep learning has been widely used in computer vision and natural language processing and has experienced epoch-making progress. At the same time, deep learning has also been used in the field of biomedicine widely, and the prediction of CPRs based on deep learning has developed rapidly and has achieved good results. The purpose of this study is to investigate and discuss the latest applications of deep learning techniques in CPR prediction. First, we describe the datasets and feature engineering (i.e., compound and protein representations and descriptors) commonly used in CPR prediction methods. Then, we review and classify recent deep learning approaches in CPR prediction. Next, a comprehensive comparison is performed to demonstrate the prediction performance of representative methods on classical datasets. Finally, we discuss the current state of the field, including the existing challenges and our proposed future directions. We believe that this investigation will provide sufficient references and insight for researchers to understand and develop new deep learning methods to enhance CPR predictions.
Collapse
|
16
|
Jiang M, Wang S, Zhang S, Zhou W, Zhang Y, Li Z. Sequence-based drug-target affinity prediction using weighted graph neural networks. BMC Genomics 2022; 23:449. [PMID: 35715739 PMCID: PMC9205061 DOI: 10.1186/s12864-022-08648-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2022] [Accepted: 05/23/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Affinity prediction between molecule and protein is an important step of virtual screening, which is usually called drug-target affinity (DTA) prediction. Its accuracy directly influences the progress of drug development. Sequence-based drug-target affinity prediction can predict the affinity according to protein sequence, which is fast and can be applied to large datasets. However, due to the lack of protein structure information, the accuracy needs to be improved. RESULTS The proposed model which is called WGNN-DTA can be competent in drug-target affinity (DTA) and compound-protein interaction (CPI) prediction tasks. Various experiments are designed to verify the performance of the proposed method in different scenarios, which proves that WGNN-DTA has the advantages of simplicity and high accuracy. Moreover, because it does not need complex steps such as multiple sequence alignment (MSA), it has fast execution speed, and can be suitable for the screening of large databases. CONCLUSION We construct protein and molecular graphs through sequence and SMILES that can effectively reflect their structures. To utilize the detail contact information of protein, graph neural network is used to extract features and predict the binding affinity based on the graphs, which is called weighted graph neural networks drug-target affinity predictor (WGNN-DTA). The proposed method has the advantages of simplicity and high accuracy.
Collapse
Affiliation(s)
- Mingjian Jiang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, 266525, China
| | - Shuang Wang
- College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
| | - Shugang Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao, 266100, China
| | - Wei Zhou
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, 266525, China
| | - Yuanyuan Zhang
- School of Information and Control Engineering, Qingdao University of Technology, Qingdao, 266525, China
| | - Zhen Li
- College of Computer Science and Technology, Qingdao University, Qingdao, 266071, China.
| |
Collapse
|
17
|
Zhang H, Huang Y, Bei Z, Ju Z, Meng J, Hao M, Zhang J, Zhang H, Xi W. Inter-Residue Distance Prediction From Duet Deep Learning Models. Front Genet 2022; 13:887491. [PMID: 35651930 PMCID: PMC9148999 DOI: 10.3389/fgene.2022.887491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2022] [Accepted: 03/30/2022] [Indexed: 12/04/2022] Open
Abstract
Residue distance prediction from the sequence is critical for many biological applications such as protein structure reconstruction, protein–protein interaction prediction, and protein design. However, prediction of fine-grained distances between residues with long sequence separations still remains challenging. In this study, we propose DuetDis, a method based on duet feature sets and deep residual network with squeeze-and-excitation (SE), for protein inter-residue distance prediction. DuetDis embraces the ability to learn and fuse features directly or indirectly extracted from the whole-genome/metagenomic databases and, therefore, minimize the information loss through ensembling models trained on different feature sets. We evaluate DuetDis and 11 widely used peer methods on a large-scale test set (610 proteins chains). The experimental results suggest that 1) prediction results from different feature sets show obvious differences; 2) ensembling different feature sets can improve the prediction performance; 3) high-quality multiple sequence alignment (MSA) used for both training and testing can greatly improve the prediction performance; and 4) DuetDis is more accurate than peer methods for the overall prediction, more reliable in terms of model prediction score, and more robust against shallow multiple sequence alignment (MSA).
Collapse
Affiliation(s)
- Huiling Zhang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Ying Huang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhendong Bei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhen Ju
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jintao Meng
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Min Hao
- College of Electronic and Information Engineering, Southwest University, Chongqing, China
| | - Jingjing Zhang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Haiping Zhang
- University of Chinese Academy of Sciences, Beijing, China
| | - Wenhui Xi
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- University of Chinese Academy of Sciences, Beijing, China
- *Correspondence: Wenhui Xi,
| |
Collapse
|
18
|
Graph neural network approaches for drug-target interactions. Curr Opin Struct Biol 2022; 73:102327. [DOI: 10.1016/j.sbi.2021.102327] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 11/22/2021] [Accepted: 12/13/2021] [Indexed: 01/06/2023]
|
19
|
Du BX, Qin Y, Jiang YF, Xu Y, Yiu SM, Yu H, Shi JY. Compound–protein interaction prediction by deep learning: Databases, descriptors and models. Drug Discov Today 2022; 27:1350-1366. [DOI: 10.1016/j.drudis.2022.02.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2021] [Revised: 11/19/2021] [Accepted: 02/28/2022] [Indexed: 11/24/2022]
|
20
|
Lin E, Lin CH, Lane HY. De Novo Peptide and Protein Design Using Generative Adversarial Networks: An Update. J Chem Inf Model 2022; 62:761-774. [DOI: 10.1021/acs.jcim.1c01361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Affiliation(s)
- Eugene Lin
- Department of Biostatistics, University of Washington, Seattle, Washington 98195, United States
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, United States
- Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
| | - Chieh-Hsin Lin
- Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
- Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Chang Gung University College of Medicine, Kaohsiung 83301, Taiwan
- School of Medicine, Chang Gung University, Taoyuan 33302, Taiwan
| | - Hsien-Yuan Lane
- Graduate Institute of Biomedical Sciences, China Medical University, Taichung 40402, Taiwan
- Department of Psychiatry, China Medical University Hospital, Taichung 40447, Taiwan
- Brain Disease Research Center, China Medical University Hospital, Taichung 40447, Taiwan
- Department of Psychology, College of Medical and Health Sciences, Asia University, Taichung 41354, Taiwan
| |
Collapse
|
21
|
Lu W, Zhou N, Ding Y, Wu H, Zhang Y, Fu Q, Li H. Application of DNA-Binding Protein Prediction Based on Graph Convolutional Network and Contact Map. BIOMED RESEARCH INTERNATIONAL 2022; 2022:9044793. [PMID: 35083336 PMCID: PMC8786515 DOI: 10.1155/2022/9044793] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 12/24/2021] [Indexed: 11/24/2022]
Abstract
DNA contains the genetic information for the synthesis of proteins and RNA, and it is an indispensable substance in living organisms. DNA-binding proteins are an enzyme, which can bind with DNA to produce complex proteins, and play an important role in the functions of a variety of biological molecules. With the continuous development of deep learning, the introduction of deep learning into DNA-binding proteins for prediction is conducive to improving the speed and accuracy of DNA-binding protein recognition. In this study, the features and structures of proteins were used to obtain their representations through graph convolutional networks. A protein prediction model based on graph convolutional network and contact map was proposed. The method had some advantages by testing various indexes of PDB14189 and PDB2272 on the benchmark dataset.
Collapse
Affiliation(s)
- Weizhong Lu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China
| | - Nan Zhou
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Hongjie Wu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yu Zhang
- Suzhou Industrial Park Institute of Services Outsourcing, Suzhou, China
| | - Qiming Fu
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Haiou Li
- Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China
| |
Collapse
|
22
|
Hou Q, Pucci F, Pan F, Xue F, Rooman M, Feng Q. Using metagenomic data to boost protein structure prediction and discovery. Comput Struct Biotechnol J 2022; 20:434-442. [PMID: 35070166 PMCID: PMC8760478 DOI: 10.1016/j.csbj.2021.12.030] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2021] [Revised: 12/17/2021] [Accepted: 12/21/2021] [Indexed: 11/19/2022] Open
Abstract
Over the past decade, metagenomic sequencing approaches have been providing an ever-increasing amount of protein sequence data at an astonishing rate. These constitute an invaluable source of information which has been exploited in various research fields such as the study of the role of the gut microbiota in human diseases and aging. However, only a small fraction of all metagenomic sequences collected have been functionally or structurally characterized, leaving much of them completely unexplored. Here, we review how this information has been used in protein structure prediction and protein discovery. We begin by presenting some widely used metagenomic databases and analyze in detail how metagenomic data has contributed to the impressive improvement in the accuracy of structure prediction methods in recent years. We then examine how metagenomic information can be exploited to annotate protein sequences. More specifically, we focus on the role of metagenomes in the discovery of enzymes and new CRISPR-Cas systems, and in the identification of antibiotic resistance genes. With this review, we provide an overview of how metagenomic data is currently revolutionizing our understanding of protein science.
Collapse
Affiliation(s)
- Qingzhen Hou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Fengming Pan
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
| | - Fuzhong Xue
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Shandong 250012, China
- National Institute of Health Data Science of China, Shandong University, Shandong 250002, China
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Interuniversity Institute of Bioinformatics in Brussels, 1050 Brussels, Belgium
| | - Qiang Feng
- Shandong Provincial Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Department of Human Microbiome, School of Stomatology, Shandong University, Jinan, Shandong Province 250012, China
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao, Shandong Province 266237, China
| |
Collapse
|
23
|
Sudha G, Bassot C, Lamb J, Shu N, Huang Y, Elofsson A. The evolutionary history of topological variations in the CPA/AT transporters. PLoS Comput Biol 2021; 17:e1009278. [PMID: 34403419 PMCID: PMC8396727 DOI: 10.1371/journal.pcbi.1009278] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2020] [Revised: 08/27/2021] [Accepted: 07/14/2021] [Indexed: 11/23/2022] Open
Abstract
CPA/AT transporters are made up of scaffold and a core domain. The core domain contains two non-canonical helices (broken or reentrant) that mediate the transport of ions, amino acids or other charged compounds. During evolution, these transporters have undergone substantial changes in structure, topology and function. To shed light on these structural transitions, we create models for all families using an integrated topology annotation method. We find that the CPA/AT transporters can be classified into four fold-types based on their structure; (1) the CPA-broken fold-type, (2) the CPA-reentrant fold-type, (3) the BART fold-type, and (4) a previously not described fold-type, the Reentrant-Helix-Reentrant fold-type. Several topological transitions are identified, including the transition between a broken and reentrant helix, one transition between a loop and a reentrant helix, complete changes of orientation, and changes in the number of scaffold helices. These transitions are mainly caused by gene duplication and shuffling events. Structural models, topology information and other details are presented in a searchable database, CPAfold (cpafold.bioinfo.se).
Collapse
Affiliation(s)
- Govindarajan Sudha
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - Claudio Bassot
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - John Lamb
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - Nanjiang Shu
- Bioinformatics Short-term Support and Infrastructure (BILS), Science for Life Laboratory, Sweden
| | - Yan Huang
- Science for Life Laboratory, Karolinska Institutet, Stockholm University, Solna, Sweden
| | - Arne Elofsson
- Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| |
Collapse
|
24
|
Mulnaes D, Golchin P, Koenig F, Gohlke H. TopDomain: Exhaustive Protein Domain Boundary Metaprediction Combining Multisource Information and Deep Learning. J Chem Theory Comput 2021; 17:4599-4613. [PMID: 34161735 DOI: 10.1021/acs.jctc.1c00129] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
Protein domains are independent, functional, and stable structural units of proteins. Accurate protein domain boundary prediction plays an important role in understanding protein structure and evolution, as well as for protein structure prediction. Current domain boundary prediction methods differ in terms of boundary definition, methodology, and training databases resulting in disparate performance for different proteins. We developed TopDomain, an exhaustive metapredictor, that uses deep neural networks to combine multisource information from sequence- and homology-based features of over 50 primary predictors. For this purpose, we developed a new domain boundary data set termed the TopDomain data set, in which the true annotations are informed by SCOPe annotations, structural domain parsers, human inspection, and deep learning. We benchmark TopDomain against 2484 targets with 3354 boundaries from the TopDomain test set and achieve F1 scores of 78.4% and 73.8% for multidomain boundary prediction within ±20 residues and ±10 residues of the true boundary, respectively. When examined on targets from CASP11-13 competitions, TopDomain achieves F1 scores of 47.5% and 42.8% for multidomain proteins. TopDomain significantly outperforms 15 widely used, state-of-the-art ab initio and homology-based domain boundary predictors. Finally, we implemented TopDomainTMC, which accurately predicts whether domain parsing is necessary for the target protein.
Collapse
Affiliation(s)
- Daniel Mulnaes
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
| | - Pegah Golchin
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
| | - Filip Koenig
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
| | - Holger Gohlke
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany.,John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Biochemistry) & Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
| |
Collapse
|
25
|
Reza MS, Zhang H, Hossain MT, Jin L, Feng S, Wei Y. COMTOP: Protein Residue-Residue Contact Prediction through Mixed Integer Linear Optimization. MEMBRANES 2021; 11:membranes11070503. [PMID: 34209399 PMCID: PMC8305966 DOI: 10.3390/membranes11070503] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/25/2021] [Revised: 06/24/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
Protein contact prediction helps reconstruct the tertiary structure that greatly determines a protein’s function; therefore, contact prediction from the sequence is an important problem. Recently there has been exciting progress on this problem, but many of the existing methods are still low quality of prediction accuracy. In this paper, we present a new mixed integer linear programming (MILP)-based consensus method: a Consensus scheme based On a Mixed integer linear opTimization method for prOtein contact Prediction (COMTOP). The MILP-based consensus method combines the strengths of seven selected protein contact prediction methods, including CCMpred, EVfold, DeepCov, NNcon, PconsC4, plmDCA, and PSICOV, by optimizing the number of correctly predicted contacts and achieving a better prediction accuracy. The proposed hybrid protein residue–residue contact prediction scheme was tested in four independent test sets. For 239 highly non-redundant proteins, the method showed a prediction accuracy of 59.68%, 70.79%, 78.86%, 89.04%, 94.51%, and 97.35% for top-5L, top-3L, top-2L, top-L, top-L/2, and top-L/5 contacts, respectively. When tested on the CASP13 and CASP14 test sets, the proposed method obtained accuracies of 75.91% and 77.49% for top-L/5 predictions, respectively. COMTOP was further tested on 57 non-redundant α-helical transmembrane proteins and achieved prediction accuracies of 64.34% and 73.91% for top-L/2 and top-L/5 predictions, respectively. For all test datasets, the improvement of COMTOP in accuracy over the seven individual methods increased with the increasing number of predicted contacts. For example, COMTOP performed much better for large number of contact predictions (such as top-5L and top-3L) than for small number of contact predictions such as top-L/2 and top-L/5. The results and analysis demonstrate that COMTOP can significantly improve the performance of the individual methods; therefore, COMTOP is more robust against different types of test sets. COMTOP also showed better/comparable predictions when compared with the state-of-the-art predictors.
Collapse
Affiliation(s)
- Md. Selim Reza
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; (M.S.R.); (H.Z.); (M.T.H.)
- Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
| | - Huiling Zhang
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; (M.S.R.); (H.Z.); (M.T.H.)
- Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
| | - Md. Tofazzal Hossain
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; (M.S.R.); (H.Z.); (M.T.H.)
- Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
| | - Langxi Jin
- Department of Computer Science and Technology, School of Computer Science and Technology, Harbin University of Science and Technology, 52 Xuefu Road, Nangang District, Harbin 150080, China;
| | - Shengzhong Feng
- Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
| | - Yanjie Wei
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China; (M.S.R.); (H.Z.); (M.T.H.)
- Centre for High Performance Computing, Joint Engineering Research Center for Health Big Data Intelligent Analysis Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
- Correspondence:
| |
Collapse
|
26
|
Druchok M, Yarish D, Garkot S, Nikolaienko T, Gurbych O. Ensembling machine learning models to boost molecular affinity prediction. Comput Biol Chem 2021; 93:107529. [PMID: 34192653 DOI: 10.1016/j.compbiolchem.2021.107529] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2021] [Revised: 06/01/2021] [Accepted: 06/08/2021] [Indexed: 02/01/2023]
Abstract
This study unites six popular machine learning approaches to enhance the prediction of a molecular binding affinity between receptors (large protein molecules) and ligands (small organic molecules). Here we examine a scheme where affinity of ligands is predicted against a single receptor - human thrombin, thus, the models consider ligand features only. However, the suggested approach can be repurposed for other receptors. The methods include Support Vector Machine, Random Forest, CatBoost, feed-forward neural network, graph neural network, and Bidirectional Encoder Representations from Transformers. The first five methods use input features based on physico-chemical properties of molecules, while the last one is based on textual molecular representations. All approaches do not rely on atomic spatial coordinates, avoiding a potential bias from known structures, and are capable of generalizing for compounds with unknown conformations. Within each of the methods, we have trained two models that solve classification and regression tasks. Then, all models are grouped into a pipeline of two subsequent ensembles. The first ensemble aggregates six classification models which vote whether a ligand binds to a receptor or not. If a ligand is classified as active (i.e., binds), the second ensemble predicts its binding affinity in terms of the inhibition constant Ki.
Collapse
Affiliation(s)
- Maksym Druchok
- SoftServe, Inc., 2d Sadova Str., 79021 Lviv, Ukraine; Institute for Condensed Matter Physics, NAS of Ukraine, 1 Svientsitskii Str., 79011 Lviv, Ukraine.
| | | | - Sofiya Garkot
- SoftServe, Inc., 2d Sadova Str., 79021 Lviv, Ukraine; Ukrainian Catholic University, 17 Svientsitskii Str., 79011 Lviv, Ukraine
| | - Tymofii Nikolaienko
- SoftServe, Inc., 2d Sadova Str., 79021 Lviv, Ukraine; Taras Shevchenko National University of Kyiv, 64/13, Volodymyrska Str., 01601 Kyiv, Ukraine
| | - Oleksandr Gurbych
- SoftServe, Inc., 2d Sadova Str., 79021 Lviv, Ukraine; Lviv Polytechnic National University, 5 Kniazia Romana Str., 79005 Lviv, Ukraine
| |
Collapse
|
27
|
Suh D, Lee JW, Choi S, Lee Y. Recent Applications of Deep Learning Methods on Evolution- and Contact-Based Protein Structure Prediction. Int J Mol Sci 2021; 22:6032. [PMID: 34199677 PMCID: PMC8199773 DOI: 10.3390/ijms22116032] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2021] [Revised: 05/29/2021] [Accepted: 05/29/2021] [Indexed: 01/23/2023] Open
Abstract
The new advances in deep learning methods have influenced many aspects of scientific research, including the study of the protein system. The prediction of proteins' 3D structural components is now heavily dependent on machine learning techniques that interpret how protein sequences and their homology govern the inter-residue contacts and structural organization. Especially, methods employing deep neural networks have had a significant impact on recent CASP13 and CASP14 competition. Here, we explore the recent applications of deep learning methods in the protein structure prediction area. We also look at the potential opportunities for deep learning methods to identify unknown protein structures and functions to be discovered and help guide drug-target interactions. Although significant problems still need to be addressed, we expect these techniques in the near future to play crucial roles in protein structural bioinformatics as well as in drug discovery.
Collapse
Affiliation(s)
- Donghyuk Suh
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Jai Woo Lee
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Sun Choi
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Yoonji Lee
- College of Pharmacy, Chung-Ang University, Seoul 06974, Korea
| |
Collapse
|
28
|
Protein Structure Prediction: Conventional and Deep Learning Perspectives. Protein J 2021; 40:522-544. [PMID: 34050498 DOI: 10.1007/s10930-021-10003-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/21/2021] [Indexed: 10/21/2022]
Abstract
Protein structure prediction is a way to bridge the sequence-structure gap, one of the main challenges in computational biology and chemistry. Predicting any protein's accurate structure is of paramount importance for the scientific community, as these structures govern their function. Moreover, this is one of the complicated optimization problems that computational biologists have ever faced. Experimental protein structure determination methods include X-ray crystallography, Nuclear Magnetic Resonance Spectroscopy and Electron Microscopy. All of these are tedious and time-consuming procedures that require expertise. To make the process less cumbersome, scientists use predictive tools as part of computational methods, using data consolidated in the protein repositories. In recent years, machine learning approaches have raised the interest of the structure prediction community. Most of the machine learning approaches for protein structure prediction are centred on co-evolution based methods. The accuracy of these approaches depends on the number of homologous protein sequences available in the databases. The prediction problem becomes challenging for many proteins, especially those without enough sequence homologs. Deep learning methods allow for the extraction of intricate features from protein sequence data without making any intuitions. Accurately predicted protein structures are employed for drug discovery, antibody designs, understanding protein-protein interactions, and interactions with other molecules. This article provides a review of conventional and deep learning approaches in protein structure prediction. We conclude this review by outlining a few publicly available datasets and deep learning architectures currently employed for protein structure prediction tasks.
Collapse
|
29
|
Pakhrin SC, Shrestha B, Adhikari B, KC DB. Deep Learning-Based Advances in Protein Structure Prediction. Int J Mol Sci 2021; 22:5553. [PMID: 34074028 PMCID: PMC8197379 DOI: 10.3390/ijms22115553] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2021] [Revised: 05/12/2021] [Accepted: 05/18/2021] [Indexed: 12/29/2022] Open
Abstract
Obtaining an accurate description of protein structure is a fundamental step toward understanding the underpinning of biology. Although recent advances in experimental approaches have greatly enhanced our capabilities to experimentally determine protein structures, the gap between the number of protein sequences and known protein structures is ever increasing. Computational protein structure prediction is one of the ways to fill this gap. Recently, the protein structure prediction field has witnessed a lot of advances due to Deep Learning (DL)-based approaches as evidenced by the success of AlphaFold2 in the most recent Critical Assessment of protein Structure Prediction (CASP14). In this article, we highlight important milestones and progresses in the field of protein structure prediction due to DL-based methods as observed in CASP experiments. We describe advances in various steps of protein structure prediction pipeline viz. protein contact map prediction, protein distogram prediction, protein real-valued distance prediction, and Quality Assessment/refinement. We also highlight some end-to-end DL-based approaches for protein structure prediction approaches. Additionally, as there have been some recent DL-based advances in protein structure determination using Cryo-Electron (Cryo-EM) microscopy based, we also highlight some of the important progress in the field. Finally, we provide an outlook and possible future research directions for DL-based approaches in the protein structure prediction arena.
Collapse
Affiliation(s)
- Subash C. Pakhrin
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA;
| | - Bikash Shrestha
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA;
| | - Badri Adhikari
- Department of Computer Science, University of Missouri-St. Louis, St. Louis, MO 63121, USA;
| | - Dukka B. KC
- Department of Electrical Engineering and Computer Science, Wichita State University, Wichita, KS 67260, USA;
| |
Collapse
|
30
|
Zhang H, Bei Z, Xi W, Hao M, Ju Z, Saravanan KM, Zhang H, Guo N, Wei Y. Evaluation of residue-residue contact prediction methods: From retrospective to prospective. PLoS Comput Biol 2021; 17:e1009027. [PMID: 34029314 PMCID: PMC8177648 DOI: 10.1371/journal.pcbi.1009027] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2020] [Revised: 06/04/2021] [Accepted: 04/28/2021] [Indexed: 12/31/2022] Open
Abstract
Sequence-based residue contact prediction plays a crucial role in protein structure reconstruction. In recent years, the combination of evolutionary coupling analysis (ECA) and deep learning (DL) techniques has made tremendous progress for residue contact prediction, thus a comprehensive assessment of current methods based on a large-scale benchmark data set is very needed. In this study, we evaluate 18 contact predictors on 610 non-redundant proteins and 32 CASP13 targets according to a wide range of perspectives. The results show that different methods have different application scenarios: (1) DL methods based on multi-categories of inputs and large training sets are the best choices for low-contact-density proteins such as the intrinsically disordered ones and proteins with shallow multi-sequence alignments (MSAs). (2) With at least 5L (L is sequence length) effective sequences in the MSA, all the methods show the best performance, and methods that rely only on MSA as input can reach comparable achievements as methods that adopt multi-source inputs. (3) For top L/5 and L/2 predictions, DL methods can predict more hydrophobic interactions while ECA methods predict more salt bridges and disulfide bonds. (4) ECA methods can detect more secondary structure interactions, while DL methods can accurately excavate more contact patterns and prune isolated false positives. In general, multi-input DL methods with large training sets dominate current approaches with the best overall performance. Despite the great success of current DL methods must be stated the fact that there is still much room left for further improvement: (1) With shallow MSAs, the performance will be greatly affected. (2) Current methods show lower precisions for inter-domain compared with intra-domain contact predictions, as well as very high imbalances in precisions between intra-domains. (3) Strong prediction similarities between DL methods indicating more feature types and diversified models need to be developed. (4) The runtime of most methods can be further optimized. The amino acid sequence of a protein ultimately determines its tertiary structure, and the tertiary structure determines its function(s) and plays a key role in understanding biological processes and disease pathogenesis. Protein tertiary structure can be determined using experimental techniques such as cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography, which are very expensive and time-consuming. As an alternative, researchers are trying to use in silico methods to predict the 3D structures. Residue contact-assisted protein folding paves an avenue for sequence-based protein structure prediction and therefore has become one of the most challenging and promising problems in structural bioinformatics. Over the past years, contact prediction has undergone continuous evolution in techniques. Through a retrospective analysis of traditional machine learning /evolutionary coupling analysis methods/ consensus machine learning methods and a multi-perspective study on recently developed deep learning methods, we explore the most advanced contact predictors, pursue application scenarios for different methods, and seek prospective directions for further improvement. We anticipate that our study will serve as a practical and useful guide for the development of future approaches to contact prediction.
Collapse
Affiliation(s)
- Huiling Zhang
- University of Chinese Academy of Sciences, Beijing, China
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Zhendong Bei
- Cloud Computing Department, Alibaba Group, Hangzhou, China
| | - Wenhui Xi
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Min Hao
- College of Electronic and Information Engineering, Southwest University, Chongqing, China
| | - Zhen Ju
- University of Chinese Academy of Sciences, Beijing, China
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Konda Mani Saravanan
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Haiping Zhang
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Ning Guo
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Yanjie Wei
- University of Chinese Academy of Sciences, Beijing, China
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- * E-mail:
| |
Collapse
|
31
|
Kimber TB, Chen Y, Volkamer A. Deep Learning in Virtual Screening: Recent Applications and Developments. Int J Mol Sci 2021; 22:4435. [PMID: 33922714 PMCID: PMC8123040 DOI: 10.3390/ijms22094435] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2021] [Revised: 04/13/2021] [Accepted: 04/14/2021] [Indexed: 01/03/2023] Open
Abstract
Drug discovery is a cost and time-intensive process that is often assisted by computational methods, such as virtual screening, to speed up and guide the design of new compounds. For many years, machine learning methods have been successfully applied in the context of computer-aided drug discovery. Recently, thanks to the rise of novel technologies as well as the increasing amount of available chemical and bioactivity data, deep learning has gained a tremendous impact in rational active compound discovery. Herein, recent applications and developments of machine learning, with a focus on deep learning, in virtual screening for active compound design are reviewed. This includes introducing different compound and protein encodings, deep learning techniques as well as frequently used bioactivity and benchmark data sets for model training and testing. Finally, the present state-of-the-art, including the current challenges and emerging problems, are examined and discussed.
Collapse
Affiliation(s)
| | | | - Andrea Volkamer
- In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany; (T.B.K.); (Y.C.)
| |
Collapse
|
32
|
Accurate contact-based modelling of repeat proteins predicts the structure of new repeats protein families. PLoS Comput Biol 2021; 17:e1008798. [PMID: 33857128 PMCID: PMC8078820 DOI: 10.1371/journal.pcbi.1008798] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2020] [Revised: 04/27/2021] [Accepted: 02/15/2021] [Indexed: 12/18/2022] Open
Abstract
Repeat proteins are abundant in eukaryotic proteomes. They are involved in many eukaryotic specific functions, including signalling. For many of these proteins, the structure is not known, as they are difficult to crystallise. Today, using direct coupling analysis and deep learning it is often possible to predict a protein’s structure. However, the unique sequence features present in repeat proteins have been a challenge to use direct coupling analysis for predicting contacts. Here, we show that deep learning-based methods (trRosetta, DeepMetaPsicov (DMP) and PconsC4) overcomes this problem and can predict intra- and inter-unit contacts in repeat proteins. In a benchmark dataset of 815 repeat proteins, about 90% can be correctly modelled. Further, among 48 PFAM families lacking a protein structure, we produce models of forty-one families with estimated high accuracy. Repeat proteins are widespread among organisms and particularly abundant in eukaryotic proteomes. Their primary sequence presents repetition in the amino acid sequences that origin structures with repeated folds/domains. Although the repeated units often can be recognised from the sequence alone, often structural information is missing. Here, we used contact prediction for predicting the structure of repeats protein directly from their primary sequences. We benchmark the methods on a dataset comprehensive of all the known repeated structures. We evaluate the contact predictions and the obtained models for different classes of repeat proteins. Further, we develop and benchmark a quality assessment (QA) method specific for repeat proteins. Finally, we used the prediction pipeline for all PFAM repeat families without resolved structures and found that forty-one of them could be modelled with high accuracy.
Collapse
|
33
|
Hameduh T, Haddad Y, Adam V, Heger Z. Homology modeling in the time of collective and artificial intelligence. Comput Struct Biotechnol J 2020; 18:3494-3506. [PMID: 33304450 PMCID: PMC7695898 DOI: 10.1016/j.csbj.2020.11.007] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2020] [Revised: 11/04/2020] [Accepted: 11/04/2020] [Indexed: 12/12/2022] Open
Abstract
Homology modeling is a method for building protein 3D structures using protein primary sequence and utilizing prior knowledge gained from structural similarities with other proteins. The homology modeling process is done in sequential steps where sequence/structure alignment is optimized, then a backbone is built and later, side-chains are added. Once the low-homology loops are modeled, the whole 3D structure is optimized and validated. In the past three decades, a few collective and collaborative initiatives allowed for continuous progress in both homology and ab initio modeling. Critical Assessment of protein Structure Prediction (CASP) is a worldwide community experiment that has historically recorded the progress in this field. Folding@Home and Rosetta@Home are examples of crowd-sourcing initiatives where the community is sharing computational resources, whereas RosettaCommons is an example of an initiative where a community is sharing a codebase for the development of computational algorithms. Foldit is another initiative where participants compete with each other in a protein folding video game to predict 3D structure. In the past few years, contact maps deep machine learning was introduced to the 3D structure prediction process, adding more information and increasing the accuracy of models significantly. In this review, we will take the reader in a journey of exploration from the beginnings to the most recent turnabouts, which have revolutionized the field of homology modeling. Moreover, we discuss the new trends emerging in this rapidly growing field.
Collapse
Affiliation(s)
- Tareq Hameduh
- Department of Chemistry and Biochemistry, Mendel University in Brno, Zemedelska 1, CZ-613 00 Brno, Czech Republic
| | - Yazan Haddad
- Department of Chemistry and Biochemistry, Mendel University in Brno, Zemedelska 1, CZ-613 00 Brno, Czech Republic
- Central European Institute of Technology, Brno University of Technology, Purkynova 656/123, 612 00 Brno, Czech Republic
| | - Vojtech Adam
- Department of Chemistry and Biochemistry, Mendel University in Brno, Zemedelska 1, CZ-613 00 Brno, Czech Republic
- Central European Institute of Technology, Brno University of Technology, Purkynova 656/123, 612 00 Brno, Czech Republic
| | - Zbynek Heger
- Department of Chemistry and Biochemistry, Mendel University in Brno, Zemedelska 1, CZ-613 00 Brno, Czech Republic
- Central European Institute of Technology, Brno University of Technology, Purkynova 656/123, 612 00 Brno, Czech Republic
| |
Collapse
|
34
|
Muscat M, Croce G, Sarti E, Weigt M. FilterDCA: Interpretable supervised contact prediction using inter-domain coevolution. PLoS Comput Biol 2020; 16:e1007621. [PMID: 33035205 PMCID: PMC7577475 DOI: 10.1371/journal.pcbi.1007621] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2019] [Revised: 10/21/2020] [Accepted: 08/20/2020] [Indexed: 12/03/2022] Open
Abstract
Predicting three-dimensional protein structure and assembling protein complexes using sequence information belongs to the most prominent tasks in computational biology. Recently substantial progress has been obtained in the case of single proteins using a combination of unsupervised coevolutionary sequence analysis with structurally supervised deep learning. While reaching impressive accuracies in predicting residue-residue contacts, deep learning has a number of disadvantages. The need for large structural training sets limits the applicability to multi-protein complexes; and their deep architecture makes the interpretability of the convolutional neural networks intrinsically hard. Here we introduce FilterDCA, a simpler supervised predictor for inter-domain and inter-protein contacts. It is based on the fact that contact maps of proteins show typical contact patterns, which results from secondary structure and are reflected by patterns in coevolutionary analysis. We explicitly integrate averaged contacts patterns with coevolutionary scores derived by Direct Coupling Analysis, improving performance over standard coevolutionary analysis, while remaining fully transparent and interpretable. The FilterDCA code is available at http://gitlab.lcqb.upmc.fr/muscat/FilterDCA. The de novo prediction of tertiary and quaternary protein structures has recently seen important advances, by combining unsupervised, purely sequence-based coevolutionary analyses with structure-based supervision using deep learning for contact-map prediction. While showing impressive performance, deep-learning methods require large training sets and pose severe obstacles for their interpretability. Here we construct a simple, transparent and therefore fully interpretable inter-domain contact predictor, which uses the results of coevolutionary Direct Coupling Analysis in combination with explicitly constructed filters reflecting typical contact patterns in a training set of known protein structures, and which improves the accuracy of predicted contacts significantly. Our approach thereby sheds light on the question how contact information is encoded in coevolutionary signals.
Collapse
Affiliation(s)
- Maureen Muscat
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative – LCQB, 75005 Paris, France
| | - Giancarlo Croce
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative – LCQB, 75005 Paris, France
| | - Edoardo Sarti
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative – LCQB, 75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Biologie Computationnelle et Quantitative – LCQB, 75005 Paris, France
- * E-mail:
| |
Collapse
|
35
|
Zhang C, Zheng W, Mortuza SM, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics 2020; 36:2105-2112. [PMID: 31738385 DOI: 10.1093/bioinformatics/btz863] [Citation(s) in RCA: 110] [Impact Index Per Article: 27.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2019] [Revised: 10/17/2019] [Accepted: 11/15/2019] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. RESULTS We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. AVAILABILITY AND IMPLEMENTATION https://zhanglab.ccmb.med.umich.edu/DeepMSA/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA.,Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
Collapse
|
36
|
Guo Z, Hou J, Cheng J. DNSS2: Improved ab initio protein secondary structure prediction using advanced deep learning architectures. Proteins 2020; 89:207-217. [PMID: 32893403 DOI: 10.1002/prot.26007] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2020] [Revised: 07/07/2020] [Accepted: 09/02/2020] [Indexed: 12/27/2022]
Abstract
Accurate prediction of protein secondary structure (alpha-helix, beta-strand and coil) is a crucial step for protein inter-residue contact prediction and ab initio tertiary structure prediction. In a previous study, we developed a deep belief network-based protein secondary structure method (DNSS1) and successfully advanced the prediction accuracy beyond 80%. In this work, we developed multiple advanced deep learning architectures (DNSS2) to further improve secondary structure prediction. The major improvements over the DNSS1 method include (a) designing and integrating six advanced one-dimensional deep convolutional/recurrent/residual/memory/fractal/inception networks to predict 3-state and 8-state secondary structure, and (b) using more sensitive profile features inferred from Hidden Markov model (HMM) and multiple sequence alignment (MSA). Most of the deep learning architectures are novel for protein secondary structure prediction. DNSS2 was systematically benchmarked on independent test data sets with eight state-of-art tools and consistently ranked as one of the best methods. Particularly, DNSS2 was tested on the protein targets of 2018 CASP13 experiment and achieved the Q3 score of 81.62%, SOV score of 72.19%, and Q8 score of 73.28%. DNSS2 is freely available at: https://github.com/multicom-toolbox/DNSS2.
Collapse
Affiliation(s)
- Zhiye Guo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
37
|
Sun J, Frishman D. DeepHelicon: Accurate prediction of inter-helical residue contacts in transmembrane proteins by residual neural networks. J Struct Biol 2020; 212:107574. [PMID: 32663598 DOI: 10.1016/j.jsb.2020.107574] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2020] [Revised: 07/03/2020] [Accepted: 07/07/2020] [Indexed: 01/16/2023]
Abstract
Accurate prediction of amino acid residue contacts is an important prerequisite for generating high-quality 3D models of transmembrane (TM) proteins. While a large number of compositional, evolutionary, and structural properties of proteins can be used to train contact prediction methods, recent research suggests that coevolution between residues provides the strongest indication of their spatial proximity. We have developed a deep learning approach, DeepHelicon, to predict inter-helical residue contacts in TM proteins by considering only coevolutionary features. DeepHelicon comprises a two-stage supervised learning process by residual neural networks for a gradual refinement of contact maps, followed by variance reduction by an ensemble of models. We present a benchmark study of 12 contact predictors and conclude that DeepHelicon together with the two other state-of-the-art methods DeepMetaPSICOV and Membrain2 outperforms the 10 remaining algorithms on all datasets and at all settings. On a set of 44 TM proteins with an average length of 388 residues DeepHelicon achieves the best performance among all benchmarked methods in predicting the top L/5 and L/2 inter-helical contacts, with the mean precision of 87.42% and 77.84%, respectively. On a set of 57 relatively small TM proteins with an average length of 298 residues DeepHelicon ranks second best after DeepMetaPSICOV. DeepHelicon produces the most accurate predictions for large proteins with more than 10 transmembrane helices. Coevolutionary features alone allow to predict inter-helical residue contacts with an accuracy sufficient for generating acceptable 3D models for up to 30% of proteins using a fully automated modeling method such as CONFOLD2.
Collapse
Affiliation(s)
- Jianfeng Sun
- Department of Bioinformatics, Wissenschaftzentrum Weihenstephan, Technische Universität München, 85354 Freising, Germany
| | - Dmitrij Frishman
- Department of Bioinformatics, Wissenschaftzentrum Weihenstephan, Technische Universität München, 85354 Freising, Germany.
| |
Collapse
|
38
|
Jiang M, Li Z, Zhang S, Wang S, Wang X, Yuan Q, Wei Z. Drug-target affinity prediction using graph neural network and contact maps. RSC Adv 2020; 10:20701-20712. [PMID: 35517730 PMCID: PMC9054320 DOI: 10.1039/d0ra02297g] [Citation(s) in RCA: 119] [Impact Index Per Article: 29.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2020] [Accepted: 05/07/2020] [Indexed: 02/01/2023] Open
Abstract
Computer-aided drug design uses high-performance computers to simulate the tasks in drug design, which is a promising research area. Drug-target affinity (DTA) prediction is the most important step of computer-aided drug design, which could speed up drug development and reduce resource consumption. With the development of deep learning, the introduction of deep learning to DTA prediction and improving the accuracy have become a focus of research. In this paper, utilizing the structural information of molecules and proteins, two graphs of drug molecules and proteins are built up respectively. Graph neural networks are introduced to obtain their representations, and a method called DGraphDTA is proposed for DTA prediction. Specifically, the protein graph is constructed based on the contact map output from the prediction method, which could predict the structural characteristics of the protein according to its sequence. It can be seen from the test of various metrics on benchmark datasets that the method proposed in this paper has strong robustness and generalizability.
Collapse
Affiliation(s)
- Mingjian Jiang
- Department of Computer Science and Technology, Ocean University of China China
| | - Zhen Li
- Department of Computer Science and Technology, Ocean University of China China
| | - Shugang Zhang
- Department of Computer Science and Technology, Ocean University of China China
| | - Shuang Wang
- Department of Computer Science and Technology, Ocean University of China China
| | - Xiaofeng Wang
- Department of Computer Science and Technology, Ocean University of China China
| | - Qing Yuan
- Department of Computer Science and Technology, Ocean University of China China
| | - Zhiqiang Wei
- Department of Computer Science and Technology, Ocean University of China China
| |
Collapse
|
39
|
Protein Contact Map Prediction Based on ResNet and DenseNet. BIOMED RESEARCH INTERNATIONAL 2020; 2020:7584968. [PMID: 32337273 PMCID: PMC7165324 DOI: 10.1155/2020/7584968] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/04/2020] [Accepted: 03/05/2020] [Indexed: 11/18/2022]
Abstract
Residue-residue contact prediction has become an increasingly important tool for modeling the three-dimensional structure of a protein when no homologous structure is available. Ultradeep residual neural network (ResNet) has become the most popular method for making contact predictions because it captures the contextual information between residues. In this paper, we propose a novel deep neural network framework for contact prediction which combines ResNet and DenseNet. This framework uses 1D ResNet to process sequential features, and besides PSSM, SS3, and solvent accessibility, we have introduced a new feature, position-specific frequency matrix (PSFM), as an input. Using ResNet's residual module and identity mapping, it can effectively process sequential features after which the outer concatenation function is used for sequential and pairwise features. Prediction accuracy is improved following a final processing step using the dense connection of DenseNet. The prediction accuracy of the protein contact map shows that our method is more effective than other popular methods due to the new network architecture and the added feature input.
Collapse
|
40
|
Torrisi M, Pollastri G, Le Q. Deep learning methods in protein structure prediction. Comput Struct Biotechnol J 2020; 18:1301-1310. [PMID: 32612753 PMCID: PMC7305407 DOI: 10.1016/j.csbj.2019.12.011] [Citation(s) in RCA: 116] [Impact Index Per Article: 29.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2019] [Revised: 12/19/2019] [Accepted: 12/20/2019] [Indexed: 01/01/2023] Open
Abstract
Protein Structure Prediction is a central topic in Structural Bioinformatics. Since the '60s statistical methods, followed by increasingly complex Machine Learning and recently Deep Learning methods, have been employed to predict protein structural information at various levels of detail. In this review, we briefly introduce the problem of protein structure prediction and essential elements of Deep Learning (such as Convolutional Neural Networks, Recurrent Neural Networks and basic feed-forward Neural Networks they are founded on), after which we discuss the evolution of predictive methods for one-dimensional and two-dimensional Protein Structure Annotations, from the simple statistical methods of the early days, to the computationally intensive highly-sophisticated Deep Learning algorithms of the last decade. In the process, we review the growth of the databases these algorithms are based on, and how this has impacted our ability to leverage knowledge about evolution and co-evolution to achieve improved predictions. We conclude this review outlining the current role of Deep Learning techniques within the wider pipelines to predict protein structures and trying to anticipate what challenges and opportunities may arise next.
Collapse
Affiliation(s)
- Mirko Torrisi
- School of Computer Science, University College Dublin, Ireland
| | | | - Quan Le
- Centre for Applied Data Analytics Research, University College Dublin, Ireland
| |
Collapse
|
41
|
PconsFam: An Interactive Database of Structure Predictions of Pfam Families. J Mol Biol 2019; 431:2442-2448. [DOI: 10.1016/j.jmb.2019.01.047] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2018] [Revised: 01/30/2019] [Accepted: 01/31/2019] [Indexed: 11/17/2022]
|
42
|
Bassot C, Menendez Hurtado D, Elofsson A. Using PconsC4 and PconsFold2 to Predict Protein Structure. ACTA ACUST UNITED AC 2019; 66:e75. [PMID: 31063641 DOI: 10.1002/cpbi.75] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
In spite of the fact that there has been a significant increase in the number of solved protein structures, structural information is missing for many proteins. Although structural information is codified in the amino acid sequence, computational prediction using only this information is still an unsolved problem. However, one successful method to model a protein's structure starting from the primary sequence is to use contact prediction derived from multiple sequence alignment (MSA). Here we use our contact predictor PconsC4 to generate a list of probable contacts between residues in the primary sequences. These contacts are then used together with the secondary structure prediction as constraints for the CONFOLD folding method. In this way, a 3D protein model can be built starting directly from the primary sequence. © 2019 by John Wiley & Sons, Inc.
Collapse
Affiliation(s)
- Claudio Bassot
- Department of Biochemistry and Biophysics, Stockholm University and Science for Life Laboratory, Solna, Sweden
| | - David Menendez Hurtado
- Department of Biochemistry and Biophysics, Stockholm University and Science for Life Laboratory, Solna, Sweden
| | - Arne Elofsson
- Department of Biochemistry and Biophysics, Stockholm University and Science for Life Laboratory, Solna, Sweden
| |
Collapse
|
43
|
Hou J, Wu T, Cao R, Cheng J. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins 2019; 87:1165-1178. [PMID: 30985027 PMCID: PMC6800999 DOI: 10.1002/prot.25697] [Citation(s) in RCA: 99] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2019] [Revised: 04/04/2019] [Accepted: 04/12/2019] [Indexed: 12/28/2022]
Abstract
Predicting residue‐residue distance relationships (eg, contacts) has become the key direction to advance protein structure prediction since 2014 CASP11 experiment, while deep learning has revolutionized the technology for contact and distance distribution prediction since its debut in 2012 CASP10 experiment. During 2018 CASP13 experiment, we enhanced our MULTICOM protein structure prediction system with three major components: contact distance prediction based on deep convolutional neural networks, distance‐driven template‐free (ab initio) modeling, and protein model ranking empowered by deep learning and contact prediction. Our experiment demonstrates that contact distance prediction and deep learning methods are the key reasons that MULTICOM was ranked 3rd out of all 98 predictors in both template‐free and template‐based structure modeling in CASP13. Deep convolutional neural network can utilize global information in pairwise residue‐residue features such as coevolution scores to substantially improve contact distance prediction, which played a decisive role in correctly folding some free modeling and hard template‐based modeling targets. Deep learning also successfully integrated one‐dimensional structural features, two‐dimensional contact information, and three‐dimensional structural quality scores to improve protein model quality assessment, where the contact prediction was demonstrated to consistently enhance ranking of protein models for the first time. The success of MULTICOM system clearly shows that protein contact distance prediction and model selection driven by deep learning holds the key of solving protein structure prediction problem. However, there are still challenges in accurately predicting protein contact distance when there are few homologous sequences, folding proteins from noisy contact distances, and ranking models of hard targets.
Collapse
Affiliation(s)
- Jie Hou
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
| | - Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
| | - Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, Washington
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
| |
Collapse
|