51. Du BX, Qin Y, Jiang YF, Xu Y, Yiu SM, Yu H, Shi JY. Compound–protein interaction prediction by deep learning: Databases, descriptors and models. Drug Discov Today 2022; 27:1350-1366. [DOI: 10.1016/j.drudis.2022.02.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2021] [Revised: 11/19/2021] [Accepted: 02/28/2022] [Indexed: 11/24/2022]
52. Jiang P, Chi Y, Li XS, Liu X, Hua XS, Xia K. Molecular persistent spectral image (Mol-PSI) representation for machine learning models in drug design. Brief Bioinform 2022; 23:6485012. [PMID: 34958660 DOI: 10.1093/bib/bbab527] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 09/19/2021] [Revised: 11/01/2021] [Accepted: 11/14/2021] [Indexed: 01/05/2023]
Abstract
Artificial intelligence (AI)-based drug design has great promise to fundamentally change the landscape of the pharmaceutical industry. Even though there has been great progress from handcrafted feature-based machine learning models, 3D convolutional neural networks (CNNs) and graph neural networks, effective and efficient representations that characterize the structural, physical, chemical and biological properties of molecular structures and interactions remain a great challenge. Here, we propose an equal-sized molecular 2D image representation, known as the molecular persistent spectral image (Mol-PSI), and combine it with a CNN model for AI-based drug design. Mol-PSI provides a unique one-to-one image representation for molecular structures and interactions. In general, deep models are empowered to achieve better performance with systematically organized representations in image format. A well-designed parallel CNN architecture for adapting Mol-PSIs is developed for protein-ligand binding affinity prediction. Our results for the three most commonly used databases, PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016, are better than those of all traditional machine learning models, as far as we know. Our Mol-PSI model provides a powerful molecular representation that can be widely used in AI-based drug design and molecular data analysis.
Affiliation(s)
- Peiran Jiang
  - Drug Discovery Intelligence, AI Center, Alibaba Group DAMO Academy, Wen Yi Xi Road, Yuhang District, Hangzhou City, 310000, Zhejiang, China
- Ying Chi
  - Drug Discovery Intelligence, AI Center, Alibaba Group DAMO Academy, Wen Yi Xi Road, Yuhang District, Hangzhou City, 310000, Zhejiang, China
- Xiao-Shuang Li
  - Drug Discovery Intelligence, AI Center, Alibaba Group DAMO Academy, Wen Yi Xi Road, Yuhang District, Hangzhou City, 310000, Zhejiang, China
- Xiang Liu
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
  - Chern Institute of Mathematics and LPMC, Nankai University, 300071, Tianjin, China
- Xian-Sheng Hua
  - Drug Discovery Intelligence, AI Center, Alibaba Group DAMO Academy, Wen Yi Xi Road, Yuhang District, Hangzhou City, 310000, Zhejiang, China
- Kelin Xia
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
53. Protein-ligand binding affinity prediction based on profiles of intermolecular contacts. Comput Struct Biotechnol J 2022; 20:1088-1096. [PMID: 35317230 PMCID: PMC8902473 DOI: 10.1016/j.csbj.2022.02.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/21/2021] [Revised: 02/08/2022] [Accepted: 02/08/2022] [Indexed: 11/30/2022]
Abstract
As a key element in structure-based drug design, binding affinity prediction (BAP) for putative protein-ligand complexes can be efficiently achieved by combining structural descriptors with machine-learning models. However, developing concise descriptors that lead to accurate and interpretable BAP remains a difficult problem in this field. Herein, we introduce the profiles of intermolecular contacts (IMCPs) as descriptors for machine-learning-based BAP. IMCPs describe each group of protein-ligand contacts by the count and average distance of the group members, and work well with classical machine-learning models. Evaluated on multiple validation sets, IMCP-based models often achieve better BAP accuracy than models built on other similar descriptors. Additionally, IMCPs are simple, concise and easy to interpret in model training. These descriptors compactly summarize the structural information of protein-ligand complexes and can be easily extended with personalized profile features. IMCPs have been implemented in the BAP Toolkit on GitHub (https://github.com/debbydanwang/BAP).
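The descriptor idea in this abstract (each group of protein-ligand contacts summarized by a count and an average distance) can be sketched as follows. This is only a minimal illustration: grouping contacts by element pair, the 6.0 Å cutoff, and the function name `contact_profile` are assumptions made here, not the BAP Toolkit's actual grouping scheme or API.

```python
from collections import defaultdict
from math import dist


def contact_profile(protein_atoms, ligand_atoms, cutoff=6.0):
    """Summarize protein-ligand contacts per group as (count, mean distance).

    Atoms are (element, (x, y, z)) tuples. Contacts are grouped by the
    (protein element, ligand element) pair, which is an assumed grouping.
    """
    distances = defaultdict(list)
    for p_elem, p_xyz in protein_atoms:
        for l_elem, l_xyz in ligand_atoms:
            d = dist(p_xyz, l_xyz)
            if d <= cutoff:  # keep only pairs close enough to count as a contact
                distances[(p_elem, l_elem)].append(d)
    return {pair: (len(ds), sum(ds) / len(ds)) for pair, ds in distances.items()}


# Hypothetical toy coordinates (in angstroms), not taken from any real complex.
protein = [("N", (0.0, 0.0, 0.0)), ("O", (10.0, 0.0, 0.0))]
ligand = [("C", (3.0, 0.0, 0.0)), ("C", (0.0, 4.0, 0.0))]
profile = contact_profile(protein, ligand)
```

Here only the N-C group has contacts within the cutoff, so the profile holds a single entry with two contacts and their mean distance. Flattening such a dictionary over a fixed list of groups would give a fixed-length feature vector for a classical regressor.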
54. Jiang D, Hsieh CY, Wu Z, Kang Y, Wang J, Wang E, Liao B, Shen C, Xu L, Wu J, Cao D, Hou T. InteractionGraphNet: A Novel and Efficient Deep Graph Representation Learning Framework for Accurate Protein-Ligand Interaction Predictions. J Med Chem 2021; 64:18209-18232. [PMID: 34878785 DOI: 10.1021/acs.jmedchem.1c01830] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Indexed: 12/21/2022]
Abstract
Accurate quantification of protein-ligand interactions remains a key challenge to structure-based drug design. However, traditional machine learning (ML)-based methods that rely on handcrafted descriptors, one-dimensional protein sequences, and/or two-dimensional graph representations are limited in their capability to learn generalized molecular interactions in 3D space. Here, we proposed a novel deep graph representation learning framework named InteractionGraphNet (IGN) to learn the protein-ligand interactions from the 3D structures of protein-ligand complexes. In IGN, two independent graph convolution modules were stacked to sequentially learn the intramolecular and intermolecular interactions, and the learned intermolecular interactions can be efficiently used for subsequent tasks. Extensive binding affinity prediction, large-scale structure-based virtual screening, and pose prediction experiments demonstrated that IGN achieved better or competitive performance against other state-of-the-art ML-based baselines and docking programs. More importantly, this state-of-the-art performance was shown to stem from successfully learning the key features in protein-ligand interactions instead of just memorizing certain biased patterns from data.
Affiliation(s)
- Dejun Jiang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
  - State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Chang-Yu Hsieh
  - Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, China
- Zhenxing Wu
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
- Yu Kang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Jike Wang
  - School of Computer Science, Wuhan University, Wuhan 430072, Hubei, China
- Ercheng Wang
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
- Ben Liao
  - Tencent Quantum Laboratory, Tencent, Shenzhen 518057, Guangdong, China
- Chao Shen
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
- Lei Xu
  - Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Jian Wu
  - College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
- Dongsheng Cao
  - Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410004, Hunan, China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China
  - State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou 310058, Zhejiang, China
55. Wang X, Zhao R, Ji W, Zhou J, Liu Q, Zhao L, Shen Z, Liu S, Xu B. Discovery of Novel Indole Derivatives as Fructose-1,6-bisphosphatase Inhibitors and X-ray Cocrystal Structures Analysis. ACS Med Chem Lett 2021; 13:118-127. [PMID: 35059131 PMCID: PMC8762752 DOI: 10.1021/acsmedchemlett.1c00613] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/02/2021] [Accepted: 12/15/2021] [Indexed: 01/16/2023]
Abstract
Liver fructose-1,6-bisphosphatase (FBPase) is a key enzyme in gluconeogenesis, and its inhibitors are expected to be novel antidiabetic agents. Herein, a series of new indole and benzofuran analogues were designed and synthesized, and their inhibitory activity against FBPase was evaluated. As a result, novel FBPase inhibitors bearing an N-acylsulfonamide moiety at the 3-position of the indole-2-carboxylic acid scaffold (compounds 22f and 22g) were identified with IC50 values at submicromolar levels. Three X-ray crystal structures of the complexes were solved and revealed the structural basis for the inhibitory activity. The chemoinformatics analysis further disclosed the distinct binding features of this class of inhibitors, providing insight for further modifications to create structurally distinct FBPase inhibitors with high potency and drug-like properties.
Affiliation(s)
- Xiaoyu Wang
  - Beijing Key Laboratory of Active Substances Discovery and Druggability Evaluation, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Rui Zhao
  - Beijing Key Laboratory of Active Substances Discovery and Druggability Evaluation, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang, 100016, China
- Wenming Ji
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
  - Diabetes Research Center, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Jie Zhou
  - Beijing Key Laboratory of Active Substances Discovery and Druggability Evaluation, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Quan Liu
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
  - Diabetes Research Center, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Linxiang Zhao
  - School of Pharmaceutical Engineering, Shenyang Pharmaceutical University, Shenyang, 100016, China
- Zhufang Shen
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
  - Diabetes Research Center, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Shuainan Liu
  - State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
  - Diabetes Research Center, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Bailing Xu
  - Beijing Key Laboratory of Active Substances Discovery and Druggability Evaluation, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
56. Wang DD, Chan MT, Yan H. Structure-based protein-ligand interaction fingerprints for binding affinity prediction. Comput Struct Biotechnol J 2021; 19:6291-6300. [PMID: 34900139 PMCID: PMC8637032 DOI: 10.1016/j.csbj.2021.11.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/30/2021] [Revised: 11/09/2021] [Accepted: 11/13/2021] [Indexed: 11/17/2022]
Abstract
Binding affinity prediction (BAP) using protein–ligand complex structures is crucial to computer-aided drug design, but remains a challenging problem. To achieve efficient and accurate BAP, machine-learning scoring functions (SFs) based on a wide range of descriptors have been developed. Among those descriptors, protein–ligand interaction fingerprints (IFPs) are competitive due to their simple representations, elaborate profiles of key interactions and easy collaborations with machine-learning algorithms. In this paper, we have adopted a building-block-based taxonomy to review a broad range of IFP models, and compared representative IFP-based SFs in target-specific and generic scoring tasks. Atom-pair-counts-based and substructure-based IFPs show great potential in these tasks.
Affiliation(s)
- Debby D Wang
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, 516 Jungong Rd, Shanghai 200093, China
- Moon-Tong Chan
  - School of Science and Technology, Hong Kong Metropolitan University, 30 Good Shepherd St, Ho Man Tin, Hong Kong
- Hong Yan
  - Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
57. Nguyen TB, Pires DEV, Ascher DB. CSM-carbohydrate: protein-carbohydrate binding affinity prediction and docking scoring function. Brief Bioinform 2021; 23:6457169. [PMID: 34882232 DOI: 10.1093/bib/bbab512] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/23/2021] [Revised: 11/06/2021] [Accepted: 11/08/2021] [Indexed: 12/29/2022]
Abstract
Protein-carbohydrate interactions are crucial for many cellular processes but can be challenging to biologically characterise. To improve our understanding and ability to model these molecular interactions, we used a carefully curated set of 370 protein-carbohydrate complexes with experimental structural and biophysical data to train and validate a new tool, cutoff scanning matrix (CSM)-carbohydrate, which uses machine learning algorithms to predict binding affinity and to rank docking poses as a scoring function. Information on both protein and carbohydrate complementarity, in terms of shape and chemistry, was captured using graph-based structural signatures. Across both training and independent test sets, we achieved comparable Pearson's correlations of 0.72 under cross-validation [root mean square error (RMSE) of 1.58 kcal/mol] and 0.67 on the independent test (RMSE of 1.72 kcal/mol), providing confidence in the generalisability and robustness of the final model. Similar performance was obtained across mono-, di- and oligosaccharides, further highlighting the applicability of this approach to the study of larger complexes. We show that CSM-carbohydrate significantly outperformed previous approaches, and we have implemented our method and made all data freely available through both a user-friendly web interface and an application programming interface, to facilitate programmatic access, at http://biosig.unimelb.edu.au/csm_carbohydrate/. We believe CSM-carbohydrate will be an invaluable tool for helping assess docking poses and the effects of mutations on protein-carbohydrate affinity, unravelling important aspects that drive binding recognition.
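The two evaluation metrics quoted in this abstract, Pearson's correlation and RMSE, can be computed as in the sketch below. The function names and the affinity values are hypothetical and chosen only to make the arithmetic easy to follow; they are not data from the CSM-carbohydrate study.

```python
from math import sqrt


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / sqrt(var_t * var_p)


def rmse(y_true, y_pred):
    """Root mean square error, in the same units as the inputs (e.g. kcal/mol)."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Hypothetical binding affinities in kcal/mol: every prediction is off by 0.5.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
r = pearson_r(measured, predicted)
error = rmse(measured, predicted)
```

In this toy case the correlation is perfect while the RMSE is 0.5 kcal/mol, which shows why the two numbers are usually reported together: correlation ignores a constant offset that RMSE penalizes.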
Affiliation(s)
- Thanh Binh Nguyen
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Victoria, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Australia
- Douglas E V Pires
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Victoria, Australia
  - School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria, Australia
- David B Ascher
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - Systems and Computational Biology, Bio21 Institute, University of Melbourne, Melbourne, Victoria, Australia
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, Australia
  - Department of Biochemistry, University of Cambridge, Cambridge, UK
58.
Abstract
Virtual screening (predicting which compounds within a specified compound library bind to a target molecule, typically a protein) is a fundamental task in the field of drug discovery. Doing virtual screening well provides tangible practical benefits, including reduced drug development costs, faster time to therapeutic viability, and fewer unforeseen side effects. As with most applied computational tasks, the algorithms currently used to perform virtual screening feature inherent tradeoffs between speed and accuracy. Furthermore, even theoretically rigorous, computationally intensive methods may fail to account for important effects relevant to whether a given compound will ultimately be usable as a drug. Here we investigate the virtual screening performance of the recently released Gnina molecular docking software, which uses deep convolutional networks to score protein-ligand structures. We find that, on average, Gnina outperforms conventional empirical scoring. The default scoring in Gnina outperforms the empirical AutoDock Vina scoring function on 89 of the 117 targets of the DUD-E and LIT-PCBA virtual screening benchmarks, with a median 1% early enrichment factor that is more than twice that of Vina. However, we also find that issues of bias linger in these sets, even when they are not used directly to train models, and this bias obscures the extent to which machine learning models achieve their performance through a sophisticated interpretation of molecular interactions versus fitting to non-informative, simplistic property distributions.
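The 1% early enrichment factor mentioned in this abstract is commonly defined as the fraction of actives among the top-ranked 1% of compounds divided by the fraction of actives in the whole library. The sketch below follows that common definition; the function name, the "higher score is better" convention, and the toy library are assumptions for illustration and are not taken from the Gnina evaluation code.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at the given top fraction of a ranked library.

    scores: predicted scores, assumed here to be higher for better compounds
            (negate the scores for tools where lower is better).
    labels: 1 for known actives, 0 for inactives or decoys.
    """
    total_actives = sum(labels)
    if total_actives == 0:
        raise ValueError("the library contains no actives")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, round(len(ranked) * fraction))
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / len(labels))


# Hypothetical library of 200 compounds with 10 actives; compound i gets
# score 200 - i, so compounds 0 and 1 form the top 1% (two compounds).
scores = [200 - i for i in range(200)]
labels = [0] * 200
labels[0] = 1                 # one active lands in the top 1%
for i in range(100, 109):     # the other nine actives rank much lower
    labels[i] = 1
ef_1_percent = enrichment_factor(scores, labels)
```

With one active among the two top-ranked compounds (50%) against a 5% base rate, the enrichment factor is 10, meaning the ranking finds actives ten times more often in the top 1% than random selection would.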
Affiliation(s)
- David Ryan Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA
59. A geometric deep learning approach to predict binding conformations of bioactive molecules. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00409-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 11/08/2022]
60. Wang Y, Wu S, Duan Y, Huang Y. A point cloud-based deep learning strategy for protein-ligand binding affinity prediction. Brief Bioinform 2021; 23:6440132. [PMID: 34849569 DOI: 10.1093/bib/bbab474] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 07/13/2021] [Revised: 09/21/2021] [Accepted: 10/15/2021] [Indexed: 01/14/2023]
Abstract
There is great interest in developing artificial intelligence-based protein-ligand binding affinity models due to their immense applications in drug discovery. In this paper, PointNet and PointTransformer, two pointwise multi-layer perceptrons, have been applied to protein-ligand binding affinity prediction for the first time. Three-dimensional point clouds could be rapidly generated from PDBbind-2016, with 3772 and 11,327 individual point clouds derived from the refined and/or general sets, respectively. These point clouds (the refined or the extended set) were used to train PointNet or PointTransformer, resulting in protein-ligand binding affinity prediction models with Pearson correlation coefficients R = 0.795 or 0.833 from the extended data set, respectively, based on the CASF-2016 benchmark test. The analysis of parameters suggests that the two deep learning models were capable of learning many interactions between proteins and their ligands, and some key atoms for the interactions could be visualized. The protein-ligand interaction features learned by PointTransformer could be further adapted for the XGBoost-based machine learning algorithm, resulting in prediction models with an average Rp of 0.827, which is on par with state-of-the-art machine learning models. These results suggest that the point clouds derived from PDBbind data sets are useful for evaluating the performance of 3D point cloud-centered deep learning algorithms, which could learn atomic features of protein-ligand interactions from natural evolution or medicinal chemistry and thus have wide applications in chemistry and biology.
Affiliation(s)
- Yeji Wang
  - Xiangya International Academy of Translational Medicine, Central South University, Changsha, Hunan 410013, China
- Shuo Wu
  - Xiangya International Academy of Translational Medicine, Central South University, Changsha, Hunan 410013, China
- Yanwen Duan
  - Xiangya International Academy of Translational Medicine, Central South University, Changsha, Hunan 410013, China
  - Hunan Engineering Research Center of Combinatorial Biosynthesis and Natural Product Drug Discovery, Changsha, Hunan 410011, China
  - National Engineering Research Center of Combinatorial Biosynthesis for Drug Discovery, Changsha, Hunan 410011, China
- Yong Huang
  - Xiangya International Academy of Translational Medicine, Central South University, Changsha, Hunan 410013, China
  - National Engineering Research Center of Combinatorial Biosynthesis for Drug Discovery, Changsha, Hunan 410011, China
61. Seo S, Choi J, Park S, Ahn J. Binding affinity prediction for protein-ligand complex using deep attention mechanism based on intermolecular interactions. BMC Bioinformatics 2021; 22:542. [PMID: 34749664 PMCID: PMC8576937 DOI: 10.1186/s12859-021-04466-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 07/08/2021] [Accepted: 10/08/2021] [Indexed: 12/03/2022]
Abstract
BACKGROUND: Accurate prediction of protein-ligand binding affinity is important for lowering the overall cost of drug discovery in structure-based drug design. For accurate predictions, many classical scoring functions and machine learning-based methods have been developed. However, these techniques tend to have limitations, mainly resulting from a lack of sufficient energy terms to describe the complex interactions between proteins and ligands. Recent deep-learning techniques can potentially solve this problem. However, the search for more efficient and appropriate deep-learning architectures and methods to represent protein-ligand complexes is ongoing. RESULTS: In this study, we proposed a deep neural network model to improve the prediction accuracy of protein-ligand complex binding affinity. The proposed model has two important features: descriptor embeddings with information on the local structures of a protein-ligand complex, and an attention mechanism to highlight important descriptors for binding affinity prediction. The proposed model performed better than existing binding affinity prediction models on most benchmark datasets. CONCLUSIONS: We confirmed that an attention mechanism can capture the binding sites in a protein-ligand complex to improve prediction performance. Our code is available at https://github.com/Blue1993/BAPA.
Affiliation(s)
- Sangmin Seo
  - Department of Computer Science, Yonsei University, Seoul, Republic of Korea
  - UBLBio Corporation, 16679, Suwon, Republic of Korea
- Jonghwan Choi
  - Department of Computer Science, Yonsei University, Seoul, Republic of Korea
  - UBLBio Corporation, 16679, Suwon, Republic of Korea
- Sanghyun Park
  - Department of Computer Science, Yonsei University, Seoul, Republic of Korea
- Jaegyoon Ahn
  - Department of Computer Science and Engineering, Incheon National University, Incheon, Republic of Korea
62. Yuan H, Huang J, Li J. Protein-ligand binding affinity prediction model based on graph attention network. Math Biosci Eng 2021; 18:9148-9162. [PMID: 34814340 DOI: 10.3934/mbe.2021451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Estimating the binding affinity between proteins and drugs is very important in the application of structure-based drug design. Currently, applying machine learning to build the protein-ligand binding affinity prediction model, which is helpful to improve the performance of classical scoring functions, has attracted many scientists' attention. In this paper, we have developed an affinity prediction model called GAT-Score based on graph attention network (GAT). The protein-ligand complex is represented by a graph structure, and the atoms of protein and ligand are treated in the same manner. Two improvements are made to the original graph attention network. Firstly, a dynamic feature mechanism is designed to enable the model to deal with bond features. Secondly, a virtual super node is introduced to aggregate node-level features into graph-level features, so that the model can be used in the graph-level regression problems. PDBbind database v.2018 is used to train the model. Finally, the performance of GAT-Score was tested by the scheme $C_s$ (Core set as the test set) and CV (Cross-Validation). It has been found that our results are better than most methods from machine learning models with traditional molecular descriptors.
Affiliation(s)
- Hong Yuan
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, China
  - Medicine & Engineering & Informatics Fusion and Transformation Key Laboratory of Luzhou City, Luzhou, China
- Jing Huang
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, China
  - Medicine & Engineering & Informatics Fusion and Transformation Key Laboratory of Luzhou City, Luzhou, China
- Jin Li
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, China
  - Medicine & Engineering & Informatics Fusion and Transformation Key Laboratory of Luzhou City, Luzhou, China
63. Bouysset C, Fiorucci S. ProLIF: a library to encode molecular interactions as fingerprints. J Cheminform 2021; 13:72. [PMID: 34563256 PMCID: PMC8466659 DOI: 10.1186/s13321-021-00548-6] [Citation(s) in RCA: 115] [Impact Index Per Article: 38.3] [Received: 06/15/2021] [Accepted: 08/30/2021] [Indexed: 12/21/2022]
Abstract
Interaction fingerprints are vector representations that summarize the three-dimensional nature of interactions in molecular complexes, typically formed between a protein and a ligand. This kind of encoding has found many applications in drug-discovery projects, from structure-based virtual-screening to machine-learning. Here, we present ProLIF, a Python library designed to generate interaction fingerprints for molecular complexes extracted from molecular dynamics trajectories, experimental structures, and docking simulations. It can handle complexes formed of any combination of ligand, protein, DNA, or RNA molecules. The available interaction types can be fully reparametrized or extended by user-defined ones. Several tutorials that cover typical use-case scenarios are available, and the documentation is accompanied with code snippets showcasing the integration with other data-analysis libraries for a more seamless user-experience. The library can be freely installed from our GitHub repository (https://github.com/chemosim-lab/ProLIF).
Affiliation(s)
- Cédric Bouysset
  - Institut de Chimie de Nice UMR7272, Université Côte d'Azur, CNRS, Nice, France
- Sébastien Fiorucci
  - Institut de Chimie de Nice UMR7272, Université Côte d'Azur, CNRS, Nice, France
64. Di Filippo JI, Cavasotto CN. Guided structure-based ligand identification and design via artificial intelligence modeling. Expert Opin Drug Discov 2021; 17:71-78. [PMID: 34544293 DOI: 10.1080/17460441.2021.1979514] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/20/2022]
Abstract
INTRODUCTION: The implementation of Artificial Intelligence (AI) methodologies in drug discovery (DD) is on the rise. Several applications have been developed for structure-based DD, where AI methods provide an alternative framework for the identification of ligands for validated therapeutic targets, as well as the de novo design of ligands through generative models. AREAS COVERED: Herein, the authors review contributions from 2019 to the present regarding the application of AI methods to structure-based virtual screening (SBVS), which encompasses mainly molecular docking applications (binding pose prediction and binary classification for ligand or hit identification), as well as de novo drug design driven by machine learning (ML) generative models, and the validation of AI models in structure-based screening. Studies are reviewed in terms of their main objective, databases used, implemented methodology, input and output, and key results. EXPERT OPINION: More profound analyses regarding the validity and applicability of AI methods in DD have begun to appear. In the near future, we expect to see more structure-based generative models (which are scarce in comparison to ligand-based generative models), the implementation of standard guidelines for validating the generated structures, and more analyses regarding the validation of AI methods in structure-based DD.
Affiliation(s)
- Juan I Di Filippo
  - Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires, Argentina
  - Facultad de Ciencias Biomédicas, and Facultad de Ingeniería, Universidad Austral, Pilar, Buenos Aires, Argentina
  - Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Buenos Aires, Argentina
- Claudio N Cavasotto
  - Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, Buenos Aires, Argentina
  - Facultad de Ciencias Biomédicas, and Facultad de Ingeniería, Universidad Austral, Pilar, Buenos Aires, Argentina
  - Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, Buenos Aires, Argentina
65
|
Rifaioglu AS, Cetin Atalay R, Cansen Kahraman D, Doğan T, Martin M, Atalay V. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 2021; 37:693-704. [PMID: 33067636 DOI: 10.1093/bioinformatics/btaa858] [Citation(s) in RCA: 51] [Impact Index Per Article: 17.0] [Received: 05/02/2020] [Revised: 08/16/2020] [Accepted: 10/06/2020] [Indexed: 12/20/2022]
Abstract
MOTIVATION Identification of interactions between bioactive small molecules and target proteins is crucial for novel drug discovery, drug repurposing and uncovering off-target effects. Due to the tremendous size of the chemical space, experimental bioactivity screening efforts require the aid of computational approaches. Although deep learning models have been successful in predicting bioactive compounds, effective and comprehensive featurization of proteins, to be given as input to deep neural networks, remains a challenge. RESULTS Here, we present a novel protein featurization approach to be used in deep learning-based compound-target protein binding affinity prediction. In the proposed method, multiple types of protein features such as sequence, structural, evolutionary and physicochemical properties are incorporated within multiple 2D vectors, which are then fed to state-of-the-art pairwise input hybrid deep neural networks to predict the real-valued compound-target protein interactions. The method adopts the proteochemometric approach, where both the compound and target protein features are used at the input level to model their interaction. The whole system is called MDeePred and it is a new method to be used for the purposes of computational drug discovery and repositioning. We evaluated MDeePred on well-known benchmark datasets and compared its performance with the state-of-the-art methods. We also performed in vitro comparative analysis of MDeePred predictions with selected kinase inhibitors' action on cancer cells. MDeePred is a scalable method with sufficiently high predictive performance. The featurization approach proposed here can also be utilized for other protein-related predictive tasks. AVAILABILITY AND IMPLEMENTATION The source code, datasets, additional information and user instructions of MDeePred are available at https://github.com/cansyl/MDeePred. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- A S Rifaioglu
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey; Department of Computer Engineering, İskenderun Technical University, Hatay, Turkey
- R Cetin Atalay
- Graduate School of Informatics, Middle East Technical University, Ankara, Turkey; Section of Pulmonary and Critical Care Medicine, The University of Chicago, Chicago, IL, USA
- D Cansen Kahraman
- Graduate School of Informatics, Middle East Technical University, Ankara, Turkey
- T Doğan
- Department of Computer Engineering, Hacettepe University, Ankara, Turkey; Institute of Informatics, Hacettepe University, Ankara, Turkey
- M Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, Hinxton, UK
- V Atalay
- Department of Computer Engineering, Middle East Technical University, Ankara, Turkey
|
66
|
Xiong G, Shen C, Yang Z, Jiang D, Liu S, Lu A, Chen X, Hou T, Cao D. Featurization strategies for protein–ligand interactions and their applications in scoring function development. WIREs Computational Molecular Science 2021. [DOI: 10.1002/wcms.1567] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/18/2022]
Affiliation(s)
- Guoli Xiong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Chao Shen
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Ziyi Yang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Dejun Jiang
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- College of Computer Science and Technology, Zhejiang University, Hangzhou, China
- Shao Liu
- Department of Pharmacy, Xiangya Hospital, Central South University, Changsha, China
- Aiping Lu
- Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
- Xiang Chen
- Department of Dermatology, Hunan Engineering Research Center of Skin Health and Disease, Hunan Key Laboratory of Skin Cancer and Psoriasis, Xiangya Hospital, Central South University, Changsha, China
- Tingjun Hou
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, China
- Institute for Advancing Translational Medicine in Bone & Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR, China
|
67
|
Sánchez-Cruz N, Medina-Franco JL, Mestres J, Barril X. Extended connectivity interaction features: improving binding affinity prediction through chemical description. Bioinformatics 2021; 37:1376-1382. [PMID: 33226061 DOI: 10.1093/bioinformatics/btaa982] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 09/18/2020] [Revised: 10/27/2020] [Accepted: 11/10/2020] [Indexed: 12/22/2022]
Abstract
MOTIVATION Machine-learning scoring functions (SFs) have been found to outperform standard SFs for binding affinity prediction of protein-ligand complexes. A plethora of reports focus on the implementation of increasingly complex algorithms, while the chemical description of the system has not been fully exploited. RESULTS Herein, we introduce Extended Connectivity Interaction Features (ECIF) to describe protein-ligand complexes and build machine-learning SFs with improved predictions of binding affinity. ECIF are a set of protein-ligand atom-type pair counts that take into account each atom's connectivity to describe it and thus define the pair types. ECIF were used to build different machine-learning models to predict protein-ligand affinities (pKd/pKi). The models were evaluated in terms of 'scoring power' on the Comparative Assessment of Scoring Functions 2016. The best models built on ECIF achieved Pearson correlation coefficients of 0.857 when used on its own, and 0.866 when used in combination with ligand descriptors, demonstrating ECIF descriptive power. AVAILABILITY AND IMPLEMENTATION Data and code to reproduce all the results are freely available at https://github.com/DIFACQUIM/ECIF. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Norberto Sánchez-Cruz
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- José L Medina-Franco
- Department of Pharmacy, School of Chemistry, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Jordi Mestres
- Research Group on Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Medical Research Institute and University Pompeu Fabra, Parc de Recerca Biomedica (PRBB), 08003 Barcelona, Catalonia, Spain
- Chemotargets SL, Parc Cientific de Barcelona (PCB), 08028 Barcelona, Catalonia, Spain
- Xavier Barril
- Institut de Biomedicina de la Universitat de Barcelona (IBUB) and Facultat de Farmacia, Universitat de Barcelona, 08028 Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), 08010 Barcelona, Spain
|
68
|
Wee J, Xia K. Forman persistent Ricci curvature (FPRC)-based machine learning models for protein-ligand binding affinity prediction. Brief Bioinform 2021; 22:6262241. [PMID: 33940588 DOI: 10.1093/bib/bbab136] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 02/17/2021] [Revised: 03/14/2021] [Accepted: 03/23/2021] [Indexed: 01/01/2023]
Abstract
Artificial intelligence (AI) techniques have already been gradually applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering, for the first time. Molecular structures and interactions are modeled as simplicial complexes, which are generalizations of graphs to their higher dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales are generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process are defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors, and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on the three most commonly used datasets, including PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.
Affiliation(s)
- JunJie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
|
69
|
Meng Z, Xia K. Persistent spectral-based machine learning (PerSpect ML) for protein-ligand binding affinity prediction. Science Advances 2021; 7:eabc5329. [PMID: 33962954 PMCID: PMC8104863 DOI: 10.1126/sciadv.abc5329] [Citation(s) in RCA: 80] [Impact Index Per Article: 26.7] [Received: 04/29/2020] [Accepted: 03/18/2021] [Indexed: 05/11/2023]
Abstract
Molecular descriptors are essential to not only quantitative structure-activity relationship (QSAR) models but also machine learning-based material, chemical, and biological data analysis. Here, we propose persistent spectral-based machine learning (PerSpect ML) models for drug design. Different from all previous spectral models, a filtration process is introduced to generate a sequence of spectral models at various different scales. PerSpect attributes are defined as the function of spectral variables over the filtration value. Molecular descriptors obtained from PerSpect attributes are combined with machine learning models for protein-ligand binding affinity prediction. Our results, for the three most commonly used databases including PDBbind-2007, PDBbind-2013, and PDBbind-2016, are better than all existing models, as far as we know. The proposed PerSpect theory provides a powerful feature engineering framework. PerSpect ML models demonstrate great potential to significantly improve the performance of learning models in molecular data analysis.
Affiliation(s)
- Zhenyu Meng
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
|
70
|
Kimber TB, Chen Y, Volkamer A. Deep Learning in Virtual Screening: Recent Applications and Developments. Int J Mol Sci 2021; 22:4435. [PMID: 33922714 PMCID: PMC8123040 DOI: 10.3390/ijms22094435] [Citation(s) in RCA: 66] [Impact Index Per Article: 22.0] [Received: 03/18/2021] [Revised: 04/13/2021] [Accepted: 04/14/2021] [Indexed: 01/03/2023]
Abstract
Drug discovery is a cost and time-intensive process that is often assisted by computational methods, such as virtual screening, to speed up and guide the design of new compounds. For many years, machine learning methods have been successfully applied in the context of computer-aided drug discovery. Recently, thanks to the rise of novel technologies as well as the increasing amount of available chemical and bioactivity data, deep learning has gained a tremendous impact in rational active compound discovery. Herein, recent applications and developments of machine learning, with a focus on deep learning, in virtual screening for active compound design are reviewed. This includes introducing different compound and protein encodings, deep learning techniques as well as frequently used bioactivity and benchmark data sets for model training and testing. Finally, the present state-of-the-art, including the current challenges and emerging problems, are examined and discussed.
Affiliation(s)
- Andrea Volkamer
- In Silico Toxicology and Structural Bioinformatics, Institute of Physiology, Charité-Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany; (T.B.K.); (Y.C.)
|
71
|
Vaškevičius M, Kapočiūtė-Dzikienė J, Šlepikas L. Prediction of Chromatography Conditions for Purification in Organic Synthesis Using Deep Learning. Molecules 2021; 26:2474. [PMID: 33922736 PMCID: PMC8123027 DOI: 10.3390/molecules26092474] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Revised: 04/15/2021] [Accepted: 04/22/2021] [Indexed: 01/27/2023]
Abstract
In this research, a process for developing normal-phase liquid chromatography solvent systems has been proposed. In contrast to the development of conditions via thin-layer chromatography (TLC), this process is based on the architecture of two hierarchically connected neural network-based components. Using a large database of reaction procedures allows those two components to perform an essential role in the machine-learning-based prediction of chromatographic purification conditions, i.e., solvents and the ratio between solvents. In our paper, we build two datasets and test various molecular vectorization approaches, such as extended-connectivity fingerprints, learned embedding, and auto-encoders along with different types of deep neural networks to demonstrate a novel method for modeling chromatographic solvent systems employing two neural networks in sequence. Afterward, we present our findings and provide insights on the most effective methods for solving prediction tasks. Our approach results in a system of two neural networks with long short-term memory (LSTM)-based auto-encoders, where the first predicts solvent labels (by reaching the classification accuracy of 0.950 ± 0.001) and in the case of two solvents, the second one predicts the ratio between two solvents (R2 metric equal to 0.982 ± 0.001). Our approach can be used as a guidance instrument in laboratories to accelerate scouting for suitable chromatography conditions.
Affiliation(s)
- Mantas Vaškevičius
- Department of Applied Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania
- JSC Synhet, Biržų Str. 6, LT-44139 Kaunas, Lithuania
|
72
|
Liu X, Feng H, Wu J, Xia K. Persistent spectral hypergraph based machine learning (PSH-ML) for protein-ligand binding affinity prediction. Brief Bioinform 2021; 22:6219114. [PMID: 33837771 DOI: 10.1093/bib/bbab127] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 01/12/2021] [Revised: 03/14/2021] [Accepted: 03/16/2021] [Indexed: 12/21/2022]
Abstract
Molecular descriptors are essential to not only quantitative structure activity/property relationship (QSAR/QSPR) models, but also machine learning based chemical and biological data analysis. In this paper, we propose persistent spectral hypergraph (PSH) based molecular descriptors or fingerprints for the first time. Our PSH-based molecular descriptors are used in the characterization of molecular structures and interactions, and further combined with machine learning models, in particular gradient boosting tree (GBT), for protein-ligand binding affinity prediction. Different from traditional molecular descriptors, which are usually based on molecular graph models, a hypergraph-based topological representation is proposed for protein-ligand interaction characterization. Moreover, a filtration process is introduced to generate a series of nested hypergraphs at different scales. For each of these hypergraphs, its eigen spectrum information can be obtained from the corresponding (Hodge) Laplacian matrix. PSH studies the persistence and variation of the eigen spectrum of the nested hypergraphs during the filtration process. Molecular descriptors or fingerprints can be generated from persistent attributes, which are statistical or combinatorial functions of PSH, and combined with machine learning models, in particular, GBT. We test our PSH-GBT model on the three most commonly used datasets, including PDBbind-2007, PDBbind-2013 and PDBbind-2016. Our results, for all these databases, are better than all existing machine learning models with traditional molecular descriptors, as far as we know.
Affiliation(s)
- Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China, 300071; Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China, 050024
- Huitao Feng
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China, 300071; Mathematical Science Research Center, Chongqing University of Technology, Chongqing, China, 400054
- Jie Wu
- Center for Topology and Geometry Based Technology, Hebei Normal University, Hebei, China, 050024; School of Mathematical Sciences, Hebei Normal University, Hebei, China, 050024
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
|
73
|
Kumar S, Kim MH. SMPLIP-Score: predicting ligand binding affinity from simple and interpretable on-the-fly interaction fingerprint pattern descriptors. J Cheminform 2021; 13:28. [PMID: 33766140 PMCID: PMC7993508 DOI: 10.1186/s13321-021-00507-1] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/07/2020] [Accepted: 03/16/2021] [Indexed: 12/13/2022]
Abstract
In drug discovery, rapid and accurate prediction of protein–ligand binding affinities is a pivotal task for lead optimization with acceptable on-target potency as well as pharmacological efficacy. Furthermore, researchers hope for a high correlation between docking score and pose with key interactive residues, although scoring functions as free energy surrogates of protein–ligand complexes have failed to provide collinearity. Recently, various machine learning or deep learning methods have been proposed to overcome the drawbacks of scoring functions. Despite being highly accurate, their featurization process is complex and the meaning of the embedded features cannot directly be interpreted by human recognition without an additional feature analysis. Here, we propose SMPLIP-Score (Substructural Molecular and Protein–Ligand Interaction Pattern Score), a direct interpretable predictor of absolute binding affinity. Our simple featurization embeds the interaction fingerprint pattern on the ligand-binding site environment and molecular fragments of ligands into an input vectorized matrix for learning layers (random forest or deep neural network). Despite using less complex features than other state-of-the-art models, SMPLIP-Score achieved comparable performance, a Pearson’s correlation coefficient up to 0.80, and a root mean square error up to 1.18 in pK units with several benchmark datasets (PDBbind v.2015, Astex Diverse Set, CSAR NRC HiQ, FEP, PDBbind NMR, and CASF-2016). For this model, generality, predictive power, ranking power, and robustness were examined using direct interpretation of feature matrices for specific targets.
Affiliation(s)
- Surendra Kumar
- Gachon Institute of Pharmaceutical Science & Department of Pharmacy, College of Pharmacy, Gachon University, 191 Hambakmoeiro, Yeonsu-gu, Incheon, Republic of Korea
- Mi-Hyun Kim
- Gachon Institute of Pharmaceutical Science & Department of Pharmacy, College of Pharmacy, Gachon University, 191 Hambakmoeiro, Yeonsu-gu, Incheon, Republic of Korea
|
74
|
Wee J, Xia K. Ollivier Persistent Ricci Curvature-Based Machine Learning for the Protein-Ligand Binding Affinity Prediction. J Chem Inf Model 2021; 61:1617-1626. [PMID: 33724038 DOI: 10.1021/acs.jcim.0c01415] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 11/29/2022]
Abstract
Efficient molecular featurization is one of the major issues for machine learning models in drug design. Here, we propose a persistent Ricci curvature (PRC), in particular, Ollivier PRC (OPRC), for the molecular featurization and feature engineering, for the first time. The filtration process proposed in the persistent homology is employed to generate a series of nested molecular graphs. Persistence and variation of Ollivier Ricci curvatures on these nested graphs are defined as OPRC. Moreover, persistent attributes, which are statistical and combinatorial properties of OPRCs during the filtration process, are used as molecular descriptors and further combined with machine learning models, in particular, gradient boosting tree (GBT). Our OPRC-GBT model is used in the prediction of the protein-ligand binding affinity, which is one of the key steps in drug design. Based on three of the most commonly used data sets from the well-established protein-ligand binding databank, that is, PDBbind, we intensively test our model and compare with existing models. It has been found that our model can achieve the state-of-the-art results and has advantages over traditional molecular descriptors.
Affiliation(s)
- JunJie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
|
75
|
Wang DD, Xie H, Yan H. Proteo-chemometrics interaction fingerprints of protein-ligand complexes predict binding affinity. Bioinformatics 2021; 37:2570-2579. [PMID: 33650636 DOI: 10.1093/bioinformatics/btab132] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/28/2020] [Revised: 01/10/2021] [Accepted: 02/25/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION Reliable predictive models of protein-ligand binding affinity are required in many areas of biomedical research. Accurate prediction based on current descriptors or molecular fingerprints remains a challenge. We develop novel interaction fingerprints (IFPs) to encode protein-ligand interactions and use them to improve the prediction. RESULTS Proteo-chemometrics IFPs (PrtCmm IFPs) formed by combining extended connectivity fingerprints (ECFPs) with the proteo-chemometrics concept, were developed. Combining PrtCmm IFPs with machine-learning models led to efficient scoring models, which were validated on the PDBbind v2019 core set and CSAR-HiQ sets. The PrtCmm IFP Score outperformed several other models in predicting protein-ligand binding affinities. Besides, conventional ECFPs were simplified to generate new IFPs, which provided consistent but faster predictions. The relationship between the base atom properties of ECFPs and the accuracy of predictions was also investigated. AVAILABILITY PrtCmm IFP has been implemented in the IFP Score Toolkit on github https://github.com/debbydanwang/IFPscore. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Debby D Wang
- Institute of Medical Information Engineering, School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology, 516 Jungong Rd, Shanghai 200093, China
- Haoran Xie
- Department of Computing and Decision Sciences, Lingnan University, 8 Castle Peak Rd, Tuen Mun, Hong Kong
- Hong Yan
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
|
76
|
Liu X, Wang X, Wu J, Xia K. Hypergraph-based persistent cohomology (HPC) for molecular representations in drug design. Brief Bioinform 2021; 22:6105940. [PMID: 33480394 DOI: 10.1093/bib/bbaa411] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 10/27/2020] [Revised: 12/02/2020] [Indexed: 12/30/2022]
Abstract
Artificial intelligence (AI) based drug design has demonstrated great potential to fundamentally change the pharmaceutical industries. Currently, a key issue in AI-based drug design is the lack of efficient, transferable molecular descriptors or fingerprints. Here, we present hypergraph-based molecular topological representation, hypergraph-based (weighted) persistent cohomology (HPC/HWPC) and HPC/HWPC-based molecular fingerprints for machine learning models in drug design. Molecular structures and their atomic interactions are highly complicated and pose great challenges for efficient mathematical representations. We develop the first hypergraph-based topological framework to characterize detailed molecular structures and interactions at atomic level. Inspired by the elegant path complex model, hypergraph-based embedded homology and persistent homology have been proposed recently. Based on them, we construct HPC/HWPC, and use them to generate molecular descriptors for learning models in protein-ligand binding affinity prediction, one of the key steps in drug design. Our models are tested on the three most commonly used databases, including PDBbind-v2007, PDBbind-v2013 and PDBbind-v2016, and outperform all existing machine learning models with traditional molecular descriptors. Our HPC/HWPC models have demonstrated great potential in AI-based drug design.
Affiliation(s)
- Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore; School of Mathematical Science and LPMC, Nankai University, 300071, Tianjin, China; Center for Topology and Geometry Based Technology, Hebei Normal University, 050024, Hebei, China
- Xiangjun Wang
- School of Mathematical Science and LPMC, Nankai University, 300071, Tianjin, China
- Jie Wu
- Center for Topology and Geometry Based Technology, Hebei Normal University, 050024, Hebei, China; School of Mathematical Sciences, Hebei Normal University, 050024, Hebei, China
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 637371, Singapore
|
77
|
Artificial intelligence in the early stages of drug discovery. Arch Biochem Biophys 2020; 698:108730. [PMID: 33347838 DOI: 10.1016/j.abb.2020.108730] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/05/2020] [Revised: 12/11/2020] [Accepted: 12/14/2020] [Indexed: 02/07/2023]
Abstract
Although the use of computational methods within the pharmaceutical industry is well established, there is an urgent need for new approaches that can improve and optimize the pipeline of drug discovery and development. In spite of the fact that there is no unique solution for this need for innovation, there has recently been a strong interest in the use of Artificial Intelligence for this purpose. As a matter of fact, not only there have been major contributions from the scientific community in this respect, but there has also been a growing partnership between the pharmaceutical industry and Artificial Intelligence companies. Beyond these contributions and efforts there is an underlying question, which we intend to discuss in this review: can the intrinsic difficulties within the drug discovery process be overcome with the implementation of Artificial Intelligence? While this is an open question, in this work we will focus on the advantages that these algorithms provide over the traditional methods in the context of early drug discovery.
|
78
|
Nguyen DD, Gao K, Chen J, Wang R, Wei GW. Unveiling the molecular mechanism of SARS-CoV-2 main protease inhibition from 137 crystal structures using algebraic topology and deep learning. Chem Sci 2020; 11:12036-12046. [PMID: 34123218 PMCID: PMC8162568 DOI: 10.1039/d0sc04641h] [Citation(s) in RCA: 57] [Impact Index Per Article: 14.3] [Received: 08/21/2020] [Accepted: 09/30/2020] [Indexed: 12/27/2022]
Abstract
Currently, there is neither an effective antiviral drug nor a vaccine for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Due to its high degree of conservation and low similarity with human genes, SARS-CoV-2 main protease (Mpro) is one of the most favorable drug targets. However, the current understanding of the molecular mechanism of Mpro inhibition is limited by the lack of reliable binding affinity ranking and prediction of existing structures of Mpro-inhibitor complexes. This work integrates mathematics (i.e., algebraic topology) and deep learning (MathDL) to provide a reliable ranking of the binding affinities of 137 SARS-CoV-2 Mpro inhibitor structures. We reveal that the Gly143 residue in Mpro is the most attractive site to form hydrogen bonds, followed by Glu166, Cys145, and His163. We also identify 71 targeted covalent bonding inhibitors. MathDL was validated on the PDBbind v2016 core set benchmark and a carefully curated SARS-CoV-2 inhibitor dataset to ensure the reliability of the present binding affinity prediction. The present binding affinity ranking, interaction analysis, and fragment decomposition offer a foundation for future drug discovery efforts.
Affiliation(s)
- Duc Duy Nguyen
  - Department of Mathematics, University of Kentucky KY 40506 USA
- Kaifu Gao
  - Department of Mathematics, Michigan State University MI 48824 USA
- Jiahui Chen
  - Department of Mathematics, Michigan State University MI 48824 USA
- Rui Wang
  - Department of Mathematics, Michigan State University MI 48824 USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University MI 48824 USA
  - Department of Biochemistry and Molecular Biology, Michigan State University MI 48824 USA
  - Department of Electrical and Computer Engineering, Michigan State University MI 48824 USA
|
79
|
Francoeur PG, Masuda T, Sunseri J, Jia A, Iovanisci RB, Snyder I, Koes DR. Three-Dimensional Convolutional Neural Networks and a Cross-Docked Data Set for Structure-Based Drug Design. J Chem Inf Model 2020; 60:4200-4215. [PMID: 32865404 PMCID: PMC8902699 DOI: 10.1021/acs.jcim.0c00411] [Citation(s) in RCA: 81] [Impact Index Per Article: 20.3] [Indexed: 12/14/2022]
Abstract
One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard data set of sufficient size to compare performance between models. We present a new data set for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network (CNN) models on this data set. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind data set, how performance improves by adding more lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of five densely connected CNNs, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized data set for training machine learning models to recognize ligands in noncognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this data set for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.
Affiliation(s)
- Paul G Francoeur
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Tomohide Masuda
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Jocelyn Sunseri
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Andrew Jia
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Richard B Iovanisci
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- Ian Snyder
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
- David R Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, United States
|
80
|
Selecting machine-learning scoring functions for structure-based virtual screening. Drug Discov Today Technol 2020; 32-33:81-87. [PMID: 33386098 DOI: 10.1016/j.ddtec.2020.09.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Received: 04/01/2020] [Revised: 09/02/2020] [Accepted: 09/07/2020] [Indexed: 12/27/2022]
Abstract
Interest in docking technologies has grown parallel to the ever increasing number and diversity of 3D models for macromolecular therapeutic targets. Structure-Based Virtual Screening (SBVS) aims at leveraging these experimental structures to discover the necessary starting points for the drug discovery process. It is now established that Machine Learning (ML) can strongly enhance the predictive accuracy of scoring functions for SBVS by exploiting large datasets from targets, molecules and their associations. However, with greater choice, the question of which ML-based scoring function is the most suitable for prospective use on a given target has gained importance. Here we analyse two approaches to select an existing scoring function for the target along with a third approach consisting in generating a scoring function tailored to the target. These analyses required discussing the limitations of popular SBVS benchmarks, the alternatives to benchmark scoring functions for SBVS and how to generate them or use them using freely-available software.
|
81
|
Adeshina YO, Deeds EJ, Karanicolas J. Machine learning classification can reduce false positives in structure-based virtual screening. Proc Natl Acad Sci U S A 2020; 117:18477-18488. [PMID: 32669436 PMCID: PMC7414157 DOI: 10.1073/pnas.2000585117] [Citation(s) in RCA: 99] [Impact Index Per Article: 24.8] [Indexed: 12/19/2022] Open
Abstract
With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery's search for active chemical matter. In typical virtual screens, however, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient thoughtfulness into the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because studies reporting new scoring methods have not validated their models prospectively within the same study. Here, we report a strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework. In retrospective benchmarks, our classifier shows outstanding performance relative to other scoring functions. In a prospective context, nearly all candidate inhibitors from a screen against acetylcholinesterase show detectable activity; beyond this, 10 of 23 compounds have IC50 better than 50 μM. Without any medicinal chemistry optimization, the most potent hit has IC50 280 nM, corresponding to Ki of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and for vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
Affiliation(s)
- Yusuf O Adeshina
  - Program in Molecular Therapeutics, Fox Chase Cancer Center, Philadelphia, PA 19111
  - Center for Computational Biology, University of Kansas, Lawrence, KS 66045
- Eric J Deeds
  - Center for Computational Biology, University of Kansas, Lawrence, KS 66045
  - Department of Molecular Biosciences, University of Kansas, Lawrence, KS 66045
- John Karanicolas
  - Program in Molecular Therapeutics, Fox Chase Cancer Center, Philadelphia, PA 19111
|
82
|
Wang DD, Zhu M, Yan H. Computationally predicting binding affinity in protein-ligand complexes: free energy-based simulations and machine learning-based scoring functions. Brief Bioinform 2020; 22:5860693. [PMID: 32591817 DOI: 10.1093/bib/bbaa107] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 03/07/2020] [Revised: 04/20/2020] [Accepted: 05/05/2020] [Indexed: 12/18/2022] Open
Abstract
Accurately predicting protein-ligand binding affinities can substantially facilitate the drug discovery process, but it remains a difficult problem. To tackle the challenge, many computational methods have been proposed. Among these methods, free energy-based simulations and machine learning-based scoring functions can potentially provide accurate predictions. In this paper, we review these two classes of methods, following a number of thermodynamic cycles for the free energy-based simulations and a feature-representation taxonomy for the machine learning-based scoring functions. More recent deep learning-based predictions, where a hierarchy of feature representations is generally extracted, are also reviewed. Strengths and weaknesses of the two classes of methods, coupled with future directions for improvements, are comparatively discussed.
Affiliation(s)
- Debby D Wang
  - School of Medical Instrument and Food Engineering, University of Shanghai for Science and Technology
- Mengxu Zhu
  - Department of Electrical Engineering, City University of Hong Kong
- Hong Yan
  - College of Science and Engineering, City University of Hong Kong
|
83
|
Shen C, Hu Y, Wang Z, Zhang X, Pang J, Wang G, Zhong H, Xu L, Cao D, Hou T. Beware of the generic machine learning-based scoring functions in structure-based virtual screening. Brief Bioinform 2020; 22:5850047. [PMID: 32484221 DOI: 10.1093/bib/bbaa070] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 01/28/2020] [Revised: 04/17/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Machine learning-based scoring functions (MLSFs) have attracted extensive attention recently and are expected to be potential rescoring tools for structure-based virtual screening (SBVS). However, a major concern nowadays is whether MLSFs trained for generic use rather than for a given target can be applied consistently to VS. In this study, a systematic assessment was carried out to re-evaluate the effectiveness of 14 reported MLSFs in VS. Overall, most of these MLSFs could hardly achieve satisfactory results for any dataset, and they could not even outperform the baseline of classical SFs such as Glide SP. An exception was observed for RFscore-VS trained on the Directory of Useful Decoys-Enhanced dataset, which was superior for most targets. However, in most cases, it showed rather limited performance on targets dissimilar to the proteins in the corresponding training sets. We also used the top three docking poses rather than the top one for rescoring and retrained the models with the updated versions of the training set, but only minor improvements were observed. Taken together, generic MLSFs may generalize too poorly to be applicable to real VS campaigns. Therefore, this type of method should be used for VS with caution.
Affiliation(s)
- Ye Hu
  - Central South University, China
- Lei Xu
  - Central South University, China
|
84
|
Gao K, Nguyen DD, Sresht V, Mathiowetz AM, Tu M, Wei GW. Are 2D fingerprints still valuable for drug discovery? Phys Chem Chem Phys 2020; 22:8373-8390. [PMID: 32266895 PMCID: PMC7224332 DOI: 10.1039/d0cp00305k] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Indexed: 12/21/2022]
Abstract
Recently, molecular fingerprints extracted from three-dimensional (3D) structures using advanced mathematics, such as algebraic topology, differential geometry, and graph theory have been paired with efficient machine learning, especially deep learning algorithms to outperform other methods in drug discovery applications and competitions. This raises the question of whether classical 2D fingerprints are still valuable in computer-aided drug discovery. This work considers 23 datasets associated with four typical problems, namely protein-ligand binding, toxicity, solubility and partition coefficient to assess the performance of eight 2D fingerprints. Advanced machine learning algorithms including random forest, gradient boosted decision tree, single-task deep neural network and multitask deep neural network are employed to construct efficient 2D-fingerprint based models. Additionally, appropriate consensus models are built to further enhance the performance of 2D-fingerprint-based methods. It is demonstrated that 2D-fingerprint-based models perform as well as the state-of-the-art 3D structure-based models for the predictions of toxicity, solubility, partition coefficient and protein-ligand binding affinity based on only ligand information. However, 3D structure-based models outperform 2D fingerprint-based methods in complex-based protein-ligand binding affinity predictions.
Affiliation(s)
- Kaifu Gao
  - Department of Mathematics, Michigan State University, MI 48824, USA
- Duc Duy Nguyen
  - Department of Mathematics, Michigan State University, MI 48824, USA
- Vishnu Sresht
  - Pfizer Medicine Design, 610 Main St, Cambridge, MA 02139, USA
- Meihua Tu
  - Pfizer Medicine Design, 610 Main St, Cambridge, MA 02139, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, MI 48824, USA
  - Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
|
85
|
Li H, Sze K, Lu G, Ballester PJ. Machine‐learning scoring functions for structure‐based virtual screening. Wiley Interdiscip Rev Comput Mol Sci 2020. [DOI: 10.1002/wcms.1478] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 12/14/2022]
Affiliation(s)
- Hongjian Li
  - Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258) Marseille France
  - CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
- Kam‐Heung Sze
  - CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
- Gang Lu
  - CUHK‐SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences Chinese University of Hong Kong Shatin Hong Kong
- Pedro J. Ballester
  - Cancer Research Center of Marseille (INSERM U1068, Institut Paoli‐Calmettes, Aix‐Marseille Université UM105, CNRS UMR7258) Marseille France
|
86
|
Abstract
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on protein-ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energies, toxicity, partition coefficients, protein folding stability changes upon mutation, etc.
Affiliation(s)
- Duc Duy Nguyen
  - Department of Mathematics, Michigan State University, MI 48824, USA
- Zixuan Cang
  - Department of Mathematics, Michigan State University, MI 48824, USA
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
  - Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
|
87
|
Xu Y, Cai C, Wang S, Lai L, Pei J. Efficient molecular encoders for virtual screening. Drug Discov Today Technol 2019; 32-33:19-27. [PMID: 33386090 DOI: 10.1016/j.ddtec.2020.08.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2020] [Revised: 08/23/2020] [Accepted: 08/28/2020] [Indexed: 06/12/2023]
Abstract
Molecular representations encoding molecular structure information play critical roles in molecular virtual screening (VS). In order to improve VS performance, an abundance of molecular encoders have been developed and tested by various VS challenges. Combinational strategies were also used to improve the performance. Deep learning (DL)-based molecular encoders have attracted much attention for their automatic information extraction ability. In this review, we present an overview of two-dimensional-, three-dimensional-, and DL-based molecular encoders, summarize recent progress of VS using DL technologies, and propose a general framework of DL molecular encoder-based VS. Perspectives on the future directions of molecular representations and applications in the prediction of active compounds are also provided.
Affiliation(s)
- Youjun Xu
  - BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, PR China
- Chenjing Cai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China
- Shiwei Wang
  - PTN Graduate Program, Academy for Advanced Interdisciplinary Studies, Peking University, 100871, PR China
- Luhua Lai
  - BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing 100871, PR China
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871, PR China
|
88
|
Boyles F, Deane CM, Morris GM. Learning from the ligand: using ligand-based features to improve binding affinity prediction. Bioinformatics 2019; 36:758-764. [DOI: 10.1093/bioinformatics/btz665] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 05/22/2019] [Revised: 08/14/2019] [Accepted: 08/21/2019] [Indexed: 12/27/2022] Open
Abstract
Motivation
Machine learning scoring functions for protein–ligand binding affinity prediction have been found to consistently outperform classical scoring functions. Structure-based scoring functions for universal affinity prediction typically use features describing interactions derived from the protein–ligand complex, with limited information about the chemical or topological properties of the ligand itself.
Results
We demonstrate that the performance of machine learning scoring functions is consistently improved by the inclusion of diverse ligand-based features. For example, a Random Forest (RF) combining the features of RF-Score v3 with RDKit molecular descriptors achieved Pearson correlation coefficients of up to 0.836, 0.780 and 0.821 on the PDBbind 2007, 2013 and 2016 core sets, respectively, compared to 0.790, 0.746 and 0.814 when using the features of RF-Score v3 alone. Excluding proteins and/or ligands that are similar to those in the test sets from the training set has a significant effect on scoring function performance, but does not remove the predictive power of ligand-based features. Furthermore, an RF using only ligand-based features is predictive at a level similar to classical scoring functions, and it appears to be predicting the mean binding affinity of a ligand for its protein targets.
Availability and implementation
Data and code to reproduce all the results are freely available at http://opig.stats.ox.ac.uk/resources.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Fergus Boyles
  - Department of Statistics, University of Oxford, Oxford, UK
|
89
|
Nguyen DD, Wei GW. AGL-Score: Algebraic Graph Learning Score for Protein-Ligand Binding Scoring, Ranking, Docking, and Screening. J Chem Inf Model 2019; 59:3291-3304. [PMID: 31257871 PMCID: PMC6664294 DOI: 10.1021/acs.jcim.9b00334] [Citation(s) in RCA: 128] [Impact Index Per Article: 25.6] [Indexed: 12/13/2022]
Abstract
Although algebraic graph theory-based models have been widely applied in physical modeling and molecular studies, they are typically incompetent in the analysis and prediction of biomolecular properties, confirming the common belief that "one cannot hear the shape of a drum". A new development in the century-old question about the spectrum-geometry relationship is provided. Novel algebraic graph learning score (AGL-Score) models are proposed to encode high-dimensional physical and biological information into intrinsically low-dimensional representations. The proposed AGL-Score models employ multiscale weighted colored subgraphs to describe crucial molecular and biomolecular interactions in terms of graph invariants derived from graph Laplacian, its pseudo-inverse, and adjacency matrices. Additionally, AGL-Score models are integrated with an advanced machine learning algorithm to predict biomolecular macroscopic properties from the low-dimensional graph representation of biomolecular structures. The proposed AGL-Score models are extensively validated for their scoring power, ranking power, docking power, and screening power via a number of benchmark datasets, namely CASF-2007, CASF-2013, and CASF-2016. Numerical results indicate that the proposed AGL-Score models are able to outperform other state-of-the-art scoring functions in protein-ligand binding scoring, ranking, docking, and screening. This study indicates that machine learning methods are powerful tools for molecular docking and virtual screening. It also indicates that spectral geometry or spectral graph theory has the ability to infer geometric properties.
Affiliation(s)
- Duc Duy Nguyen
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
|
90
|
Jeon W, Kim D. FP2VEC: a new molecular featurizer for learning molecular properties. Bioinformatics 2019; 35:4979-4985. [DOI: 10.1093/bioinformatics/btz307] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 11/28/2018] [Revised: 03/28/2019] [Accepted: 04/24/2019] [Indexed: 12/25/2022] Open
Abstract
Motivation
One of the most successful methods for predicting the properties of chemical compounds is the quantitative structure–activity relationship (QSAR) method. The prediction accuracy of QSAR models has recently been greatly improved by employing deep learning technology. In particular, newly developed molecular featurizers based on graph convolution operations on molecular graphs significantly outperform the conventional extended connectivity fingerprints (ECFP) feature in both classification and regression tasks, indicating that it is critical to develop more effective new featurizers to fully realize the power of deep learning techniques. Motivated by the fact that there is a clear analogy between chemical compounds and natural languages, this work develops a new molecular featurizer, FP2VEC, which represents a chemical compound as a set of trainable embedding vectors.
Results
To implement and test our new featurizer, we build a QSAR model using a simple convolutional neural network (CNN) architecture that has been successfully used for natural language processing tasks such as sentence classification. By testing our new method on several benchmark datasets, we demonstrate that the combination of FP2VEC and the CNN model can achieve competitive results in many QSAR tasks, especially in classification tasks. We also demonstrate that the FP2VEC model is especially effective for multitask learning.
Availability and implementation
FP2VEC is available from https://github.com/wsjeon92/FP2VEC.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Woosung Jeon
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon, Republic of Korea
- Dongsup Kim
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Yuseong-gu, Daejeon, Republic of Korea
|