1
|
Mukta FT, Rana MM, Meyer A, Ellingson S, Nguyen DD. The algebraic extended atom-type graph-based model for precise ligand-receptor binding affinity prediction. J Cheminform 2025; 17:10. [PMID: 39844277 PMCID: PMC11756177 DOI: 10.1186/s13321-025-00955-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Accepted: 01/10/2025] [Indexed: 01/24/2025] Open
Abstract
Accurate prediction of ligand-receptor binding affinity is crucial in structure-based drug design, significantly impacting the development of effective drugs. Recent advances in machine learning (ML)-based scoring functions have improved these predictions, yet challenges remain in modeling complex molecular interactions. This study introduces the AGL-EAT-Score, a scoring function that integrates extended atom-type multiscale weighted colored subgraphs with algebraic graph theory. This approach leverages the eigenvalues and eigenvectors of graph Laplacian and adjacency matrices to capture high-level details of specific atom pairwise interactions. Evaluated against benchmark datasets such as CASF-2016, CASF-2013, and the Cathepsin S dataset, the AGL-EAT-Score demonstrates notable accuracy, outperforming existing traditional and ML-based methods. The model's strength lies in its comprehensive similarity analysis, examining protein sequence, ligand structure, and binding site similarities, thus ensuring minimal bias and over-representation in the training sets. The use of extended atom types in graph coloring enhances the model's capability to capture the intricacies of protein-ligand interactions. The AGL-EAT-Score marks a significant advancement in drug design, offering a tool that could potentially refine and accelerate the drug discovery process.Scientific Contribution The AGL-EAT-Score presents an algebraic graph-based framework that predicts ligand-receptor binding affinity by constructing multiscale weighted colored subgraphs from the 3D structure of protein-ligand complexes. It improves prediction accuracy by modeling interactions between extended atom types, addressing challenges like dataset bias and over-representation. Benchmark evaluations demonstrate that AGL-EAT-Score outperforms existing methods, offering a robust and systematic tool for structure-based drug design.
Collapse
Affiliation(s)
| | - Md Masud Rana
- Department of Mathematics, Kennesaw State University, Kennesaw, GA, 30144, USA
| | - Avery Meyer
- Department of Mathematics, University of Kentucky, Lexington, KY, 40506, USA
| | - Sally Ellingson
- Division of Biomedical Informatics, College of Medicine, University of Kentucky, Lexington, KY, 40506, USA
| | - Duc D Nguyen
- Department of Mathematics, University of Tennessee, Knoxville, TN, 37996, USA.
| |
Collapse
|
2
|
Huang D, Wang Y, Sun Y, Ji W, Zhang Q, Jiang Y, Qiu H, Liu H, Lu T, Wei X, Chen Y, Zhang Y. A 4D tensor-enhanced multi-dimensional convolutional neural network for accurate prediction of protein-ligand binding affinity. Mol Divers 2024:10.1007/s11030-024-11044-y. [PMID: 39714563 DOI: 10.1007/s11030-024-11044-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2024] [Accepted: 11/02/2024] [Indexed: 12/24/2024]
Abstract
Protein-ligand interactions are the molecular basis of many important cellular activities, such as gene regulation, cell metabolism, and signal transduction. Protein-ligand binding affinity is a crucial metric of the strength of the interaction between the two, and accurate prediction of its binding affinity is essential for discovering drugs' new uses. So far, although many predictive models based on machine learning and deep learning have been reported, most of the models mainly focus on one-dimensional sequence and two-dimensional structural characteristics of proteins and ligands, but fail to deeply explore the detailed interaction information between proteins and ligand atoms in the binding pocket region of three-dimensional space. In this study, we introduced a novel 4D tensor feature to capture key interactions within the binding pocket and developed a three-dimensional convolutional neural network (CNN) model based on this feature. Using ten-fold cross-validation, we identified the optimal parameter combination and pocket size. Additionally, we employed feature engineering to extract features across multiple dimensions, including one-dimensional sequences, two-dimensional structures of the ligand and protein, and three-dimensional interaction features between them. We proposed an efficient protein-ligand binding affinity prediction model MCDTA (multi-dimensional convolutional drug-target affinity), built on a multi-dimensional convolutional neural network framework. Feature ablation experiments revealed that the 4D tensor feature had the most significant impact on model performance. MCDTA performed exceptionally well on the PDBbind v.2020 dataset, achieving an RMSE of 1.231 and a PCC of 0.823. In comparative experiments, it outperformed five other mainstream binding affinity prediction models, with an RMSE of 1.349 and a PCC of 0.795. Moreover, MCDTA demonstrated strong generalization ability and practical screening performance across multiple benchmark datasets, highlighting its reliability and accuracy in predicting protein-ligand binding affinity. The code for MCDTA is available at https://github.com/dfhuang-AI/MCDTA .
Collapse
Affiliation(s)
- Dingfang Huang
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Yu Wang
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Yiming Sun
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Wenhao Ji
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Qing Zhang
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Yunya Jiang
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Haodi Qiu
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Haichun Liu
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
| | - Tao Lu
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China
- State Key Laboratory of Natural Medicines, China Pharmaceutical University, No. 639 Longmian Dadao, Nanjing, 211198, China
| | - Xian Wei
- Department of Pharmacy, Youjiang Medical University for Nationalities, No. 98 ChengXiang Road, Youjiang District, Guangxi Zhuang Autonomous Region, Baise, 533000, China.
| | - Yadong Chen
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China.
| | - Yanmin Zhang
- Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing, 211198, China.
- Department of Pharmacy, Youjiang Medical University for Nationalities, No. 98 ChengXiang Road, Youjiang District, Guangxi Zhuang Autonomous Region, Baise, 533000, China.
| |
Collapse
|
3
|
Jiang S, Qi B, Xie S, Xie Z, Zhang H, Jiang W. Development and Validation of an Explainable Prediction Model for Postoperative Recurrence in Pediatric Chronic Rhinosinusitis. Otolaryngol Head Neck Surg 2024. [PMID: 39686801 DOI: 10.1002/ohn.1092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2024] [Revised: 11/08/2024] [Accepted: 11/29/2024] [Indexed: 12/18/2024]
Abstract
OBJECTIVE This study aims to develop an interpretable machine learning (ML) predictive model to assess its efficacy in predicting postoperative recurrence in pediatric chronic rhinosinusitis (CRS). STUDY DESIGN A decision analysis was performed with retrospective clinical data. SETTING Recurrent group and nonrecurrent group. METHODS This retrospective study included 148 pediatric CRS treated with functional endoscopic sinus surgery from January 2015 to January 2022. We collected demographic characteristics and peripheral blood inflammatory indices, and calculated inflammation indices. Models were trained with 3 ML algorithms and compared their predictive performance using the area under the receiver operating characteristic (AUC) curve. Shapley Additive Explanations and Ceteris Paribus profiles were used for model interpretation. The final model was transformed into a web for interactive visualization. RESULTS Among the 3 ML models, the Random Forest (RF) model demonstrated the best discriminative ability (AUC = 0.728). After reducing features based on importance and tuning parameters, the final RF model, including 4 features (systemic immune inflammation index (SII), pan-immune-inflammation value (PIV) and percentage of eosinophils (E%) and lymphocytes (L%)), showed good predictive performance in internal validation (AUC = 0.779). Global interpretation of the model suggested that L% and E% substantially contribute to the overall model. Local interpretation revealed a nonlinear relationship between the included features and model predictions. To enhance its clinical utility, the model was converted into a web (https://juice153.shinyapps.io/CRSRecurrencePrediction/). CONCLUSION Our ML model demonstrated promising accuracy in predicting postoperative recurrence in pediatric CRS, revealing a complex nonlinear relationship between postoperative recurrence and the features SII, PIV, L%, and E%.
Collapse
Affiliation(s)
- Sijie Jiang
- Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Hunan Province Key Laboratory of Otolaryngology Critical Diseases, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Anatomy Laboratory of Division of Nose and Cranial Base, Clinical Anatomy Center of Xiangya Hospital, Xiangya Hospital of Central South University, Changsha, People's Republic of China
| | - Bo Qi
- School of Computer Science and Engineering, Central South University, Changsha, People's Republic of China
| | - Shaobing Xie
- Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Hunan Province Key Laboratory of Otolaryngology Critical Diseases, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Anatomy Laboratory of Division of Nose and Cranial Base, Clinical Anatomy Center of Xiangya Hospital, Xiangya Hospital of Central South University, Changsha, People's Republic of China
| | - Zhihai Xie
- Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Hunan Province Key Laboratory of Otolaryngology Critical Diseases, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Anatomy Laboratory of Division of Nose and Cranial Base, Clinical Anatomy Center of Xiangya Hospital, Xiangya Hospital of Central South University, Changsha, People's Republic of China
| | - Hua Zhang
- Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Hunan Province Key Laboratory of Otolaryngology Critical Diseases, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Anatomy Laboratory of Division of Nose and Cranial Base, Clinical Anatomy Center of Xiangya Hospital, Xiangya Hospital of Central South University, Changsha, People's Republic of China
| | - Weihong Jiang
- Department of Otolaryngology-Head and Neck Surgery, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Hunan Province Key Laboratory of Otolaryngology Critical Diseases, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- National Clinical Research Center for Geriatric Disorders, Xiangya Hospital of Central South University, Changsha, People's Republic of China
- Anatomy Laboratory of Division of Nose and Cranial Base, Clinical Anatomy Center of Xiangya Hospital, Xiangya Hospital of Central South University, Changsha, People's Republic of China
| |
Collapse
|
4
|
Cao D, Chen M, Zhang R, Wang Z, Huang M, Yu J, Jiang X, Fan Z, Zhang W, Zhou H, Li X, Fu Z, Zhang S, Zheng M. SurfDock is a surface-informed diffusion generative model for reliable and accurate protein-ligand complex prediction. Nat Methods 2024:10.1038/s41592-024-02516-y. [PMID: 39604569 DOI: 10.1038/s41592-024-02516-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/16/2024] [Indexed: 11/29/2024]
Abstract
Accurately predicting protein-ligand interactions is crucial for understanding cellular processes. We introduce SurfDock, a deep-learning method that addresses this challenge by integrating protein sequence, three-dimensional structural graphs and surface-level features into an equivariant architecture. SurfDock employs a generative diffusion model on a non-Euclidean manifold, optimizing molecular translations, rotations and torsions to generate reliable binding poses. Our extensive evaluations across various benchmarks demonstrate SurfDock's superiority over existing methods in docking success rates and adherence to physical constraints. It also exhibits remarkable generalizability to unseen proteins and predicted apo structures, while achieving state-of-the-art performance in virtual screening tasks. In a real-world application, SurfDock identified seven novel hit molecules in a virtual screening project targeting aldehyde dehydrogenase 1B1, a key enzyme in cellular metabolism. This showcases SurfDock's ability to elucidate molecular mechanisms underlying cellular processes. These results highlight SurfDock's potential as a transformative tool in structural biology, offering enhanced accuracy, physical plausibility and practical applicability in understanding protein-ligand interactions.
Collapse
Affiliation(s)
- Duanhua Cao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
- Lingang Laboratory, Shanghai, China
| | - Runze Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhaokun Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Manlin Huang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- Nanchang University, Nanchang, China
| | - Jie Yu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- Lingang Laboratory, Shanghai, China
- School of Information Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xinyu Jiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhehuan Fan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Hao Zhou
- Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China
| | - Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
| | - Sulin Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Mingyue Zheng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China.
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China.
- University of Chinese Academy of Sciences, Beijing, China.
| |
Collapse
|
5
|
Hu Q, Wang Z, Meng J, Li W, Guo J, Mu Y, Wang S, Zheng L, Wei Y. OpenDock: a pytorch-based open-source framework for protein-ligand docking and modelling. Bioinformatics 2024; 40:btae628. [PMID: 39432683 PMCID: PMC11552628 DOI: 10.1093/bioinformatics/btae628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2024] [Revised: 09/19/2024] [Accepted: 10/19/2024] [Indexed: 10/23/2024] Open
Abstract
MOTIVATION Molecular docking is an invaluable computational tool with broad applications in computer-aided drug design and enzyme engineering. However, current molecular docking tools are typically implemented in languages such as C++ for calculation speed, which lack flexibility and user-friendliness for further development. Moreover, validating the effectiveness of external scoring functions for molecular docking and screening within these frameworks is challenging, and implementing more efficient sampling strategies is not straightforward. RESULTS To address these limitations, we have developed an open-source molecular docking framework, OpenDock, based on Python and PyTorch. This framework supports the integration of multiple scoring functions; some can be utilized during molecular docking and pose optimization, while others can be used for post-processing scoring. In terms of sampling, the current version of this framework supports simulated annealing and Monte Carlo optimization. Additionally, it can be extended to include methods such as genetic algorithms and particle swarm optimization for sampling docking poses and protein side chain orientations. Distance constraints are also implemented to enable covalent docking, restricted docking or distance map constraints guided pose sampling. Overall, this framework serves as a valuable tool in drug design and enzyme engineering, offering significant flexibility for most protein-ligand modelling tasks. AVAILABILITY AND IMPLEMENTATION OpenDock is publicly available at: https://github.com/guyuehuo/opendock.
Collapse
Affiliation(s)
- Qiuyue Hu
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518000, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
| | - Zechen Wang
- School of Physics, Shangdong University, Jinan, 250100, China
| | - Jintao Meng
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518000, China
| | - Weifeng Li
- School of Physics, Shangdong University, Jinan, 250100, China
| | - Jingjing Guo
- Centre in Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macao SAR, 999078, China
| | - Yuguang Mu
- School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
| | - Sheng Wang
- Shanghai Zelixir Biotech Co. Ltd, Shanghai, 201203, China
| | | | - Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518000, China
| |
Collapse
|
6
|
Prat A, Abdel Aty H, Bastas O, Kamuntavičius G, Paquet T, Norvaišas P, Gasparotto P, Tal R. HydraScreen: A Generalizable Structure-Based Deep Learning Approach to Drug Discovery. J Chem Inf Model 2024; 64:5817-5831. [PMID: 39037942 DOI: 10.1021/acs.jcim.4c00481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/24/2024]
Abstract
We propose HydraScreen, a deep-learning framework for safe and robust accelerated drug discovery. HydraScreen utilizes a state-of-the-art 3D convolutional neural network designed for the effective representation of molecular structures and interactions in protein-ligand binding. We designed an end-to-end pipeline for high-throughput screening and lead optimization, targeting applications in structure-based drug design. We assessed our approach using established public benchmarks based on the CASF-2016 core set, achieving top-tier results in affinity and pose prediction (Pearson's r = 0.86, RMSE = 1.15, Top-1 = 0.95). We introduced a novel approach for interaction profiling, aimed at detecting potential biases within both the model and data sets. This approach not only enhanced interpretability but also reinforced the impartiality of our methodology. Finally, we demonstrated HydraScreen's ability to generalize effectively across novel proteins and ligands through a temporal split. We also provide insights into potential avenues for future development aimed at enhancing the robustness of machine learning scoring functions. HydraScreen (accessible at http://hydrascreen.ro5.ai/paper) provides a user-friendly GUI and a public API, facilitating the easy-access assessment of protein-ligand complexes.
Collapse
Affiliation(s)
- Alvaro Prat
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | - Hisham Abdel Aty
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | - Orestis Bastas
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | | | - Tanya Paquet
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | - Povilas Norvaišas
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | - Piero Gasparotto
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| | - Roy Tal
- AI Chemistry, Ro5 2801 Gateway Drive, Irving, 75063 Texas, United States
| |
Collapse
|
7
|
Hong Q, Zhou G, Qin Y, Shen J, Li H. SadNet: a novel multimodal fusion network for protein-ligand binding affinity prediction. Phys Chem Chem Phys 2024; 26:12880-12891. [PMID: 38625412 DOI: 10.1039/d3cp05664c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/17/2024]
Abstract
Protein-ligand binding affinity prediction plays an important role in the field of drug discovery. Existing deep learning-based approaches have significantly improved the efficiency of protein-ligand binding affinity prediction through their excellent inductive bias capability. However, these methods only focus on fragmented three-dimensional data, which truncates the integrity of pocket data, leading to the neglect of potential long-range interactions. In this paper, we propose a dual-stream framework, with amino acid sequence assisting the atomic data fusion for graph neural network (termed SadNet), to fuse both 3D atomic data and sequence data for more accurate prediction results. In detail, SadNet consists of a pocket module and a sequence module. The sequence module expands the "receptive field" of the pocket module through a mid-term virtual node fusion. To better integrate sequence-level information from the sequence module and 3D structural information from the pocket module, we incorporate structural information for each amino acid within the sequence module. Besides, to better understand the intrinsic relationship between sequences and 3D atomic information, our SadNet utilizes information stacking from both the early stage and later stage. Experimental results on publicly available benchmark datasets demonstrate the superiority of the proposed dual-stream approach over the state-of-the-art alternatives. The code of this work is available online at https://github.com/wardhong/SadNet.
Collapse
Affiliation(s)
- Qiansen Hong
- Nanjing University of Posts and Telecommunications, NanJing, China.
| | - Guoqiang Zhou
- Nanjing University of Posts and Telecommunications, NanJing, China.
| | - Yuke Qin
- Nanjing University of Posts and Telecommunications, NanJing, China.
| | - Jun Shen
- University of Wollongong, Australia
| | | |
Collapse
|
8
|
Li X, Shen C, Zhu H, Yang Y, Wang Q, Yang J, Huang N. A High-Quality Data Set of Protein-Ligand Binding Interactions Via Comparative Complex Structure Modeling. J Chem Inf Model 2024; 64:2454-2466. [PMID: 38181418 DOI: 10.1021/acs.jcim.3c01170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/07/2024]
Abstract
High-quality protein-ligand complex structures provide the basis for understanding the nature of noncovalent binding interactions at the atomic level and enable structure-based drug design. However, experimentally determined complex structures are scarce compared with the vast chemical space. In this study, we addressed this issue by constructing the BindingNet data set via comparative complex structure modeling, which contains 69,816 modeled high-quality protein-ligand complex structures with experimental binding affinity data. BindingNet provides valuable insights into investigating protein-ligand interactions, allowing visual inspection and interpretation of structural analogues' structure-activity relationships. It can also be used for evaluating machine-learning-based scoring functions. Our results indicate that machine learning models trained on BindingNet could reduce the bias caused by buried solvent-accessible surface area, as we previously found for models trained on the PDBbind data set. We also discussed strategies to improve BindingNet and its potential utilization for benchmarking the molecular docking methods and ligand binding free energy calculation approaches. The BindingNet complements PDBbind in constructing a sufficient and unbiased protein-ligand binding data set and is freely available at http://bindingnet.huanglab.org.cn.
Collapse
Affiliation(s)
- Xuelian Li
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Cheng Shen
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Hui Zhu
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
| | - Yujian Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Qing Wang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Jincai Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
| | - Niu Huang
- National Institute of Biological Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing 102206, China
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing 102206, China
| |
Collapse
|
9
|
Qu X, Dong L, Luo D, Si Y, Wang B. Water Network-Augmented Two-State Model for Protein-Ligand Binding Affinity Prediction. J Chem Inf Model 2024; 64:2263-2274. [PMID: 37433009 DOI: 10.1021/acs.jcim.3c00567] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/13/2023]
Abstract
Water network rearrangement from the ligand-unbound state to the ligand-bound state is known to have significant effects on the protein-ligand binding interactions, but most of the current machine learning-based scoring functions overlook these effects. In this study, we endeavor to construct a comprehensive and realistic deep learning model by incorporating water network information into both ligand-unbound and -bound states. In particular, extended connectivity interaction features were integrated into graph representation, and graph transformer operator was employed to extract features of the ligand-unbound and -bound states. Through these efforts, we developed a water network-augmented two-state model called ECIFGraph::HM-Holo-Apo. Our new model exhibits satisfactory performance in terms of scoring, ranking, docking, screening, and reverse screening power tests on the CASF-2016 benchmark. In addition, it can achieve superior performance in large-scale docking-based virtual screening tests on the DEKOIS2.0 data set. Our study highlights that the use of a water network-augmented two-state model can be an effective strategy to bolster the robustness and applicability of machine learning-based scoring functions, particularly for targets with hydrophilic or solvent-exposed binding pockets.
Collapse
Affiliation(s)
- Xiaoyang Qu
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Lina Dong
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Ding Luo
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Yubing Si
- College of Chemistry, Zhengzhou University, Zhengzhou 450001, P. R. China
| | - Binju Wang
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China
| |
Collapse
|
10
|
Zhang X, Gao H, Wang H, Chen Z, Zhang Z, Chen X, Li Y, Qi Y, Wang R. PLANET: A Multi-objective Graph Neural Network Model for Protein-Ligand Binding Affinity Prediction. J Chem Inf Model 2024; 64:2205-2220. [PMID: 37319418 DOI: 10.1021/acs.jcim.3c00253] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
Predicting protein-ligand binding affinity is a central issue in drug design. Various deep learning models have been published in recent years, where many of them rely on 3D protein-ligand complex structures as input and tend to focus on the single task of reproducing binding affinity. In this study, we have developed a graph neural network model called PLANET (Protein-Ligand Affinity prediction NETwork). This model takes the graph-represented 3D structure of the binding pocket on the target protein and the 2D chemical structure of the ligand molecule as input. It was trained through a multi-objective process with three related tasks, including deriving the protein-ligand binding affinity, protein-ligand contact map, and ligand distance matrix. Besides the protein-ligand complexes with known binding affinity data retrieved from the PDBbind database, a large number of non-binder decoys were also added to the training data for deriving the final model of PLANET. When tested on the CASF-2016 benchmark, PLANET exhibited a scoring power comparable to the best result yielded by other deep learning models as well as a reasonable ranking power and docking power. In virtual screening trials conducted on the DUD-E benchmark, PLANET's performance was notably better than several deep learning and machine learning models. As on the LIT-PCBA benchmark, PLANET achieved comparable accuracy as the conventional docking program Glide, but it only spent less than 1% of Glide's computation time to finish the same job because PLANET did not need exhaustive conformational sampling. Considering the decent accuracy and efficiency of PLANET in binding affinity prediction, it may become a useful tool for conducting large-scale virtual screening.
Collapse
Affiliation(s)
- Xiangying Zhang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Haotian Gao
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Haojie Wang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Zhihang Chen
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Zhe Zhang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Xinchong Chen
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Yan Li
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Yifei Qi
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| | - Renxiao Wang
- Department of Medicinal Chemistry, School of Pharmacy, Fudan University, 826 Zhangheng Road, Shanghai 201203, People's Republic of China
| |
Collapse
|
11
|
Luo D, Liu D, Qu X, Dong L, Wang B. Enhancing Generalizability in Protein-Ligand Binding Affinity Prediction with Multimodal Contrastive Learning. J Chem Inf Model 2024; 64:1892-1906. [PMID: 38441880 DOI: 10.1021/acs.jcim.3c01961] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/26/2024]
Abstract
Improving the generalization ability of scoring functions remains a major challenge in protein-ligand binding affinity prediction. Many machine learning methods are limited by their reliance on single-modal representations, hindering a comprehensive understanding of protein-ligand interactions. We introduce a graph-neural-network-based scoring function that utilizes a triplet contrastive learning loss to improve protein-ligand representations. In this model, three-dimensional complex representations and the fusion of two-dimensional ligand and coarse-grained pocket representations converge while distancing from decoy representations in latent space. After rigorous validation on multiple external data sets, our model exhibits commendable generalization capabilities compared to those of other deep learning-based scoring functions, marking it as a promising tool in the realm of drug discovery. In the future, our training framework can be extended to other biophysical- and biochemical-related problems such as protein-protein interaction and protein mutation prediction.
Collapse
Affiliation(s)
- Ding Luo
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Dandan Liu
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Xiaoyang Qu
- School of Pharmacy and Medical Technology, Putian University, Putian 351100, P. R. China
- Key Laboratory of Pharmaceutical Analysis and Laboratory Medicine (Putian University), Fujian Province University, Putian 351100, P. R. China
| | - Lina Dong
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
| | - Binju Wang
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China
| |
Collapse
|
12
|
Tang Y, Moretti R, Meiler J. Recent Advances in Automated Structure-Based De Novo Drug Design. J Chem Inf Model 2024; 64:1794-1805. [PMID: 38485516 PMCID: PMC10966644 DOI: 10.1021/acs.jcim.4c00247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2024] [Revised: 02/26/2024] [Accepted: 02/29/2024] [Indexed: 03/26/2024]
Abstract
As the number of determined and predicted protein structures and the size of druglike 'make-on-demand' libraries soar, the time-consuming nature of structure-based computer-aided drug design calls for innovative computational algorithms. De novo drug design introduces in silico heuristics to accelerate searching in the vast chemical space. This review focuses on recent advances in structure-based de novo drug design, ranging from conventional fragment-based methods, evolutionary algorithms, and Metropolis Monte Carlo methods to deep generative models. Due to the historical limitation of de novo drug design generating readily available drug-like molecules, we highlight the synthetic accessibility efforts in each category and the benchmarking strategies taken to validate the proposed framework.
Collapse
Affiliation(s)
- Yidan Tang
- Department
of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
| | - Rocco Moretti
- Department
of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Center
for Structural Biology, Vanderbilt University, Nashville, Tennessee 37240, United States
| | - Jens Meiler
- Department
of Chemistry, Vanderbilt University, Nashville, Tennessee 37235, United States
- Center
for Structural Biology, Vanderbilt University, Nashville, Tennessee 37240, United States
- Institute
of Drug Discovery, Faculty of Medicine, University of Leipzig, 04103 Leipzig, Germany
| |
Collapse
|
13
|
Niu Q, Wang G, Liu B, Zhang R, Lei J, Wang H, Liu M. Selection and prediction of metro station sites based on spatial data and random forest: a study of Lanzhou, China. Sci Rep 2023; 13:22542. [PMID: 38110563 PMCID: PMC10728089 DOI: 10.1038/s41598-023-49877-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2023] [Accepted: 12/13/2023] [Indexed: 12/20/2023] Open
Abstract
Urban economic development, congestion relief, and traffic efficiency are all greatly impacted by the thoughtful planning of urban metro station layout. with the urban area of Lanzhou as an example, the suitability of the station locations of the built metro stations of the rail transit lines 1 and 2 in the study area have been evaluated using multi-source heterogeneous spatial data through data collection, feature matrix construction, the use of random forest and K-fold cross-validation, among other methods. The average Gini reduction value was used to examine the contribution rate of each feature indicator based on the examination of model truthfulness. According to the study's findings: (1) K-fold cross-validation was applied to test the random forest model that was built using the built metro stations and particular factors. The average accuracy of the tests and out-of-bag data (OOB) of tenfold cross-validation were 89.62% and 91.285%, respectively. Additionally, the AUC area under the ROC curve was 0.9823, indicating that this time, from the perspective of the natural environment, traffic location, and social factors The 19 elements selected from the views of the urban function structure, social economics, and natural environment are closely associated to the locations of the metro station in the research region, and the prediction the findings are more reliable; (2) It becomes apparent that more than half of the built station sites display excellent agreement with the predicted sites in terms of geographical location by superimposing the built metro station sites with the prediction results and tally up their cumulative prediction probability values within the 300 m buffering zone; (3) Based on the contribution rate of each indicator to the model, transport facilities, companies, population density, night lighting, science, education and culture, residential communities, and road network density are identified as the primary influential factors, each accounting for over 6.6%. Subsequently, land use, elevation, and slope are found to have relatively lower contributions. The results of the research provided important information for the local metro's best location selection and planning.
Collapse
Affiliation(s)
- Quanfu Niu
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China.
- Emergency Mapping Engineering Research Center of Gansu Province, Lanzhou, 730050, China.
- Academician Expert Workstation of Gansu Dayu Jiu Zhou Space Information Technology Co., Ltd, Lanzhou, 730050, China.
| | - Gang Wang
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| | - Bo Liu
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| | - Ruizhen Zhang
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| | - Jiaojiao Lei
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| | - Hao Wang
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| | - Mingzhi Liu
- School of Civil Engineering, Lanzhou University of Technology, Lanzhou, 730050, China
| |
Collapse
|
14
|
Shen T, Liu F, Wang Z, Sun J, Bu Y, Meng J, Chen W, Yao K, Mu Y, Li W, Zhao G, Wang S, Wei Y, Zheng L. zPoseScore model for accurate and robust protein-ligand docking pose scoring in CASP15. Proteins 2023; 91:1837-1849. [PMID: 37606194 DOI: 10.1002/prot.26573] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2023] [Revised: 07/20/2023] [Accepted: 07/31/2023] [Indexed: 08/23/2023]
Abstract
We introduce a deep learning-based ligand pose scoring model called zPoseScore for predicting protein-ligand complexes in the 15th Critical Assessment of Protein Structure Prediction (CASP15). Our contributions are threefold: first, we generate six training and evaluation data sets by employing advanced data augmentation and sampling methods. Second, we redesign the "zFormer" module, inspired by AlphaFold2's Evoformer, to efficiently describe protein-ligand interactions. This module enables the extraction of protein-ligand paired features that lead to accurate predictions. Finally, we develop the zPoseScore framework with zFormer for scoring and ranking ligand poses, allowing for atomic-level protein-ligand feature encoding and fusion to output refined ligand poses and ligand per-atom deviations. Our results demonstrate excellent performance on various testing data sets, achieving Pearson's correlation R = 0.783 and 0.659 for ranking docking decoys generated based on experimental and predicted protein structures of CASF-2016 protein-ligand complexes. Additionally, we obtain an averaged local distance difference test (lDDT pli = 0.558) of AIchemy LIG2 in CASP15 for de novo protein-ligand complex structure predictions. Detailed analysis shows that accurate ligand binding site prediction and side-chain orientation are crucial for achieving better prediction performance. Our proposed model is one of the most accurate protein-ligand pose prediction models and could serve as a valuable tool in small molecule drug discovery.
Collapse
Affiliation(s)
- Tao Shen
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Fuxu Liu
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Zechen Wang
- School of Physics, Shandong University, Jinan, Shandong, China
| | - Jinyuan Sun
- Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Yifan Bu
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Jintao Meng
- Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Weihua Chen
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Keyi Yao
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Yuguang Mu
- School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
| | - Weifeng Li
- School of Physics, Shandong University, Jinan, Shandong, China
| | - Guoping Zhao
- Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
| | - Yanjie Wei
- Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Liangzhen Zheng
- Shanghai Zelixir Biotech Company Ltd., Shanghai, China
- Shenzhen Key Laboratory of Intelligent Bioinformatics, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| |
Collapse
|
15
|
Zhao X, Li H, Zhang K, Huang SY. Iterative Knowledge-Based Scoring Function for Protein-Ligand Interactions by Considering Binding Affinity Information. J Phys Chem B 2023; 127:9021-9034. [PMID: 37822259 DOI: 10.1021/acs.jpcb.3c04421] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/13/2023]
Abstract
Scoring functions for protein-ligand interactions play a critical role in structure-based drug design. Owing to the good balance between general applicability and computational efficiency, knowledge-based scoring functions have obtained significant advancements and achieved many successes. Nevertheless, knowledge-based scoring functions face a challenge in utilizing the experimental affinity data and thus may not perform well in binding affinity prediction. Addressing the challenge, we have proposed an improved version of the iterative knowledge-based scoring function ITScore by considering binding affinity information, which is referred to as ITScoreAff, based on a large training set of 6216 protein-ligand complexes with both structures and affinity data. ITScoreAff was extensively evaluated and compared with ITScore, 33 traditional, and 6 machine learning scoring functions in terms of docking power, ranking power, and screening power on the independent CASF-2016 benchmark. It was shown that ITScoreAff obtained an overall better performance than the other 40 scoring functions and gave an average success rate of 85.3% in docking power, a correlation coefficient of 0.723 in scoring power, and an average rank correlation coefficient of 0.668 in ranking power. In addition, ITScoreAff also achieved the overall best screening power when the top 10% of the ranked database were considered. These results demonstrated the robustness of ITScoreAff and its improvement over existing scoring functions.
Collapse
Affiliation(s)
- Xuejun Zhao
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| | - Hao Li
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| | - Keqiong Zhang
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| | - Sheng-You Huang
- School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, P. R. China
| |
Collapse
|
16
|
Dong L, Shi S, Qu X, Luo D, Wang B. Ligand binding affinity prediction with fusion of graph neural networks and 3D structure-based complex graph. Phys Chem Chem Phys 2023; 25:24110-24120. [PMID: 37655493 DOI: 10.1039/d3cp03651k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/02/2023]
Abstract
Accurate prediction of protein-ligand binding affinity is pivotal for drug design and discovery. Here, we proposed a novel deep fusion graph neural networks framework named FGNN to learn the protein-ligand interactions from the 3D structures of protein-ligand complexes. Unlike 1D sequences for proteins or 2D graphs for ligands, the 3D graph of protein-ligand complex enables the more accurate representations of the protein-ligand interactions. Benchmark studies have shown that our fusion models FGNN can achieve more accurate prediction of binding affinity than any individual algorithm. The advantages of fusion strategies have been demonstrated in terms of expressive power of data, learning efficiency and model interpretability. Our fusion models show satisfactory performances on diverse data sets, demonstrating their generalization ability. Given the good performances in both binding affinity prediction and virtual screening, our fusion models are expected to be practically applied for drug screening and design. Our work highlights the potential of the fusion graph neural network algorithm in solving complex prediction problems in computational biology and chemistry. The fusion graph neural networks (FGNN) model is freely available in https://github.com/LinaDongXMU/FGNN.
Collapse
Affiliation(s)
- Lina Dong
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China.
| | - Shuai Shi
- Department of Algorithm, TuringQ Co., Ltd., Shanghai, 200240, China
| | - Xiaoyang Qu
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China.
| | - Ding Luo
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China.
| | - Binju Wang
- State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen, 361005, China.
- Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen, 361005, China
| |
Collapse
|
17
|
Spiers RC, Norby C, Kalivas JH. Physicochemical Responsive Integrated Similarity Measure (PRISM) for a Comprehensive Quantitative Perspective of Sample Similarity Dynamically Assessed with NIR Spectra. Anal Chem 2023; 95:12776-12784. [PMID: 37594455 DOI: 10.1021/acs.analchem.3c01616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/19/2023]
Abstract
Determining sample similarity underlies many foundational principles in analytical chemistry. For example, calibration models are unsuitable to predict outliers. Calibration transfer methods assume a moderate degree of sample and measurement dissimilarities between a calibration set and target prediction samples. Classification approaches link target sample similarities to groups of similar class samples. Although similarity is ubiquitous in analytical chemistry and everyday life, quantifying sample similarity is without a straightforward solution, especially when target domain samples are unlabeled and the only known features are measurable, such as spectra (the focus of this paper). The process proposed to assess sample similarity integrates spectral similarity information with contextual considerations among source analyte contents, model, and analyte predictions. This hybrid approach named the physicochemical responsive integrated similarity measure (PRISM) amplifies hidden-but-essential physicochemical properties encoded within respective spectra. PRISM is tested on four near-infrared (NIR) data sets for four diverse application areas to show efficacy. These applications are the assessment of prediction reliability and model updating for model generalizability, outlier detection, and basic matrix matching evaluation. Discussion is provided on adapting PRISM to classification problems. Results indicate that PRISM collects large amounts of similarity information and effectively integrates it to produce a quantitative similarity evaluation between the target sample and a source domain. The approach is also useful for biological samples with additional physiochemical variations. While PRISM is dynamically tested on NIR data, parts of PRISM were previously applied to other data types, and PRISM should be applicable to other measurement systems perturbed by matrix effects.
Collapse
Affiliation(s)
- Robert C Spiers
- Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, United States
| | - Callan Norby
- Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, United States
| | - John H Kalivas
- Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, United States
| |
Collapse
|
18
|
Shen C, Zhang X, Hsieh CY, Deng Y, Wang D, Xu L, Wu J, Li D, Kang Y, Hou T, Pan P. A generalized protein-ligand scoring framework with balanced scoring, docking, ranking and screening powers. Chem Sci 2023; 14:8129-8146. [PMID: 37538816 PMCID: PMC10395315 DOI: 10.1039/d3sc02044d] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2023] [Accepted: 07/03/2023] [Indexed: 08/05/2023] Open
Abstract
Applying machine learning algorithms to protein-ligand scoring functions has aroused widespread attention in recent years due to the high predictive accuracy and affordable computational cost. Nevertheless, most machine learning-based scoring functions are only applicable to a specific task, e.g., binding affinity prediction, binding pose prediction or virtual screening, suggesting that the development of a scoring function with balanced performance in all critical tasks remains a grand challenge. To this end, we propose a novel parameterization strategy by introducing an adjustable binding affinity term that represents the correlation between the predicted outcomes and experimental data into the training of mixture density network. The resulting residue-atom distance likelihood potential not only retains the superior docking and screening power over all the other state-of-the-art approaches, but also achieves a remarkable improvement in scoring and ranking performance. We emphatically explore the impacts of several key elements on prediction accuracy as well as the task preference, and demonstrate that the performance of scoring/ranking and docking/screening tasks of a certain model could be well balanced through an appropriate manner. Overall, our study highlights the potential utility of our innovative parameterization strategy as well as the resulting scoring framework in future structure-based drug design.
Collapse
Affiliation(s)
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
- School of Public Health, Zhejiang University Hangzhou 310058 Zhejiang China
- CarbonSilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd Hangzhou 310018 Zhejiang China
| | - Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology Changzhou 213001 China
| | - Jian Wu
- School of Public Health, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Dan Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
- State Key Lab of CAD&CG, Zhejiang University Hangzhou 310058 Zhejiang China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University Hangzhou 310058 Zhejiang China
| |
Collapse
|
19
|
Londono J, Ghasemi S, Hussain Shah A, Fahimipour A, Ghadimi N, Hashemi S, Khurshid Z, Dashti M. Evaluation of deep learning and convolutional neural network algorithms accuracy for detecting and predicting anatomical landmarks on 2D lateral cephalometric images: A systematic review and meta-analysis. Saudi Dent J 2023; 35:487-497. [PMID: 37520606 PMCID: PMC10373073 DOI: 10.1016/j.sdentj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2023] [Revised: 05/15/2023] [Accepted: 05/17/2023] [Indexed: 08/01/2023] Open
Abstract
Introduction Cephalometry is the study of skull measurements for clinical evaluation, diagnosis, and surgical planning. Machine learning (ML) algorithms have been used to accurately identify cephalometric landmarks and detect irregularities related to orthodontics and dentistry. ML-based cephalometric imaging reduces errors, improves accuracy, and saves time. Method In this study, we conducted a meta-analysis and systematic review to evaluate the accuracy of ML software for detecting and predicting anatomical landmarks on two-dimensional (2D) lateral cephalometric images. The meta-analysis followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines for selecting and screening research articles. The eligibility criteria were established based on the diagnostic accuracy and prediction of ML combined with 2D lateral cephalometric imagery. The search was conducted among English articles in five databases, and data were managed using Review Manager software (v. 5.0). Quality assessment was performed using the diagnostic accuracy studies (QUADAS-2) tool. Result Summary measurements included the mean departure from the 1-4-mm threshold or the percentage of landmarks identified within this threshold with a 95% confidence interval (CI). This meta-analysis included 21 of 577 articles initially collected on the accuracy of ML algorithms for detecting and predicting anatomical landmarks. The studies were conducted in various regions of the world, and 20 of the studies employed convolutional neural networks (CNNs) for detecting cephalometric landmarks. The pooled successful detection rates for the 1-mm, 2-mm, 2.5-mm, 3-mm, and 4-mm ranges were 65%, 81%, 86%, 91%, and 96%, respectively. Heterogeneity was determined using the random effect model. Conclusion In conclusion, ML has shown promise for landmark detection in 2D cephalometric imagery, although the accuracy has varied among studies and clinicians. Consequently, more research is required to determine its effectiveness and reliability in clinical settings.
Collapse
Affiliation(s)
- Jimmy Londono
- FACP, Professor and Director of the Prosthodontics Residency Program and the Ronald Goldstein Center for Esthetics and Implant Dentistry, Dental College of Georgia at Augusta University, Augusta, GA, United States
| | - Shohreh Ghasemi
- Department of Oral and Maxillofacial Surgery, The Dental College of Georgia at Augusta University, Augusta, GA, United States
| | - Altaf Hussain Shah
- Special Care Dentistry Clinics, University Dental Hospital, King Saud University Medical City, Riyadh, Saudi Arabia
| | - Amir Fahimipour
- School of Dentistry, Faculty of Medicine and Health, Westmead Centre for Oral Health, The University of Sydney, NSW 2145, Australia
| | - Niloofar Ghadimi
- Department of Oral and Maxillofacial Radiology, Dental School, Islamic Azad University of Medical Sciences, Tehran, Iran
| | - Sara Hashemi
- School of Dentistry, Isfahan University of Medical Sciences, Isfahan, Iran
| | - Zohaib Khurshid
- Department of Prosthodontics and Dental Implantology, King Faisal University, Al-Ahsa 31982, Saudi Arabia
- Center of Excellence for Regenerative Dentistry, Department of Anatomy, Faculty of Dentistry, Chulalongkorn University, Bangkok 10330, Thailand
| | - Mahmood Dashti
- School of Dentistry, Shahid Beheshti University of Medical Sciences, Tehran, Iran
| |
Collapse
|
20
|
Rayka M, Firouzi R. GB-score: Minimally designed machine learning scoring function based on distance-weighted interatomic contact features. Mol Inform 2023; 42:e2200135. [PMID: 36722733 DOI: 10.1002/minf.202200135] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2022] [Revised: 11/24/2022] [Accepted: 11/28/2022] [Indexed: 02/02/2023]
Abstract
In recent years, thanks to advances in computer hardware and dataset availability, data-driven approaches (like machine learning) have become one of the essential parts of the drug design framework to accelerate drug discovery procedures. Constructing a new scoring function, a function that can predict the binding score for a generated protein-ligand pose during docking procedure or a crystal complex, based on machine and deep learning has become an active research area in computer-aided drug design. GB-Score is a state-of-the-art machine learning-based scoring function that utilizes distance-weighted interatomic contact features, PDBbind-v2019 general set, and Gradient Boosting Trees algorithm to the binding affinity prediction. The distance-weighted interatomic contact featurization method used the distance between different ligand and protein atom types for numerical representation of the protein-ligand complex. GB-Score attains Pearson's correlation 0.862 and RMSE 1.190 on the CASF-2016 benchmark test in the scoring power metric. GB-Score's codes are freely available on the web at https://github.com/miladrayka/GB_Score.
Collapse
Affiliation(s)
- Milad Rayka
- Department of Physical Chemistry, Chemistry and Chemical Engineering Research Center of Iran, Tehran, Iran
| | - Rohoullah Firouzi
- Department of Physical Chemistry, Chemistry and Chemical Engineering Research Center of Iran, Tehran, Iran
| |
Collapse
|
21
|
New avenues in artificial-intelligence-assisted drug discovery. Drug Discov Today 2023; 28:103516. [PMID: 36736583 DOI: 10.1016/j.drudis.2023.103516] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2022] [Revised: 12/08/2022] [Accepted: 01/26/2023] [Indexed: 02/05/2023]
Abstract
Over the past decade, the amount of biomedical data available has grown at unprecedented rates. Increased automation technology and larger data volumes have encouraged the use of machine learning (ML) or artificial intelligence (AI) techniques for mining such data and extracting useful patterns. Because the identification of chemical entities with desired biological activity is a crucial task in drug discovery, AI technologies have the potential to accelerate this process and support decision making. In addition, the advent of deep learning (DL) has shown great promise in addressing diverse problems in drug discovery, such as de novo molecular design. Herein, we will appraise the current state-of-the-art in AI-assisted drug discovery, discussing the recent applications covering generative models for chemical structure generation, scoring functions to improve binding affinity and pose prediction, and molecular dynamics to assist in the parametrization, featurization and generalization tasks. Finally, we will discuss current hurdles and the strategies to overcome them, as well as potential future directions.
Collapse
|
22
|
Wang Z, Zheng L, Wang S, Lin M, Wang Z, Kong AWK, Mu Y, Wei Y, Li W. A fully differentiable ligand pose optimization framework guided by deep learning and a traditional scoring function. Brief Bioinform 2023; 24:6887112. [PMID: 36502369 DOI: 10.1093/bib/bbac520] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2022] [Revised: 10/17/2022] [Accepted: 10/31/2022] [Indexed: 12/14/2022] Open
Abstract
The recently reported machine learning- or deep learning-based scoring functions (SFs) have shown exciting performance in predicting protein-ligand binding affinities with fruitful application prospects. However, the differentiation between highly similar ligand conformations, including the native binding pose (the global energy minimum state), remains challenging that could greatly enhance the docking. In this work, we propose a fully differentiable, end-to-end framework for ligand pose optimization based on a hybrid SF called DeepRMSD+Vina combined with a multi-layer perceptron (DeepRMSD) and the traditional AutoDock Vina SF. The DeepRMSD+Vina, which combines (1) the root mean square deviation (RMSD) of the docking pose with respect to the native pose and (2) the AutoDock Vina score, is fully differentiable; thus is capable of optimizing the ligand binding pose to the energy-lowest conformation. Evaluated by the CASF-2016 docking power dataset, the DeepRMSD+Vina reaches a success rate of 94.4%, which outperforms most reported SFs to date. We evaluated the ligand conformation optimization framework in practical molecular docking scenarios (redocking and cross-docking tasks), revealing the high potentialities of this framework in drug design and discovery. Structural analysis shows that this framework has the ability to identify key physical interactions in protein-ligand binding, such as hydrogen-bonding. Our work provides a paradigm for optimizing ligand conformations based on deep learning algorithms. The DeepRMSD+Vina model and the optimization framework are available at GitHub repository https://github.com/zchwang/DeepRMSD-Vina_Optimization.
Collapse
Affiliation(s)
- Zechen Wang
- School of Physics, Shandong University, Jinan, Shandong 250100, China
| | - Liangzhen Zheng
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China.,Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
| | - Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
| | - Mingzhi Lin
- Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
| | - Zhihao Wang
- School of Physics, Shandong University, Jinan, Shandong 250100, China
| | - Adams Wai-Kin Kong
- Rolls-Royce Corporate Lab, Nanyang Technological University, Singapore 637551, Singapore
| | - Yuguang Mu
- School of Biological Sciences, Nanyang Technological University, Singapore 637551, Singapore
| | - Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
| | - Weifeng Li
- School of Physics, Shandong University, Jinan, Shandong 250100, China
| |
Collapse
|
23
|
Kanakala G, Aggarwal R, Nayar D, Priyakumar UD. Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets. ACS OMEGA 2023; 8:2389-2397. [PMID: 36687059 PMCID: PMC9850481 DOI: 10.1021/acsomega.2c06781] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein-ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein-ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein-ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.
Collapse
Affiliation(s)
| | - Rishal Aggarwal
- International
Institute of Information Technology, Hyderabad500 032, India
| | - Divya Nayar
- Department
of Materials Science and Engineering, Indian
Institute of Technology Delhi, Hauz Khas, New Delhi110016, India
| | - U. Deva Priyakumar
- International
Institute of Information Technology, Hyderabad500 032, India
| |
Collapse
|
24
|
Assessing How Residual Errors of Scoring Functions Correlate to Ligand Structural Features. Int J Mol Sci 2022; 23:ijms232315018. [PMID: 36499344 PMCID: PMC9739603 DOI: 10.3390/ijms232315018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2022] [Revised: 11/08/2022] [Accepted: 11/10/2022] [Indexed: 12/02/2022] Open
Abstract
Scoring functions (SFs) are ubiquitous tools for early stage drug discovery. However, their accuracy currently remains quite moderate. Despite a number of successful target-specific SFs appearing recently, up until now, no ideas on how to systematically improve the general scope of SFs have been formulated. In this work, we hypothesized that the specific features of ligands, corresponding to interactions well appreciated by medicinal chemists (e.g., hydrogen bonds, hydrophobic and aromatic interactions), might be responsible, in part, for the remaining SF errors. The latter provides direction to efforts aimed at the rational and systematic improvement of SF accuracy. In this proof-of-concept work, we took a CASF-2016 coreset of 285 ligands as a basis for comparison and calculated the values of scores for a representative panel of SFs (including AutoDock 4.2, AutoDock Vina, X-Score, NNScore2.0, ΔVina RF20, and DSX). The residual error of linear correlation of each SF value, with the experimental values of affinity and activity, was then analyzed in terms of its correlation with the presence of the fragments responsible for certain medicinal chemistry defined interactions. We showed that, despite the fact that SFs generally perform reasonably, there is room for improvement in terms of better parameterization of interactions involving certain fragments in ligands. Thus, this approach opens a potential way for the systematic improvement of SFs without their significant complication. However, the straightforward application of the proposed approach is limited by the scarcity of reliable available data for ligand-receptor complexes, which is a common problem in the field.
Collapse
|
25
|
Boyles F, Deane CM, Morris GM. Learning from Docked Ligands: Ligand-Based Features Rescue Structure-Based Scoring Functions When Trained on Docked Poses. J Chem Inf Model 2022; 62:5329-5341. [PMID: 34469150 DOI: 10.1021/acs.jcim.1c00096] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Machine learning scoring functions for protein-ligand binding affinity have been found to consistently outperform classical scoring functions when trained and tested on crystal structures of bound protein-ligand complexes. However, it is less clear how these methods perform when applied to docked poses of complexes. We explore how the use of docked rather than crystallographic poses for both training and testing affects the performance of machine learning scoring functions. Using the PDBbind Core Sets as benchmarks, we show that the performance of a structure-based machine learning scoring function trained and tested on docked poses is lower than that of the same scoring function trained and tested on crystallographic poses. We construct a hybrid scoring function by combining both structure-based and ligand-based features, and show that its ability to predict binding affinity using docked poses is comparable to that of purely structure-based scoring functions trained and tested on crystal poses. We also present a new, freely available validation set─the Updated DUD-E Diverse Subset─for binding affinity prediction using data from DUD-E and ChEMBL. Despite strong performance on docked poses of the PDBbind Core Sets, we find that our hybrid scoring function sometimes generalizes poorly to a protein target not represented in the training set, demonstrating the need for improved scoring functions and additional validation benchmarks.
Collapse
Affiliation(s)
- Fergus Boyles
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
| | - Charlotte M Deane
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
| | - Garrett M Morris
- Department of Statistics, University of Oxford, 24-29 St Giles', Oxford, OX1 3LB, United Kingdom
| |
Collapse
|
26
|
Zhu H, Yang J, Huang N. Assessment of the Generalization Abilities of Machine-Learning Scoring Functions for Structure-Based Virtual Screening. J Chem Inf Model 2022; 62:5485-5502. [PMID: 36268980 DOI: 10.1021/acs.jcim.2c01149] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
In structure-based virtual screening (SBVS), it is critical that scoring functions capture protein-ligand atomic interactions. By focusing on the local domains of ligand binding pockets, a standardized pocket Pfam-based clustering (Pfam-cluster) approach was developed to assess the cross-target generalization ability of machine-learning scoring functions (MLSFs). Subsequently, 12 typical MLSFs were evaluated using random cross-validation (Random-CV), protein sequence similarity-based cross-validation (Seq-CV), and pocket Pfam-based cross-validation (Pfam-CV) methods. Surprisingly, all of the tested models showed decreased performances from Random-CV to Seq-CV to Pfam-CV experiments, not showing satisfactory generalization capacity. Our interpretable analysis suggested that the predictions on novel targets by MLSFs were dependent on buried solvent-accessible surface area (SASA)-related features of complex structures, with greater predicted binding affinities on complexes owning larger protein-ligand interfaces. By combining buried SASA-related features with target-specific patterns that were only shared among structurally similar compounds in the same cluster, the random forest (RF)-Score attained a good performance in the Random-CV test. Based on these findings, we strongly advise assessing the generalization ability of MLSFs with the Pfam-cluster approach and being cautious with the features learned by MLSFs.
Collapse
Affiliation(s)
- Hui Zhu
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China102206, China.,National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing102206, China
| | - Jincai Yang
- National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing102206, China
| | - Niu Huang
- Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China102206, China.,National Institute of Biological Sciences, 7 Science Park Road, Zhongguancun Life Science Park, Beijing102206, China
| |
Collapse
|
27
|
Shen C, Zhang X, Deng Y, Gao J, Wang D, Xu L, Pan P, Hou T, Kang Y. Boosting Protein-Ligand Binding Pose Prediction and Virtual Screening Based on Residue-Atom Distance Likelihood Potential and Graph Transformer. J Med Chem 2022; 65:10691-10706. [PMID: 35917397 DOI: 10.1021/acs.jmedchem.2c00991] [Citation(s) in RCA: 62] [Impact Index Per Article: 20.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
The past few years have witnessed enormous progress toward applying machine learning approaches to the development of protein-ligand scoring functions. However, the robust performance and wide applicability of scoring functions remain a big challenge for increasing the success rate of docking-based virtual screening. Herein, a novel scoring function named RTMScore was developed by introducing a tailored residue-based graph representation strategy and several graph transformer layers for the learning of protein and ligand representations, followed by a mixture density network to obtain residue-atom distance likelihood potential. Our approach was resolutely validated on the CASF-2016 benchmark, and the results indicate that RTMScore can outperform almost all of the other state-of-the-art methods in terms of both the docking and screening powers. Further evaluation confirms the robustness of our approach that can not only retain its docking power on cross-docked poses but also achieve improved performance as a rescoring tool in larger-scale virtual screening.
Collapse
Affiliation(s)
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, China.,CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yafeng Deng
- CarbonSilicon AI Technology Co., Ltd, Hangzhou, Zhejiang 310018, China
| | - Junbo Gao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Dong Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Peichen Pan
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang 310058, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, China
| |
Collapse
|
28
|
Yang C, Chen EA, Zhang Y. Protein-Ligand Docking in the Machine-Learning Era. Molecules 2022; 27:4568. [PMID: 35889440 PMCID: PMC9323102 DOI: 10.3390/molecules27144568] [Citation(s) in RCA: 41] [Impact Index Per Article: 13.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2022] [Accepted: 07/14/2022] [Indexed: 11/16/2022] Open
Abstract
Molecular docking plays a significant role in early-stage drug discovery, from structure-based virtual screening (VS) to hit-to-lead optimization, and its capability and predictive power is critically dependent on the protein-ligand scoring function. In this review, we give a broad overview of recent scoring function development, as well as the docking-based applications in drug discovery. We outline the strategies and resources available for structure-based VS and discuss the assessment and development of classical and machine learning protein-ligand scoring functions. In particular, we highlight the recent progress of machine learning scoring function ranging from descriptor-based models to deep learning approaches. We also discuss the general workflow and docking protocols of structure-based VS, such as structure preparation, binding site detection, docking strategies, and post-docking filter/re-scoring, as well as a case study on the large-scale docking-based VS test on the LIT-PCBA data set.
Collapse
Affiliation(s)
- Chao Yang
- Department of Chemistry, New York University, New York, NY 10003, USA; (C.Y.); (E.A.C.)
| | - Eric Anthony Chen
- Department of Chemistry, New York University, New York, NY 10003, USA; (C.Y.); (E.A.C.)
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, NY 10003, USA; (C.Y.); (E.A.C.)
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
Collapse
|
29
|
Dong L, Qu X, Wang B. XLPFE: A Simple and Effective Machine Learning Scoring Function for Protein-Ligand Scoring and Ranking. ACS OMEGA 2022; 7:21727-21735. [PMID: 35785279 PMCID: PMC9245135 DOI: 10.1021/acsomega.2c01723] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/22/2022] [Accepted: 05/30/2022] [Indexed: 06/15/2023]
Abstract
Prediction of protein-ligand binding affinities is a central issue in structure-based computer-aided drug design. In recent years, much effort has been devoted to the prediction of the binding affinity in protein-ligand complexes using machine learning (ML). Due to the remarkable ability of ML methods in nonlinear fitting, ML-based scoring functions (SFs) can deliver much improved performance on a selected test set, such as the comparative assessment of scoring functions (CASF), when compared to the classical SFs. However, the performance of ML-based SFs heavily relies on the overall similarity of the training set and the test set. To improve the performance and transferability of an SF, we have tried to combine various features including energy terms from X-score and AutoDock Vina, the properties of ligands, and the statistical sequence-related information from either the binding site or the full protein. In conjunction with extreme trees (ET), an ML model, we have developed XLPFE, a new SF. Compared with other tested methods such as X-score, AutoDock Vina, ΔvinaXGB, PSH-ML, or CNN-score, XLPFE achieves consistently better scoring and ranking power for various types of protein-ligand complex structures beyond the CASF, suggesting that XLPFE has superior transferability. In particular, XLPFE performs better with metalloenzymes. With its faster speed, improved accuracy, and better transferability, XLPFE could be usefully applied to a diverse range of protein-ligand complexes.
Collapse
Affiliation(s)
- Lina Dong
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| | - Xiaoyang Qu
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| | - Binju Wang
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| |
Collapse
|
30
|
Meli R, Morris GM, Biggin PC. Scoring Functions for Protein-Ligand Binding Affinity Prediction using Structure-Based Deep Learning: A Review. FRONTIERS IN BIOINFORMATICS 2022; 2:885983. [PMID: 36187180 PMCID: PMC7613667 DOI: 10.3389/fbinf.2022.885983] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Accepted: 05/11/2022] [Indexed: 01/01/2023] Open
Abstract
The rapid and accurate in silico prediction of protein-ligand binding free energies or binding affinities has the potential to transform drug discovery. In recent years, there has been a rapid growth of interest in deep learning methods for the prediction of protein-ligand binding affinities based on the structural information of protein-ligand complexes. These structure-based scoring functions often obtain better results than classical scoring functions when applied within their applicability domain. Here we review structure-based scoring functions for binding affinity prediction based on deep learning, focussing on different types of architectures, featurization strategies, data sets, methods for training and evaluation, and the role of explainable artificial intelligence in building useful models for real drug-discovery applications.
Collapse
Affiliation(s)
- Rocco Meli
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
| | - Garrett M. Morris
- Department of Statistics, University of Oxford, Oxford, United Kingdom
| | - Philip C. Biggin
- Department of Biochemistry, University of Oxford, Oxford, United Kingdom
| |
Collapse
|
31
|
Wang Y, Wei Z, Xi L. Sfcnn: a novel scoring function based on 3D convolutional neural network for accurate and stable protein-ligand affinity prediction. BMC Bioinformatics 2022; 23:222. [PMID: 35676617 PMCID: PMC9178885 DOI: 10.1186/s12859-022-04762-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2021] [Accepted: 06/01/2022] [Indexed: 01/09/2023] Open
Abstract
Background Computer-aided drug design provides an effective method of identifying lead compounds. However, success rates are significantly bottlenecked by the lack of accurate and reliable scoring functions needed to evaluate binding affinities of protein–ligand complexes. Therefore, many scoring functions based on machine learning or deep learning have been developed to improve prediction accuracies in recent years. In this work, we proposed a novel featurization method, generating a new scoring function model based on 3D convolutional neural network. Results This work showed the results from testing four architectures and three featurization methods, and outlined the development of a novel deep 3D convolutional neural network scoring function model. This model simplified feature engineering, and in combination with Grad-CAM made the intermediate layers of the neural network more interpretable. This model was evaluated and compared with other scoring functions on multiple independent datasets. The Pearson correlation coefficients between the predicted binding affinities by our model and the experimental data achieved 0.7928, 0.7946, 0.6758, and 0.6474 on CASF-2016 dataset, CASF-2013 dataset, CSAR_HiQ_NRC_set, and Astex_diverse_set, respectively. Overall, our model performed accurately and stably enough in the scoring power to predict the binding affinity of a protein–ligand complex. Conclusions These results indicate our model is an excellent scoring function, and performs well in scoring power for accurately and stably predicting the protein–ligand affinity. Our model will contribute towards improving the success rate of virtual screening, thus will accelerate the development of potential drugs or novel biologically active lead compounds. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04762-3.
Collapse
Affiliation(s)
- Yu Wang
- Chongqing Key Laboratory of Big Data for Bio Intelligence, Chongqing University of Posts and Tele-Communications, No. 2 Chongwen Road, Nan'an District, Chongqing, 400065, China.
| | - Zhengxiao Wei
- Department of Clinical Laboratory, Public Health Clinical Center of Chengdu, Chengdu, 610095, China
| | - Lei Xi
- Hubei Provincial Key Laboratory of Occurrence and Intervention of Rheumatic Diseases, Hubei Minzu University, Enshi, 445000, China
| |
Collapse
|
32
|
Yang C, Zhang Y. Delta Machine Learning to Improve Scoring-Ranking-Screening Performances of Protein-Ligand Scoring Functions. J Chem Inf Model 2022; 62:2696-2712. [PMID: 35579568 DOI: 10.1021/acs.jcim.2c00485] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023]
Abstract
Protein-ligand scoring functions are widely used in structure-based drug design for fast evaluation of protein-ligand interactions, and it is of strong interest to develop scoring functions with machine-learning approaches. In this work, by expanding the training set, developing physically meaningful features, employing our recently developed linear empirical scoring function Lin_F9 (Yang, C. J. Chem. Inf. Model. 2021, 61, 4630-4644) as the baseline, and applying extreme gradient boosting (XGBoost) with Δ-machine learning, we have further improved the robustness and applicability of machine-learning scoring functions. Besides the top performances for scoring-ranking-screening power tests of the CASF-2016 benchmark, the new scoring function ΔLin_F9XGB also achieves superior scoring and ranking performances in different structure types that mimic real docking applications. The scoring powers of ΔLin_F9XGB for locally optimized poses, flexible redocked poses, and ensemble docked poses of the CASF-2016 core set achieve Pearson's correlation coefficient (R) values of 0.853, 0.839, and 0.813, respectively. In addition, the large-scale docking-based virtual screening test on the LIT-PCBA data set demonstrates the reliability and robustness of ΔLin_F9XGB in virtual screening application. The ΔLin_F9XGB scoring function and its code are freely available on the web at (https://yzhang.hpc.nyu.edu/Delta_LinF9_XGB).
Collapse
Affiliation(s)
- Chao Yang
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
Collapse
|
33
|
Pan X, Wang H, Zhang Y, Wang X, Li C, Ji C, Zhang JZH. AA-Score: a New Scoring Function Based on Amino Acid-Specific Interaction for Molecular Docking. J Chem Inf Model 2022; 62:2499-2509. [PMID: 35452230 DOI: 10.1021/acs.jcim.1c01537] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Abstract
The protein-ligand scoring function plays an important role in computer-aided drug discovery and is heavily used in virtual screening and lead optimization. In this study, we developed a new empirical protein-ligand scoring function with amino acid-specific interaction components for hydrogen bond, van der Waals, and electrostatic interactions. In addition, hydrophobic, π-stacking, π-cation, and metal-ligand interactions are also included in the new scoring function. To better evaluate the performance of the AA-Score, we generated several new test sets for evaluation of scoring, ranking, and docking performances, respectively. Extensive tests show that AA-Score performs well on scoring, docking, and ranking as compared to other widely used traditional scoring functions. The performance improvement of AA-Score benefits from the decomposition of individual interaction into amino acid-specific types. To facilitate applications, we developed an easy-to-use tool to analyze protein-ligand interaction fingerprint and predict binding affinity using the AA-Score. The source code and associated running examples can be found at https://github.com/xundrug/AA-Score-Tool.
Collapse
Affiliation(s)
- Xiaolin Pan
- Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| | - Hao Wang
- Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| | - Yueqing Zhang
- Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| | - Xingyu Wang
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| | - Cuiyu Li
- Advanced Computing East China Sub-center, Suma Technology Co., Ltd., Kunshan 215300, China
| | - Changge Ji
- Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| | - John Z H Zhang
- Shanghai Engineering Research Center of Molecular Therapeutics & New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China.,Faculty of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China.,NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China.,Department of Chemistry, New York University, New York 10003, United States.,Collaborative Innovation Center of Extreme Optics, Shanxi University, Taiyuan Shanxi 030006, China
| |
Collapse
|
34
|
Machine learning-assisted non-destructive plasticizer identification and quantification in historical PVC objects based on IR spectroscopy. Sci Rep 2022; 12:5017. [PMID: 35322097 PMCID: PMC8943100 DOI: 10.1038/s41598-022-08862-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2022] [Accepted: 03/15/2022] [Indexed: 12/23/2022] Open
Abstract
Non-destructive spectroscopic analysis combined with machine learning rapidly provides information on the identity and content of plasticizers in PVC objects of heritage value. For the first time, a large and diverse collection of more than 100 PVC objects in different degradation stages and of diverse chemical compositions was analysed by chromatographic and spectroscopic techniques to create a dataset used to construct classification and regression models. Accounting for this variety makes the model more robust and reliable for the analysis of objects in museum collections. Six different machine learning classification algorithms were compared to determine the algorithm with the highest classification accuracy of the most common plasticizers, based solely on the spectroscopic data. A classification model capable of the identification of di(2-ethylhexyl) phthalate, di(2-ethylhexyl) terephthalate, diisononyl phthalate, diisodecyl phthalate, a mixture of diisononyl phthalate and diisodecyl phthalate, and unplasticized PVC was constructed. Additionally, regression models for quantification of di(2-ethylhexyl) phthalate and di(2-ethylhexyl) terephthalate in PVC were built. This study of real-life objects demonstrates that classification and quantification of plasticizers in a general collection of degraded PVC objects is possible, providing valuable data to collection managers.
Collapse
|
35
|
Zheng L, Meng J, Jiang K, Lan H, Wang Z, Lin M, Li W, Guo H, Wei Y, Mu Y. Improving protein-ligand docking and screening accuracies by incorporating a scoring function correction term. Brief Bioinform 2022; 23:6548372. [PMID: 35289359 PMCID: PMC9116214 DOI: 10.1093/bib/bbac051] [Citation(s) in RCA: 45] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2021] [Revised: 01/30/2022] [Accepted: 01/31/2022] [Indexed: 12/13/2022] Open
Abstract
Scoring functions are important components in molecular docking for structure-based drug discovery. Traditional scoring functions, generally empirical- or force field-based, are robust and have proven to be useful for identifying hits and lead optimizations. Although multiple highly accurate deep learning- or machine learning-based scoring functions have been developed, their direct applications for docking and screening are limited. We describe a novel strategy to develop a reliable protein–ligand scoring function by augmenting the traditional scoring function Vina score using a correction term (OnionNet-SFCT). The correction term is developed based on an AdaBoost random forest model, utilizing multiple layers of contacts formed between protein residues and ligand atoms. In addition to the Vina score, the model considerably enhances the AutoDock Vina prediction abilities for docking and screening tasks based on different benchmarks (such as cross-docking dataset, CASF-2016, DUD-E and DUD-AD). Furthermore, our model could be combined with multiple docking applications to increase pose selection accuracies and screening abilities, indicating its wide usage for structure-based drug discoveries. Furthermore, in a reverse practice, the combined scoring strategy successfully identified multiple known receptors of a plant hormone. To summarize, the results show that the combination of data-driven model (OnionNet-SFCT) and empirical scoring function (Vina score) is a good scoring strategy that could be useful for structure-based drug discoveries and potentially target fishing in future.
Collapse
Affiliation(s)
- Liangzhen Zheng
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China.,Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
| | - Jintao Meng
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China.,National Supercomputer Center in Shenzhen, Shenzhen, 518000, China
| | - Kai Jiang
- Institute of Plant and Food Science, Department of Biology, School of Life Sciences, Southern University of Science and Technology (SUSTech), Shenzhen, Guangdong 518055, China
| | - Haidong Lan
- Tencent AI Lab, Shenzhen, Guangdong 518000, China
| | - Zechen Wang
- School of Physics, Shandong University, Jinan, Shandong 250101, China
| | - Mingzhi Lin
- Shanghai Zelixir Biotech Company Ltd., Shanghai 200030, China
| | - Weifeng Li
- School of Physics, Shandong University, Jinan, Shandong 250101, China
| | - Hongwei Guo
- Institute of Plant and Food Science, Department of Biology, School of Life Sciences, Southern University of Science and Technology (SUSTech), Shenzhen, Guangdong 518055, China
| | - Yanjie Wei
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
| | - Yuguang Mu
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive 637551, Singapore
| |
Collapse
|
36
|
Choudhury C, Arul Murugan N, Deva Priyakumar U. Structure-based drug repurposing: traditional and advanced AI/ML-aided methods. Drug Discov Today 2022; 27:1847-1861. [PMID: 35301148 PMCID: PMC8920090 DOI: 10.1016/j.drudis.2022.03.006] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2021] [Revised: 02/16/2022] [Accepted: 03/10/2022] [Indexed: 02/08/2023]
Abstract
The current global health emergency in the form of the Coronavirus 2019 (COVID-19) pandemic has highlighted the need for fast, accurate, and efficient drug discovery pipelines. Traditional drug discovery projects relying on in vitro high-throughput screening (HTS) involve large investments and sophisticated experimental set-ups, affordable only to big biopharmaceutical companies. In this scenario, application of efficient state-of-the-art computational methods and modern artificial intelligence (AI)-based algorithms for rapid screening of repurposable chemical space [approved drugs and natural products (NPs) with proven pharmacokinetic profiles] to identify the initial leads is a powerful option to save resources and time. Structure-based drug repurposing is a popular in silico repurposing approach. In this review, we discuss traditional and modern AI-based computational methods and tools applied at various stages for structure-based drug discovery (SBDD) pipelines. Additionally, we highlight the role of generative models in generating molecules with scaffolds from repurposable chemical space. Teaser: This review highlights the importance of repurposable chemical space, and the contributions of conventional in silico approaches and modern machine-learning algorithms for rapid structure-based drug repurposing.
Collapse
Affiliation(s)
- Chinmayee Choudhury
- Department of Experimental Medicine and Biotechnology, Postgraduate Institute of Medical Education and Research, Sector-12, Chandigarh 160012, India
| | - N Arul Murugan
- Department of Computer Science, School of Electrical Engineering and Computer Sciences, KTH Royal Institute of Technology, S-100 44, Stockholm, Sweden; Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi 110020, India.
| | - U Deva Priyakumar
- Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500 032, India
| |
Collapse
|
37
|
Dong L, Qu X, Zhao Y, Wang B. Prediction of Binding Free Energy of Protein-Ligand Complexes with a Hybrid Molecular Mechanics/Generalized Born Surface Area and Machine Learning Method. ACS OMEGA 2021; 6:32938-32947. [PMID: 34901645 PMCID: PMC8655939 DOI: 10.1021/acsomega.1c04996] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/09/2021] [Accepted: 11/10/2021] [Indexed: 06/14/2023]
Abstract
Accurate prediction of protein-ligand binding free energies is important in enzyme engineering and drug discovery. The molecular mechanics/generalized Born surface area (MM/GBSA) approach is widely used to estimate ligand-binding affinities, but its performance heavily relies on the accuracy of its energy components. A hybrid strategy combining MM/GBSA and machine learning (ML) has been developed to predict the binding free energies of protein-ligand systems. Based on the MM/GBSA energy terms and several features associated with protein-ligand interactions, our ML-based scoring function, GXLE, shows much better performance than MM/GBSA without entropy. In particular, the good transferability of the GXLE model is highlighted by its good performance in ranking power for prediction of the binding affinity of different ligands for either the docked structures or crystal structures. The GXLE scoring function and its code are freely available and can be used to correct the binding free energies computed by MM/GBSA.
Collapse
Affiliation(s)
- Lina Dong
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
iChEM, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| | - Xiaoyang Qu
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| | - Yuan Zhao
- The
Key Laboratory of Natural Medicine and Immuno-Engineering, Henan University, Kaifeng 475004, P. R.
China
| | - Binju Wang
- State
Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian
Provincial Key Laboratory of Theoretical and Computational Chemistry,
College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 360015, P. R. China
| |
Collapse
|
38
|
Li H, Lu G, Sze KH, Su X, Chan WY, Leung KS. Machine-learning scoring functions trained on complexes dissimilar to the test set already outperform classical counterparts on a blind benchmark. Brief Bioinform 2021; 22:bbab225. [PMID: 34169324 PMCID: PMC8575004 DOI: 10.1093/bib/bbab225] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2021] [Revised: 04/27/2021] [Accepted: 05/23/2021] [Indexed: 11/12/2022] Open
Abstract
The superior performance of machine-learning scoring functions for docking has caused a series of debates on whether it is due to learning knowledge from training data that are similar in some sense to the test data. With a systematically revised methodology and a blind benchmark realistically mimicking the process of prospective prediction of binding affinity, we have evaluated three broadly used classical scoring functions and five machine-learning counterparts calibrated with both random forest and extreme gradient boosting using both solo and hybrid features, showing for the first time that machine-learning scoring functions trained exclusively on a proportion of as low as 8% complexes dissimilar to the test set already outperform classical scoring functions, a percentage that is far lower than what has been recently reported on all the three CASF benchmarks. The performance of machine-learning scoring functions is underestimated due to the absence of similar samples in some artificially created training sets that discard the full spectrum of complexes to be found in a prospective environment. Given the inevitability of any degree of similarity contained in a large dataset, the criteria for scoring function selection depend on which one can make the best use of all available materials. Software code and data are provided at https://github.com/cusdulab/MLSF for interested readers to rapidly rebuild the scoring functions and reproduce our results, even to make extended analyses on their own benchmarks.
Collapse
Affiliation(s)
| | - Gang Lu
- School of Biomedical Sciences, Chinese University of Hong Kong, Hong Kong
| | - Kam-Heung Sze
- Bioinformatics Unit, Hong Kong Medical Technology Institute, Hong Kong
| | - Xianwei Su
- Chinese University of Hong Kong, Hong Kong
| | - Wai-Yee Chan
- CUHK-SDU Joint Laboratory on Reproductive Genetics, School of Biomedical Sciences, Chinese University of Hong Kong, Hong Kong
| | - Kwong-Sak Leung
- Computer Science and Engineering in the Chinese University of Hong Kong, Hong Kong
| |
Collapse
|
39
|
Shen C, Hu X, Gao J, Zhang X, Zhong H, Wang Z, Xu L, Kang Y, Cao D, Hou T. The impact of cross-docked poses on performance of machine learning classifier for protein-ligand binding pose prediction. J Cheminform 2021; 13:81. [PMID: 34656169 PMCID: PMC8520186 DOI: 10.1186/s13321-021-00560-w] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2021] [Accepted: 10/05/2021] [Indexed: 02/06/2023] Open
Abstract
Structure-based drug design depends on the detailed knowledge of the three-dimensional (3D) structures of protein-ligand binding complexes, but accurate prediction of ligand-binding poses is still a major challenge for molecular docking due to deficiency of scoring functions (SFs) and ignorance of protein flexibility upon ligand binding. In this study, based on a cross-docking dataset dedicatedly constructed from the PDBbind database, we developed several XGBoost-trained classifiers to discriminate the near-native binding poses from decoys, and systematically assessed their performance with/without the involvement of the cross-docked poses in the training/test sets. The calculation results illustrate that using Extended Connectivity Interaction Features (ECIF), Vina energy terms and docking pose ranks as the features can achieve the best performance, according to the validation through the random splitting or refined-core splitting and the testing on the re-docked or cross-docked poses. Besides, it is found that, despite the significant decrease of the performance for the threefold clustered cross-validation, the inclusion of the Vina energy terms can effectively ensure the lower limit of the performance of the models and thus improve their generalization capability. Furthermore, our calculation results also highlight the importance of the incorporation of the cross-docked poses into the training of the SFs with wide application domain and high robustness for binding pose prediction. The source code and the newly-developed cross-docking datasets can be freely available at https://github.com/sc8668/ml_pose_prediction and https://zenodo.org/record/5525936 , respectively, under an open-source license. We believe that our study may provide valuable guidance for the development and assessment of new machine learning-based SFs (MLSFs) for the predictions of protein-ligand binding poses.
Collapse
Affiliation(s)
- Chao Shen
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China.,State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Xueping Hu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Junbo Gao
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Xujun Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Haiyang Zhong
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China
| | - Lei Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou, 213001, China
| | - Yu Kang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China.
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, Hunan, 410013, People's Republic of China.
| | - Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China. .,State Key Lab of CAD&CG, Zhejiang University, Hangzhou, Zhejiang, 310058, People's Republic of China.
| |
Collapse
|
40
|
Abstract
Molecular docking is one of the most widely used computational tools in structure-based drug design and is critically dependent on accuracy and robustness of the scoring function. In this work, we introduce a new scoring function Lin_F9, which is a linear combination of nine empirical terms, including a unified metal bond term to specifically describe metal-ligand interactions. Parameters in Lin_F9 are obtained with a multistage fitting protocol using explicit water-included structures. For the CASF-2016 benchmark test set, Lin_F9 achieves the top scoring power among all 34 classical scoring functions for both original crystal poses and locally optimized poses with Pearson correlation coefficients (R) of 0.680 and 0.687, respectively. Meanwhile, in comparison with Vina, Lin_F9 achieves consistently better scoring power and ranking power with various types of protein-ligand complex structures that mimic real docking applications, including end-to-end flexible docking for the CASF-2016 benchmark test set using a single or an ensemble of protein receptor structures, as well as for D3R Grand Challenge (GC4) test sets. Lin_F9 has been implemented in a fork of Smina as an optional built-in scoring function that can be used for docking applications as well as for further improvement of scoring functions and docking protocols. Lin_F9 is accessible through https://yzhang.hpc.nyu.edu/Lin_F9/.
Collapse
Affiliation(s)
- Chao Yang
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
Collapse
|
41
|
Meli R, Anighoro A, Bodkin MJ, Morris GM, Biggin PC. Learning protein-ligand binding affinity with atomic environment vectors. J Cheminform 2021; 13:59. [PMID: 34391475 PMCID: PMC8364054 DOI: 10.1186/s13321-021-00536-w] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2021] [Accepted: 07/21/2021] [Indexed: 12/03/2022] Open
Abstract
Scoring functions for the prediction of protein-ligand binding affinity have seen renewed interest in recent years when novel machine learning and deep learning methods started to consistently outperform classical scoring functions. Here we explore the use of atomic environment vectors (AEVs) and feed-forward neural networks, the building blocks of several neural network potentials, for the prediction of protein-ligand binding affinity. The AEV-based scoring function, which we term AEScore, is shown to perform as well or better than other state-of-the-art scoring functions on binding affinity prediction, with an RMSE of 1.22 pK units and a Pearson’s correlation coefficient of 0.83 for the CASF-2016 benchmark. However, AEScore does not perform as well in docking and virtual screening tasks, for which it has not been explicitly trained. Therefore, we show that the model can be combined with the classical scoring function AutoDock Vina in the context of \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\Delta$$\end{document}Δ-learning, where corrections to the AutoDock Vina scoring function are learned instead of the protein-ligand binding affinity itself. Combined with AutoDock Vina, \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\Delta$$\end{document}Δ-AEScore has an RMSE of 1.32 pK units and a Pearson’s correlation coefficient of 0.80 on the CASF-2016 benchmark, while retaining the docking and screening power of the underlying classical scoring function.
Collapse
Affiliation(s)
- Rocco Meli
- Department of Biochemistry, University of Oxford, Oxford, UK
| | | | | | | | - Philip C Biggin
- Department of Biochemistry, University of Oxford, Oxford, UK.
| |
Collapse
|
42
|
Sanner MF, Dieguez L, Forli S, Lis E. Improving Docking Power for Short Peptides Using Random Forest. J Chem Inf Model 2021; 61:3074-3090. [PMID: 34124893 PMCID: PMC8543977 DOI: 10.1021/acs.jcim.1c00573] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2022]
Abstract
In recent years, therapeutic peptides have gained a lot interest as demonstrated by the 60 peptides approved as drugs in major markets and 150+ peptides currently in clinical trials. However, while small molecule docking is routinely used in rational drug design efforts, docking peptides has proven challenging partly because docking scoring functions, developed and calibrated for small molecules, perform poorly for these molecules. Here, we present random forest classifiers trained to discriminate correctly docked peptides. We show that, for a testing set of 47 protein-peptide complexes, structurally dissimilar from the training set and previously used to benchmark AutoDock Vina's ability to dock short peptides, these random forest classifiers improve docking power from ∼25% for AutoDock scoring functions to an average of ∼70%. These results pave the way for peptide-docking success rates comparable to those of small molecule docking. To develop these classifiers, we compiled the ProptPep37_2021 data set, a curated, high-quality set of 322 crystallographic protein-peptides complexes annotated with structural similarity information. The data set also provides a collection of high-quality putative poses with a range of deviations from the crystallographic pose, providing correct and incorrect poses (i.e., decoys) of the peptide for each entry. The ProptPep37_2021 data set as well as the classifiers presented here are freely available.
Collapse
Affiliation(s)
- Michel F. Sanner
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 93037, USA
| | - Leonard Dieguez
- Koliber Biosciences Inc., 12265 World Trade Drive, Suite G, San Diego, CA 92128, USA
| | - Stefano Forli
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 93037, USA
| | - Ewa Lis
- Koliber Biosciences Inc., 12265 World Trade Drive, Suite G, San Diego, CA 92128, USA
| |
Collapse
|
43
|
Bao J, He X, Zhang JZH. DeepBSP-a Machine Learning Method for Accurate Prediction of Protein-Ligand Docking Structures. J Chem Inf Model 2021; 61:2231-2240. [PMID: 33979150 DOI: 10.1021/acs.jcim.1c00334] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/16/2022]
Abstract
In recent years, machine-learning-based scoring functions have significantly improved the scoring power. However, many of these methods do not perform well in distinguishing the native structure from docked decoy poses due to the lack of decoy structural information in their training data. Here, we developed a machine-learning model, named DeepBSP, that can directly predict the root mean square deviation (rmsd) of a ligand docking pose with reference to its native binding pose. Unlike the binding affinity, the rmsd between the docking poses with reference to their native structures can be straightforwardly determined. By training on a generated data set with 11,925 native complexes and more than 165,000 docked poses, our model shows excellent docking power on our test set and also on the CASF-2016 docking decoy set compared to other major scoring functions. Thus, by combining molecular dockings that generate many poses with the application of DeepBSP, one can more accurately predict the best binding pose that is closest to the native complex structure. This DeepBSP model shall be very useful in picking out poses close to their natives from many poses generated from a dock application.
Collapse
Affiliation(s)
- Jingxiao Bao
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
| | - Xiao He
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China
| | - John Z H Zhang
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China.,NYU-ECNU Center for Computational Chemistry, NYU Shanghai, Shanghai 200062, China.,Department of Chemistry, New York University, New York, New York 10003, United States.,Collaborative Innovation Center of Extreme Optics, Shanxi University, Taiyuan, Shanxi 030006, China
| |
Collapse
|
44
|
Wee J, Xia K. Forman persistent Ricci curvature (FPRC)-based machine learning models for protein-ligand binding affinity prediction. Brief Bioinform 2021; 22:6262241. [PMID: 33940588 DOI: 10.1093/bib/bbab136] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Revised: 03/14/2021] [Accepted: 03/23/2021] [Indexed: 01/01/2023] Open
Abstract
Artificial intelligence (AI) techniques have already been gradually applied to the entire drug design process, from target discovery, lead discovery, lead optimization and preclinical development to the final three phases of clinical trials. Currently, one of the central challenges for AI-based drug design is molecular featurization, which is to identify or design appropriate molecular descriptors or fingerprints. Efficient and transferable molecular descriptors are key to the success of all AI-based drug design models. Here we propose Forman persistent Ricci curvature (FPRC)-based molecular featurization and feature engineering, for the first time. Molecular structures and interactions are modeled as simplicial complexes, which are generalization of graphs to their higher dimensional counterparts. Further, a multiscale representation is achieved through a filtration process, during which a series of nested simplicial complexes at different scales are generated. Forman Ricci curvatures (FRCs) are calculated on the series of simplicial complexes, and the persistence and variation of FRCs during the filtration process is defined as FPRC. Moreover, persistent attributes, which are FPRC-based functions and properties, are employed as molecular descriptors, and combined with machine learning models, in particular, gradient boosting tree (GBT). Our FPRC-GBT models are extensively trained and tested on three most commonly-used datasets, including PDBbind-2007, PDBbind-2013 and PDBbind-2016. It has been found that our results are better than the ones from machine learning models with traditional molecular descriptors.
Collapse
Affiliation(s)
- JunJie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371
| |
Collapse
|
45
|
Taking the leap between analytical chemistry and artificial intelligence: A tutorial review. Anal Chim Acta 2021; 1161:338403. [DOI: 10.1016/j.aca.2021.338403] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2020] [Revised: 03/02/2021] [Accepted: 03/03/2021] [Indexed: 01/01/2023]
|
46
|
Huang DZ, Baber JC, Bahmanyar SS. The challenges of generalizability in artificial intelligence for ADME/Tox endpoint and activity prediction. Expert Opin Drug Discov 2021; 16:1045-1056. [PMID: 33739897 DOI: 10.1080/17460441.2021.1901685] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
INTRODUCTION Artificial intelligence (AI) has seen a massive resurgence in recent years with wide successes in computer vision, natural language processing, and games. The similar creation of robust and accurate AI models for ADME/Tox endpoint and activity prediction would be revolutionary to drug discovery pipelines. There have been numerous demonstrations of successful applications, but a key challenge remains: how generalizable are these predictive models? AREAS COVERED The authors present a summary of current promising components of AI models in the context of early drug discovery where ADME/Tox endpoint and activity prediction is the main driver of the iterative drug design process. Following that is a review of applicability domains and dataset construction considerations which determine generalizability bottlenecks for AI deployment. Further reviewed is the role of promising learning frameworks - multitask, transfer, and meta learning - which leverage auxiliary data to overcome issues of generalizability. EXPERT OPINION The authors conclude that the most promising direction toward integrating reliable and informative AI models into the drug discovery pipeline is a conjunction of learned feature representations, deep learning, and novel learning frameworks. Such a solution would address the sparse and incomplete datasets that are available for key endpoints related to drug discovery.
Collapse
Affiliation(s)
| | - J Christian Baber
- Scientific Informatics, Global Head of Scientific Informatics, Scientific Informatics, Takeda Pharmaceuticals, Cambridge, MA, USA
| | - Sogole Sami Bahmanyar
- Computational Chemistry, Director of Computational Sciences, Computational Chemistry, Takeda Pharmaceuticals, San Diego, USA
| |
Collapse
|
47
|
Liu Q, Wang PS, Zhu C, Gaines BB, Zhu T, Bi J, Song M. OctSurf: Efficient hierarchical voxel-based molecular surface representation for protein-ligand affinity prediction. J Mol Graph Model 2021; 105:107865. [PMID: 33640787 DOI: 10.1016/j.jmgm.2021.107865] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2020] [Revised: 02/03/2021] [Accepted: 02/04/2021] [Indexed: 10/22/2022]
Abstract
Voxel-based 3D convolutional neural networks (CNNs) have been applied to predict protein-ligand binding affinity. However, the memory usage and computation cost of these voxel-based approaches increase cubically with respect to spatial resolution and sometimes make volumetric CNNs intractable at higher resolutions. Therefore, it is necessary to develop memory-efficient alternatives that can accelerate the convolutional operation on 3D volumetric representations of the protein-ligand interaction. In this study, we implement a novel volumetric representation, OctSurf, to characterize the 3D molecular surface of protein binding pockets and bound ligands. The OctSurf surface representation is built based on the octree data structure, which has been widely used in computer graphics to efficiently represent and store 3D object data. Vanilla 3D-CNN approaches often divide the 3D space of objects into equal-sized voxels. In contrast, OctSurf recursively partitions the 3D space containing the protein-ligand pocket into eight subspaces called octants. Only those octants containing van der Waals surface points of protein or ligand atoms undergo the recursive subdivision process until they reach the predefined octree depth, whereas unoccupied octants are kept intact to reduce the memory cost. Resulting non-empty leaf octants approximate molecular surfaces of the protein pocket and bound ligands. These surface octants, along with their chemical and geometric features, are used as the input to 3D-CNNs. Two kinds of CNN architectures, VGG and ResNet, are applied to the OctSurf representation to predict binding affinity. The OctSurf representation consumes much less memory than the conventional voxel representation at the same resolution. By restricting the convolution operation to only octants of the smallest size, our method also alleviates the overall computational overhead of CNN. A series of experiments are performed to demonstrate the disk storage and computational efficiency of the proposed learning method. Our code is available at the following GitHub repository: https://github.uconn.edu/mldrugdiscovery/OctSurf.
Collapse
Affiliation(s)
- Qinqing Liu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06279, USA
| | | | - Chunjiang Zhu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06279, USA
| | - Blake Blumenfeld Gaines
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06279, USA
| | - Tan Zhu
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06279, USA
| | - Jinbo Bi
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06279, USA; Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06279, USA
| | - Minghu Song
- Department of Biomedical Engineering, University of Connecticut, Storrs, CT 06279, USA.
| |
Collapse
|
48
|
Guedes IA, Barreto AMS, Marinho D, Krempser E, Kuenemann MA, Sperandio O, Dardenne LE, Miteva MA. New machine learning and physics-based scoring functions for drug discovery. Sci Rep 2021; 11:3198. [PMID: 33542326 PMCID: PMC7862620 DOI: 10.1038/s41598-021-82410-1] [Citation(s) in RCA: 79] [Impact Index Per Article: 19.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2020] [Accepted: 01/20/2021] [Indexed: 12/11/2022] Open
Abstract
Scoring functions are essential for modern in silico drug discovery. However, the accurate prediction of binding affinity by scoring functions remains a challenging task. The performance of scoring functions is very heterogeneous across different target classes. Scoring functions based on precise physics-based descriptors better representing protein–ligand recognition process are strongly needed. We developed a set of new empirical scoring functions, named DockTScore, by explicitly accounting for physics-based terms combined with machine learning. Target-specific scoring functions were developed for two important drug targets, proteases and protein–protein interactions, representing an original class of molecules for drug discovery. Multiple linear regression (MLR), support vector machine and random forest algorithms were employed to derive general and target-specific scoring functions involving optimized MMFF94S force-field terms, solvation and lipophilic interactions terms, and an improved term accounting for ligand torsional entropy contribution to ligand binding. DockTScore scoring functions demonstrated to be competitive with the current best-evaluated scoring functions in terms of binding energy prediction and ranking on four DUD-E datasets and will be useful for in silico drug design for diverse proteins as well as for specific targets such as proteases and protein–protein interactions. Currently, the MLR DockTScore is available at www.dockthor.lncc.br.
Collapse
Affiliation(s)
- Isabella A Guedes
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil.,Inserm U973, Université Paris Diderot, Paris, France
| | - André M S Barreto
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil
| | - Diogo Marinho
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil
| | | | | | - Olivier Sperandio
- Inserm U973, Université Paris Diderot, Paris, France.,Structural Bioinformatics Unit, CNRS UMR3528, Institut Pasteur, 75015, Paris, France
| | - Laurent E Dardenne
- Laboratório Nacional de Computação Científica, Petrópolis, 25651-075, Brazil.
| | - Maria A Miteva
- Inserm U973, Université Paris Diderot, Paris, France. .,Inserm U1268 "Medicinal Chemistry and Translational Research", CiTCoM, UMR 8038, CNRS, Université de Paris, 75006, Paris, France.
| |
Collapse
|
49
|
Shen C, Weng G, Zhang X, Leung ELH, Yao X, Pang J, Chai X, Li D, Wang E, Cao D, Hou T. Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening? Brief Bioinform 2021; 22:6070382. [PMID: 33418562 DOI: 10.1093/bib/bbaa410] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2020] [Revised: 11/26/2020] [Accepted: 12/12/2020] [Indexed: 12/13/2022] Open
Abstract
Machine-learning (ML)-based scoring functions (MLSFs) have gradually emerged as a promising alternative for protein-ligand binding affinity prediction and structure-based virtual screening. However, clouds of doubts have still been raised against the benefits of this novel type of scoring functions (SFs). In this study, to benchmark the performance of target-specific MLSFs on a relatively unbiased dataset, the MLSFs trained from three representative protein-ligand interaction representations were assessed on the LIT-PCBA dataset, and the classical Glide SP SF and three types of ligand-based quantitative structure-activity relationship (QSAR) models were also utilized for comparison. Two major aspects in virtual screening campaigns, including prediction accuracy and hit novelty, were systematically explored. The calculation results illustrate that the tested target-specific MLSFs yielded generally superior performance over the classical Glide SP SF, but they could hardly outperform the 2D fingerprint-based QSAR models. Although substantial improvements could be achieved by integrating multiple types of protein-ligand interaction features, the MLSFs were still not sufficient to exceed MACCS-based QSAR models. In terms of the correlations between the hit ranks or the structures of the top-ranked hits, the MLSFs developed by different featurization strategies would have the ability to identify quite different hits. Nevertheless, it seems that target-specific MLSFs do not have the intrinsic attributes of a traditional SF and may not be a substitute for classical SFs. In contrast, MLSFs can be regarded as a new derivative of ligand-based QSAR models. It is expected that our study may provide valuable guidance for the assessment and further development of target-specific MLSFs.
Collapse
Affiliation(s)
- Chao Shen
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Gaoqi Weng
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Xujun Zhang
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Elaine Lai-Han Leung
- State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macau, SAR, China
| | - Xiaojun Yao
- State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macau, SAR, China
| | - Jinping Pang
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Xin Chai
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Dan Li
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Ercheng Wang
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Dongsheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
| | - Tingjun Hou
- Hangzhou Institute of Innovative Medicine, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| |
Collapse
|
50
|
Holderbach S, Adam L, Jayaram B, Wade RC, Mukherjee G. RASPD+: Fast Protein-Ligand Binding Free Energy Prediction Using Simplified Physicochemical Features. Front Mol Biosci 2020; 7:601065. [PMID: 33392260 PMCID: PMC7773945 DOI: 10.3389/fmolb.2020.601065] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2020] [Accepted: 11/13/2020] [Indexed: 01/17/2023] Open
Abstract
The virtual screening of large numbers of compounds against target protein binding sites has become an integral component of drug discovery workflows. This screening is often done by computationally docking ligands into a protein binding site of interest, but this has the drawback of a large number of poses that must be evaluated to obtain accurate estimates of protein-ligand binding affinity. We here introduce a fast pre-filtering method for ligand prioritization that is based on a set of machine learning models and uses simple pose-invariant physicochemical descriptors of the ligands and the protein binding pocket. Our method, Rapid Screening with Physicochemical Descriptors + machine learning (RASPD+), is trained on PDBbind data and achieves a regression performance that is better than that of the original RASPD method and traditional scoring functions on a range of different test sets without the need for generating ligand poses. Additionally, we use RASPD+ to identify molecular features important for binding affinity and assess the ability of RASPD+ to enrich active molecules from decoys.
Collapse
Affiliation(s)
- Stefan Holderbach
- Molecular and Cellular Modelling Group, Heidelberg Institute of Theoretical Studies, Heidelberg, Germany
- Institute of Pharmacy and Molecular Biotechnology (IPMB), Heidelberg University, Heidelberg, Germany
| | - Lukas Adam
- Molecular and Cellular Modelling Group, Heidelberg Institute of Theoretical Studies, Heidelberg, Germany
- Institute of Pharmacy and Molecular Biotechnology (IPMB), Heidelberg University, Heidelberg, Germany
| | - B. Jayaram
- Supercomputing Facility for Bioinformatics & Computational Biology, Department of Chemistry, Kusuma School of Biological Sciences, Indian Institute of Technology Delhi, New Delhi, India
| | - Rebecca C. Wade
- Molecular and Cellular Modelling Group, Heidelberg Institute of Theoretical Studies, Heidelberg, Germany
- Center for Molecular Biology (ZMBH), DKFZ-ZMBH Alliance, Heidelberg University, Heidelberg, Germany
- Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany
| | - Goutam Mukherjee
- Molecular and Cellular Modelling Group, Heidelberg Institute of Theoretical Studies, Heidelberg, Germany
- Center for Molecular Biology (ZMBH), DKFZ-ZMBH Alliance, Heidelberg University, Heidelberg, Germany
| |
Collapse
|