1
Zhou H, Skolnick J. Utility of the Morgan Fingerprint in Structure-Based Virtual Ligand Screening. J Phys Chem B 2024; 128:5363-5370. [PMID: 38783525 PMCID: PMC11163432 DOI: 10.1021/acs.jpcb.4c01875] [Received: 03/21/2024] [Revised: 05/10/2024] [Accepted: 05/14/2024] [Indexed: 05/25/2024]
Abstract
In modern drug discovery, virtual ligand screening (VLS) is frequently applied to identify possible hits before experimental testing and refinement because it is cost-effective for large compound libraries. For decades, efforts have been devoted to developing VLS methods with high accuracy. These include the state-of-the-art FINDSITE suite of approaches (FINDSITEcomb2.0, FRAGSITE, and FRAGSITE2) and the meta version FRAGSITEcomb, all of which were developed in our lab. These methods combine ligand homology modeling (LHM), traditional ligand similarity methods, and, more recently, machine learning approaches to rank ligands, and they have proven to be superior to most recent deep learning and large language model-based approaches. Here, we describe further improvements to our previous best methods by combining the Morgan fingerprint (MF) with the originally used PubChem and FP2 fingerprints. We then benchmarked FINDSITEcomb2.0M, FRAGSITEM, FRAGSITE2M, and the composite meta-approach FRAGSITEcombM. On the 102-target DUD-E set, the 1% enrichment factor (EF1%) and area under the precision-recall curve (AUPR) of FRAGSITEcomb increased from 42.0/0.59 to 47.6/0.72. This AUPR of 0.72 is significantly better than the 0.443 of the state-of-the-art deep learning-based method DenseFS. An independent test on the 81-target DEKOIS2.0 set shows that EF1%/AUPR increases from 18.3/0.520 to 23.1/0.683. An ablation investigation shows that the MF contributes most of the improvement in all four approaches. Thus, the MF is a useful addition to structure-based VLS.
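For readers unfamiliar with the EF1% metric quoted in this abstract, the following is a minimal sketch of one common way to compute an enrichment factor: the hit rate among the top-ranked fraction of a screened library divided by the hit rate of the whole library. The function name `enrichment_factor`, the rounding of the top slice, and the input layout are assumptions made for illustration; the abstract does not state the authors' exact implementation, which may differ (for example, in how ties or per-target averaging are handled).

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float],
    labels: Sequence[int],
    fraction: float = 0.01,
) -> float:
    """Enrichment factor at `fraction` (hypothetical helper, not the authors' code).

    scores: predicted score per compound, higher means more likely active.
    labels: 1 for a known active, 0 for a decoy/inactive.
    """
    if len(scores) != len(labels) or len(scores) == 0:
        raise ValueError("scores and labels must be non-empty and equally long")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")

    # Size of the top-ranked slice; at least one compound is always kept.
    n_top = max(1, int(round(n_total * fraction)))

    # Rank compounds from best to worst score and count actives in the slice.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])

    # Hit rate in the slice relative to the hit rate of the whole library.
    return (hits_top / n_top) / (n_actives / n_total)
```

Under this definition a random ranking gives a value near 1, and the maximum attainable value depends on the active-to-decoy ratio of the benchmark set.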
Affiliation(s)
- Hongyi Zhou
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
2
Tian T, Li S, Zhang Z, Chen L, Zou Z, Zhao D, Zeng J. Benchmarking compound activity prediction for real-world drug discovery applications. Commun Chem 2024; 7:127. [PMID: 38834746 DOI: 10.1038/s42004-024-01204-4] [Received: 01/23/2024] [Accepted: 05/16/2024] [Indexed: 06/06/2024] [Open Access]
Abstract
Identifying active compounds for target proteins is fundamental in early drug discovery. Recently, data-driven computational methods have demonstrated promising potential in predicting compound activities. However, a well-designed benchmark for comprehensively evaluating these methods from a practical perspective is lacking. To fill this gap, we propose a Compound Activity benchmark for Real-world Applications (CARA). By carefully distinguishing assay types, designing train-test splitting schemes, and selecting evaluation metrics, CARA can account for the biased distribution of current real-world compound activity data and avoid overestimation of model performance. We observed that although current models can make successful predictions for certain proportions of assays, their performance varied across different assays. In addition, an evaluation of several few-shot training strategies showed that their performance depended on the task type. Overall, we provide a high-quality dataset for developing and evaluating compound activity prediction models, and the analyses in this work may inspire better applications of data-driven models in drug discovery.
Affiliation(s)
- Tingzhong Tian
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Shuya Li
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Ziting Zhang
  - Department of Automation, Tsinghua University, Beijing, China
  - MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, China
- Lin Chen
  - Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
- Ziheng Zou
  - Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
- Dan Zhao
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
  - School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
3
Robson B, Cooper R. Glass Box and Black Box Machine Learning Approaches to Exploit Compositional Descriptors of Molecules in Drug Discovery and Aid the Medicinal Chemist. ChemMedChem 2024:e202400169. [PMID: 38837320 DOI: 10.1002/cmdc.202400169] [Received: 03/03/2024] [Revised: 05/29/2024] [Accepted: 06/03/2024] [Indexed: 06/07/2024]
Abstract
The synthetic medicinal chemist plays a vital role in drug discovery. Today there are AI tools to guide next syntheses, but many are "Black Boxes" (BB). One learns little more than the prediction made. There are now also AI methods emphasizing visibility and "explainability" (thus explainable AI or XAI) that could help when "compositional data" are used, but they often still start from seemingly arbitrary learned weights and lack familiar probabilistic measures based on observation and counting from the outset. If probabilistic methods were used in a complementary way with BB methods and demonstrated comparable predictive power, they would provide guidelines about what groups to include and avoid in next syntheses and quantify the relationships in probabilistic terms. These points are demonstrated by blind test comparison of two main types of BB methods and a probabilistic "Glass Box" (GB) method new outside of medicine, but which appears well suited to the above. Because many probabilities can be involved, emphasis is on the predictive power of its simplest explanatory models. There are usually more inactive compounds by orders of magnitude, often a problem for machine learning methods. However, the approaches used here appear to work well for such "real world data".
Affiliation(s)
- Barry Robson
  - Ingine Inc., 2723 Rocklyn Road, Cleveland, OH 44122, USA
  - The Dirac Foundation, c/o The Academy Partnership Ltd., Windrush Park, Witney, OX2929, UK
- Richard Cooper
  - Oxford Drug Design, Oxford Centre for Innovation, New Rd, Oxford, OX1 3TA, UK
  - Department of Chemistry, 12 Mansfield Road, Oxford, OX1 1BY, UK
4
Moshawih S, Bu ZH, Goh HP, Kifli N, Lee LH, Goh KW, Ming LC. Consensus holistic virtual screening for drug discovery: a novel machine learning model approach. J Cheminform 2024; 16:62. [PMID: 38807196 PMCID: PMC11134635 DOI: 10.1186/s13321-024-00855-8] [Received: 12/03/2023] [Accepted: 05/10/2024] [Indexed: 05/30/2024] [Open Access]
Abstract
In drug discovery, virtual screening is crucial for identifying potential hit compounds. This study aims to present a novel pipeline that employs machine learning models to amalgamate various conventional screening methods. A diverse array of protein targets was selected, and their corresponding datasets were subjected to active/decoy distribution analysis prior to scoring using four distinct methods: QSAR, pharmacophore, docking, and 2D shape similarity, which were ultimately integrated into a single consensus score. The fine-tuned machine learning models were ranked using the novel formula "w_new", consensus scores were calculated, and an enrichment study was performed for each target. Distinctively, consensus scoring outperformed other methods for specific protein targets such as PPARG and DPP4, achieving AUC values of 0.90 and 0.84, respectively. Remarkably, this approach consistently prioritized compounds with higher experimental pIC50 values compared to all other screening methodologies. Moreover, the models demonstrated moderate to high performance in terms of R2 values during external validation. In conclusion, this novel workflow consistently delivered superior results, emphasizing the significance of a holistic approach in drug discovery, where both quantitative metrics and active enrichment play pivotal roles in identifying the best virtual screening methodology. SCIENTIFIC CONTRIBUTION: We presented a novel consensus scoring workflow in virtual screening, merging diverse methods for enhanced compound selection. We also introduced 'w_new', a groundbreaking metric that intricately refines machine learning model rankings by weighing various model-specific parameters, revolutionizing their efficacy in drug discovery in addition to other domains.
Affiliation(s)
- Said Moshawih
  - PAPRSB Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
  - Faculty of Data Science and Information Technology, INTI International University, Nilai, Malaysia
- Zhen Hui Bu
  - Faculty of Computing and Engineering, Quest International University, Ipoh, Malaysia
- Hui Poh Goh
  - PAPRSB Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
- Nurolaini Kifli
  - PAPRSB Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
- Lam Hong Lee
  - Faculty of Computing and Engineering, Quest International University, Ipoh, Malaysia
- Khang Wen Goh
  - Faculty of Data Science and Information Technology, INTI International University, Nilai, Malaysia
- Long Chiau Ming
  - PAPRSB Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
  - School of Medical and Life Sciences, Sunway University, Sunway City, Malaysia
5
Orsi M, Reymond JL. One chiral fingerprint to find them all. J Cheminform 2024; 16:53. [PMID: 38741153 DOI: 10.1186/s13321-024-00849-6] [Received: 12/18/2023] [Accepted: 04/28/2024] [Indexed: 05/16/2024] [Open Access]
Abstract
Molecular fingerprints are indispensable tools in cheminformatics. However, stereochemistry is generally not considered, which is problematic for large molecules which are almost all chiral. Herein we report MAP4C, a chiral version of our previously reported fingerprint MAP4, which lists MinHashes computed from character strings containing the SMILES of all pairs of circular substructures up to a diameter of four bonds and the shortest topological distance between their central atoms. MAP4C includes the Cahn-Ingold-Prelog (CIP) annotation (R, S, r or s) whenever the chiral atom is the center of a circular substructure, a question mark for undefined stereocenters, and double bond cis-trans information if specified. MAP4C performs slightly better than the achiral MAP4, ECFP and AP fingerprints in non-stereoselective virtual screening benchmarks. Furthermore, MAP4C distinguishes between stereoisomers in chiral molecules from small molecule drugs to large natural products and peptides comprising thousands of diastereomers, with a degree of distinction smaller than between structural isomers and proportional to the number of chirality changes. Due to its excellent performance across diverse molecular classes and its ability to handle stereochemistry, MAP4C is recommended as a generally applicable chiral molecular fingerprint. SCIENTIFIC CONTRIBUTION: The ability of our chiral fingerprint MAP4C to handle stereoisomers from small molecules to large natural products and peptides is unprecedented and opens the way for cheminformatics to include stereochemistry as an important molecular parameter across all fields of molecular design.
Affiliation(s)
- Markus Orsi
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
- Jean-Louis Reymond
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
6
Kumar N, Acharya V. Advances in machine intelligence-driven virtual screening approaches for big-data. Med Res Rev 2024; 44:939-974. [PMID: 38129992 DOI: 10.1002/med.21995] [Received: 09/12/2022] [Revised: 07/15/2023] [Accepted: 10/29/2023] [Indexed: 12/23/2023]
Abstract
Virtual screening (VS) is an integral and ever-evolving part of the drug discovery framework. VS is traditionally classified into ligand-based (LB) and structure-based (SB) approaches. Machine intelligence, or artificial intelligence, has wide applications in the drug discovery domain to reduce time and resource consumption. In combination with machine intelligence algorithms, VS has emerged as a revolutionarily progressive technology that learns within robust decision orders for data curation and hit molecule screening from large VS libraries in minutes or hours. The exponential growth of chemical and biological data, now available as "big-data" in the public domain, demands modern and advanced machine intelligence-driven VS approaches to screen hit molecules from ultra-large VS libraries. VS has evolved from individual approaches (LB and SB) to integrated LB and SB techniques that explore various ligand and target protein aspects to improve the rate of appropriate hit molecule prediction. Current trends demand advanced and intelligent solutions to handle enormous data in the drug discovery domain for screening and optimizing hits or leads with few or no false positive hits. Following the big-data drift and the tremendous growth in computational architecture, we present this review. Here, the article categorizes and emphasizes individual VS techniques, details the literature on machine learning implementation, covers modern machine intelligence approaches and their limitations, and deliberates on future prospects.
Affiliation(s)
- Neeraj Kumar
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
- Vishal Acharya
  - Artificial Intelligence for Computational Biology Lab (AICoB), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology, Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research, Ghaziabad, India
7
Yao S, Song J, Jia L, Cheng L, Zhong Z, Song M, Feng Z. Fast and effective molecular property prediction with transferability map. Commun Chem 2024; 7:85. [PMID: 38632308 PMCID: PMC11024153 DOI: 10.1038/s42004-024-01169-4] [Received: 11/05/2023] [Accepted: 04/05/2024] [Indexed: 04/19/2024] [Open Access]
Abstract
Effective transfer learning for molecular property prediction has shown considerable strength in addressing insufficient labeled molecules. Many existing methods either disregard the quantitative relationship between source and target properties, risking negative transfer, or require intensive training on target tasks. To quantify transferability concerning task-relatedness, we propose Principal Gradient-based Measurement (PGM) for transferring molecular property prediction ability. First, we design an optimization-free scheme to calculate a principal gradient for approximating the direction of model optimization on a molecular property prediction dataset. We have analyzed the close connection between the principal gradient and model optimization through mathematical proof. PGM measures the transferability as the distance between the principal gradient obtained from the source dataset and that derived from the target dataset. Then, we perform PGM on various molecular property prediction datasets to build a quantitative transferability map for source dataset selection. Finally, we evaluate PGM on multiple combinations of transfer learning tasks across 12 benchmark molecular property prediction datasets and demonstrate that it can serve as fast and effective guidance to improve the performance of a target task. This work contributes to more efficient discovery of drugs, materials, and catalysts by offering a task-relatedness quantification prior to transfer learning and understanding the relationship between chemical properties.
Affiliation(s)
- Shaolun Yao
  - Collaborative Innovation Center of Artificial Intelligence by MOE and Zhejiang Provincial Government, Zhejiang University, 310027, Hangzhou, China
  - College of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
  - Shanghai Institute for Advanced Study of Zhejiang University, 201203, Shanghai, China
- Jie Song
  - Shanghai Institute for Advanced Study of Zhejiang University, 201203, Shanghai, China
  - School of Software Technology, Zhejiang University, 315048, Ningbo, China
- Lingxiang Jia
  - College of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
- Lechao Cheng
  - School of Computer Science and Information Engineering, Hefei University of Technology, 230009, Hefei, China
- Zipeng Zhong
  - College of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
- Mingli Song
  - College of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
  - Shanghai Institute for Advanced Study of Zhejiang University, 201203, Shanghai, China
- Zunlei Feng
  - Shanghai Institute for Advanced Study of Zhejiang University, 201203, Shanghai, China
  - School of Software Technology, Zhejiang University, 315048, Ningbo, China
8
Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA. Effectiveness of molecular fingerprints for exploring the chemical space of natural products. J Cheminform 2024; 16:35. [PMID: 38528548 DOI: 10.1186/s13321-024-00830-3] [Received: 12/20/2023] [Accepted: 03/17/2024] [Indexed: 03/27/2024] [Open Access]
Abstract
Natural products are a diverse class of compounds with promising biological properties, such as high potency and excellent selectivity. However, they have different structural motifs than typical drug-like compounds, e.g., a wider range of molecular weight, multiple stereocenters, and a higher fraction of sp3-hybridized carbons. This makes the encoding of natural products via molecular fingerprints difficult, thus restricting their use in cheminformatics studies. To tackle this issue, we explored over 30 years of research to systematically evaluate which molecular fingerprint provides the best performance on the natural product chemical space. We considered 20 molecular fingerprints from four different sources, which we then benchmarked on over 100,000 unique natural products from the COCONUT (COlleCtion of Open Natural prodUcTs) and CMNPD (Comprehensive Marine Natural Products Database) databases. Our analysis focused on the correlation between different fingerprints and their classification performance on 12 bioactivity prediction datasets. Our results show that different encodings can provide fundamentally different views of the natural product chemical space, leading to substantial differences in pairwise similarity and performance. While Extended Connectivity Fingerprints are the de facto option for encoding drug-like compounds, other fingerprints were found to match or outperform them for bioactivity prediction of natural products. These results highlight the need to evaluate multiple fingerprinting algorithms for optimal performance and suggest new areas of research. Finally, we provide an open-source Python package for computing all molecular fingerprints considered in the study, as well as the data and scripts necessary to reproduce the results, at https://github.com/dahvida/NP_Fingerprints.
Affiliation(s)
- Davide Boldini
  - TUM School of Natural Sciences, Department of Bioscience, Technical University of Munich, Center for Functional Protein Assemblies (CPA), 85748, Garching bei München, Germany
- Davide Ballabio
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Viviana Consonni
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Roberto Todeschini
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Francesca Grisoni
  - Institute for Complex Molecular Systems and Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, Netherlands
- Stephan A Sieber
  - TUM School of Natural Sciences, Department of Bioscience, Technical University of Munich, Center for Functional Protein Assemblies (CPA), 85748, Garching bei München, Germany
9
Shen T, Li S, Wang XS, Wang D, Wu S, Xia J, Zhang L. Deep reinforcement learning enables better bias control in benchmark for virtual screening. Comput Biol Med 2024; 171:108165. [PMID: 38402838 DOI: 10.1016/j.compbiomed.2024.108165] [Received: 11/01/2023] [Revised: 02/07/2024] [Accepted: 02/14/2024] [Indexed: 02/27/2024]
Abstract
Virtual screening (VS) has been incorporated into the paradigm of modern drug discovery. This field is now undergoing a new wave of revolution driven by artificial intelligence and, more specifically, machine learning (ML). The out-of-the-box datasets available for model training or benchmarking are limited in data volume and applicability domain, and they suffer from the biases constantly reported in ML applications. To address these issues, we present a novel benchmark named MUBDsyn. The use of synthetic decoys (i.e., presumed inactives) is the main feature of MUBDsyn, where deep reinforcement learning was leveraged for bias control during decoy generation. We then carried out extensive validations on this new benchmark. First, we confirmed that MUBDsyn was superior to the classical benchmarks in the control of domain bias, artificial enrichment bias, and analogue bias. Moreover, we found that the assessment of ML models based on MUBDsyn was less biased, as revealed by the analysis of asymmetric validation embedding bias. In addition, MUBDsyn posed a better benchmarking challenge for deep learning models than NRLiSt-BDB. Overall, we have shown that MUBDsyn is a close-to-ideal benchmark for VS. The computational tool is publicly available for the easy extension of MUBDsyn.
Affiliation(s)
- Tao Shen
  - State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Shan Li
  - College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing, 211106, China
- Xiang Simon Wang
  - Artificial Intelligence and Drug Discovery Core Laboratory for District of Columbia Center for AIDS Research (DC CFAR), Department of Pharmaceutical Sciences, College of Pharmacy, Howard University, USA
- Dongmei Wang
  - State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Song Wu
  - State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Jie Xia
  - State Key Laboratory of Bioactive Substance and Function of Natural Medicines, Institute of Materia Medica, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100050, China
- Liangren Zhang
  - State Key Laboratory of Natural and Biomimetic Drugs, School of Pharmaceutical Sciences, Peking University, Beijing, 100191, China
10
Zhang L, Li M, Zhang D, Zhang S, Zhang L, Wang X, Qian Z. Developmental neurotoxicity (DNT) QSAR combination prediction model establishment and structural characteristics interpretation. Toxicol Res (Camb) 2024; 13:tfad116. [PMID: 38178999 PMCID: PMC10762666 DOI: 10.1093/toxres/tfad116] [Received: 11/06/2023] [Revised: 09/14/2023] [Accepted: 11/08/2023] [Indexed: 01/06/2024] [Open Access]
Abstract
With the incidence of neurodevelopmental disorders on the rise, it is imperative to screen and evaluate developmental neurotoxicity (DNT) compounds from a large number of environmental chemicals and to understand their mechanisms. In this study, a DNT qualitative structure-activity relationship (QSAR) study was carried out for the first time based on DNT data from mammals, and the structural characteristics of DNT compounds were preliminarily illustrated. Five different classification algorithms and two feature selection methods were used to construct prediction models. The best model had good predictive ability on the external test set but a small application domain (AD). By combining three different models, both the MCC and AD values were improved. Furthermore, electronic properties, van der Waals volume-related properties, and S-, Cl-, or P-containing substructures were found to be associated with DNT through analysis of the modeling descriptors and identification of structural alerts (SAs). This study lays a foundation for further DNT prediction of environmental exposures in humans and contributes to the understanding of DNT mechanisms.
Affiliation(s)
- Lu Zhang
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Min Li
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Dalong Zhang
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Shujing Zhang
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Li Zhang
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Xiaojun Wang
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
- Zhiyong Qian
  - Department of Toxicology, Tianjin Centers for Disease Control and Prevention, Tianjin 300011, China
11
Zhou H, Skolnick J. FRAGSITE2: A structure and fragment-based approach for virtual ligand screening. Protein Sci 2024; 33:e4869. [PMID: 38100293 PMCID: PMC10751727 DOI: 10.1002/pro.4869] [Received: 08/17/2023] [Revised: 12/06/2023] [Accepted: 12/09/2023] [Indexed: 12/17/2023]
Abstract
Protein function annotation and drug discovery often involve finding small molecule binders. In the early stages of drug discovery, virtual ligand screening (VLS) is frequently applied to identify possible hits before experimental testing. While our recent ligand homology modeling (LHM)-machine learning VLS method FRAGSITE outperformed approaches that combined traditional docking to generate protein-ligand poses and deep learning scoring functions to rank ligands, a more robust approach that could identify a more diverse set of binding ligands is needed. Here, we describe FRAGSITE2, which shows significant improvement on protein targets that lack known small molecule binders and have no confident LHM-identified template ligands when benchmarked on two commonly used VLS datasets: for both the DUD-E set and the DEKOIS2.0 set, and for ligands having a Tanimoto coefficient (TC) < 0.7 to the template ligands, the 1% enrichment factor (EF1%) of FRAGSITE2 is significantly better than that of FINDSITEcomb2.0, an earlier LHM algorithm. For the DUD-E set, FRAGSITE2 also shows a better ROC enrichment factor and AUPR (area under the precision-recall curve) than the deep learning DenseFS scoring function. Compared with RF-score-VS on the 76-target subset of DEKOIS2.0, with a TC < 0.99 to training DUD-E ligands, FRAGSITE2 has double the EF1%. Its boosted tree regression method provides more robust performance than a deep learning multiple layer perceptron method. When compared with a pretrained language model for protein target features, FRAGSITE2 also shows much better performance. Thus, FRAGSITE2 is a promising approach that can discover novel hits for protein targets. FRAGSITE2's web service is freely available to academic users at http://sites.gatech.edu/cssb/FRAGSITE2.
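The Tanimoto coefficient (TC) cutoffs mentioned in this abstract can be illustrated with a minimal sketch. It assumes each binary fingerprint is represented as the set of its on-bit indices; the helper names `tanimoto` and `filter_dissimilar` and the input layout are hypothetical and are not taken from FRAGSITE2, whose fingerprint type and filtering code are not described in the abstract.

```python
from typing import Dict, Iterable, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: define the similarity as 0 to avoid 0/0.
        return 0.0
    return len(a & b) / union


def filter_dissimilar(
    ligands: Dict[str, Set[int]],
    templates: Iterable[Set[int]],
    threshold: float = 0.7,
) -> List[str]:
    """Keep ligand names whose highest TC to any template is below `threshold`."""
    template_list = list(templates)
    kept = []
    for name, fingerprint in ligands.items():
        # Similarity to the closest template decides whether the ligand is kept.
        best = max((tanimoto(fingerprint, t) for t in template_list), default=0.0)
        if best < threshold:
            kept.append(name)
    return kept
```

With a cutoff of 0.7, a ligand sharing three of four on-bits with a template (TC = 0.75) would be excluded, while a ligand sharing no bits would be retained.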
Affiliation(s)
- Hongyi Zhou
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
12
Paykan Heyrati M, Ghorbanali Z, Akbari M, Pishgahi G, Zare-Mirakabad F. BioAct-Het: A Heterogeneous Siamese Neural Network for Bioactivity Prediction Using Novel Bioactivity Representation. ACS Omega 2023; 8:44757-44772. [PMID: 38046344 PMCID: PMC10688196 DOI: 10.1021/acsomega.3c05778] [Received: 08/07/2023] [Revised: 10/13/2023] [Accepted: 10/24/2023] [Indexed: 12/05/2023]
Abstract
Drug failure during experimental procedures due to low bioactivity presents a significant challenge. To mitigate this risk and enhance compound bioactivities, predicting bioactivity classes during lead optimization is essential. The existing studies on structure-activity relationships have highlighted the connection between the chemical structures of compounds and their bioactivity. However, these studies often overlook the intricate relationship between drugs and bioactivity, which encompasses multiple factors beyond the chemical structure alone. To address this issue, we propose the BioAct-Het model, employing a heterogeneous siamese neural network to model the complex relationship between drugs and bioactivity classes, bringing them into a unified latent space. In particular, we introduce a novel representation for the bioactivity classes, called Bio-Prof, and enhance the original bioactivity data sets to tackle data scarcity. These innovative approaches resulted in our model outperforming the previous ones. The evaluation of BioAct-Het is conducted through three distinct strategies: association-based, bioactivity class-based, and compound-based. The association-based strategy utilizes supervised learning classification, while the bioactivity class-based strategy adopts a retrospective study evaluation approach. On the other hand, the compound-based strategy demonstrates similarities to the concept of meta-learning. Furthermore, the model's effectiveness in addressing real-world problems is analyzed through a case study on the application of vancomycin and oseltamivir for COVID-19 treatment as well as molnupiravir's potential efficacy in treating COVID-19 patients. The data and code underlying this article are available on https://github.com/CBRC-lab/BioAct-Het. However, data sets were derived from sources in the public domain.
Affiliation(s)
- Mehdi Paykan Heyrati
- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran 1591634311, Iran
| | - Zahra Ghorbanali
- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran 1591634311, Iran
| | - Mohammad Akbari
- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran 1591634311, Iran
| | - Ghasem Pishgahi
- Students’ Scientific Research Center (SSRC), Tehran University of Medical Sciences, Tehran 1416753955, Iran
| | - Fatemeh Zare-Mirakabad
- Computational Biology Research Center (CBRC), Department of Mathematics and Computer Science, Amirkabir University of Technology, Tehran 1591634311, Iran
| |
|
13
|
Xu F, Yang Z, Wang L, Meng D, Long J. MESPool: Molecular Edge Shrinkage Pooling for hierarchical molecular representation learning and property prediction. Brief Bioinform 2023; 25:bbad423. [PMID: 38048081 PMCID: PMC10753536 DOI: 10.1093/bib/bbad423] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Revised: 09/18/2023] [Accepted: 10/29/2023] [Indexed: 12/05/2023] Open
Abstract
Identifying task-relevant structures is important for molecular property prediction. In a graph neural network (GNN), graph pooling can group nodes and hierarchically represent the molecular graph. However, previous pooling methods either drop out node information or lose the connection of the original graph; therefore, it is difficult to identify continuous substructures. Importantly, they lacked interpretability on molecular graphs. To this end, we proposed a novel Molecular Edge Shrinkage Pooling (MESPool) method, which is based on edges (or chemical bonds). MESPool preserves crucial edges and shrinks others inside the functional groups and is able to search for key structures without breaking the original connection. We compared MESPool with various well-known pooling methods on different benchmarks and showed that MESPool outperforms the previous methods. Furthermore, we explained the rationality of MESPool on some datasets, including a COVID-19 drug dataset.
Affiliation(s)
- Fanding Xu
- School of Life Science and Technology, Xi’an Jiaotong University, 710049 Shaanxi, China
| | - Zhiwei Yang
- School of Physics, Xi’an Jiaotong University, 710049 Shaanxi, China
| | - Lizhuo Wang
- School of Life Science and Technology, Xi’an Jiaotong University, 710049 Shaanxi, China
| | - Deyu Meng
- Research Institute for Mathematics and Mathematical Technology, Xi’an Jiaotong University, 710049 Shaanxi, China
- School of Mathematics and Statistics, Henan University, 475004 Henan, China
| | - Jiangang Long
- School of Life Science and Technology, Xi’an Jiaotong University, 710049 Shaanxi, China
| |
|
14
|
Shen C, Luo J, Xia K. Molecular geometric deep learning. CELL REPORTS METHODS 2023; 3:100621. [PMID: 37875121 PMCID: PMC10694498 DOI: 10.1016/j.crmeth.2023.100621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/19/2023] [Revised: 06/16/2023] [Accepted: 09/28/2023] [Indexed: 10/26/2023]
Abstract
Molecular representation learning plays an important role in molecular property prediction. Existing molecular property prediction models rely on the de facto standard of covalent-bond-based molecular graphs for representing molecular topology at the atomic level and totally ignore the non-covalent interactions within the molecule. In this study, we propose a molecular geometric deep learning model to predict the properties of molecules that aims to comprehensively consider the information of covalent and non-covalent interactions of molecules. The essential idea is to incorporate a more general molecular representation into geometric deep learning (GDL) models. We systematically test molecular GDL (Mol-GDL) on fourteen commonly used benchmark datasets. The results show that Mol-GDL can achieve a better performance than state-of-the-art (SOTA) methods. Extensive tests have demonstrated the important role of non-covalent interactions in molecular property prediction and the effectiveness of Mol-GDL models.
Affiliation(s)
- Cong Shen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China; School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China.
| | - Kelin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore.
| |
|
15
|
Kwon H, Ali ZA, Wong BM. Harnessing Semi-Supervised Machine Learning to Automatically Predict Bioactivities of Per- and Polyfluoroalkyl Substances (PFASs). ENVIRONMENTAL SCIENCE & TECHNOLOGY LETTERS 2023; 10:1017-1022. [PMID: 38025956 PMCID: PMC10653214 DOI: 10.1021/acs.estlett.2c00530] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/28/2022] [Accepted: 08/23/2022] [Indexed: 12/01/2023]
Abstract
Many per- and polyfluoroalkyl substances (PFASs) pose significant health hazards due to their bioactive and persistent bioaccumulative properties. However, assessing the bioactivities of PFASs is both time-consuming and costly due to the sheer number and expense of in vivo and in vitro biological experiments. To this end, we harnessed new unsupervised/semi-supervised machine learning models to automatically predict bioactivities of PFASs in various human biological targets, including enzymes, genes, proteins, and cell lines. Our semi-supervised metric learning models were used to predict the bioactivity of PFASs found in the recent Organisation for Economic Co-operation and Development (OECD) report list, which contains 4730 PFASs used in a broad range of industries and consumer products. Our work provides the first semi-supervised machine learning study of structure-activity relationships for predicting possible bioactivities in a variety of PFAS species.
Affiliation(s)
- Hyuna Kwon
- Department of Chemical & Environmental Engineering, University of California-Riverside, Riverside, California 92521, United States
| | - Zulfikhar A. Ali
- Department of Physics & Astronomy, University of California-Riverside, Riverside, California 92521, United States
| | - Bryan M. Wong
- Department of Chemical & Environmental Engineering, University of California-Riverside, Riverside, California 92521, United States
- Department of Physics & Astronomy, University of California-Riverside, Riverside, California 92521, United States
| |
|
16
|
Libouban PY, Aci-Sèche S, Gómez-Tamayo JC, Tresadern G, Bonnet P. The Impact of Data on Structure-Based Binding Affinity Predictions Using Deep Neural Networks. Int J Mol Sci 2023; 24:16120. [PMID: 38003312 PMCID: PMC10671244 DOI: 10.3390/ijms242216120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 10/30/2023] [Accepted: 11/01/2023] [Indexed: 11/26/2023] Open
Abstract
Artificial intelligence (AI) has gained significant traction in the field of drug discovery, with deep learning (DL) algorithms playing a crucial role in predicting protein-ligand binding affinities. Despite advancements in neural network architectures, system representation, and training techniques, the performance of DL affinity prediction has reached a plateau, prompting the question of whether it is truly solved or if the current performance is overly optimistic and reliant on biased, easily predictable data. Like other DL-related problems, this issue seems to stem from the training and test sets used when building the models. In this work, we investigate the impact of several parameters related to the input data on the performance of neural network affinity prediction models. Notably, we identify the size of the binding pocket as a critical factor influencing the performance of our statistical models; furthermore, it is more important to train a model with as much data as possible than to restrict the training to only high-quality datasets. Finally, we also confirm the bias in the typically used current test sets. Therefore, several types of evaluation and benchmarking are required to understand models' decision-making processes and accurately compare the performance of models.
Affiliation(s)
- Pierre-Yves Libouban
- Institute of Organic and Analytical Chemistry (ICOA), UMR7311, Université d’Orléans, CNRS, Pôle de Chimie rue de Chartres, 45067 Orléans, CEDEX 2, France; (P.-Y.L.); (S.A.-S.)
| | - Samia Aci-Sèche
- Institute of Organic and Analytical Chemistry (ICOA), UMR7311, Université d’Orléans, CNRS, Pôle de Chimie rue de Chartres, 45067 Orléans, CEDEX 2, France; (P.-Y.L.); (S.A.-S.)
| | - Jose Carlos Gómez-Tamayo
- Computational Chemistry, Janssen Research & Development, Janssen Pharmaceutica N. V., B-2340 Beerse, Belgium; (J.C.G.-T.); (G.T.)
| | - Gary Tresadern
- Computational Chemistry, Janssen Research & Development, Janssen Pharmaceutica N. V., B-2340 Beerse, Belgium; (J.C.G.-T.); (G.T.)
| | - Pascal Bonnet
- Institute of Organic and Analytical Chemistry (ICOA), UMR7311, Université d’Orléans, CNRS, Pôle de Chimie rue de Chartres, 45067 Orléans, CEDEX 2, France; (P.-Y.L.); (S.A.-S.)
| |
|
17
|
Tran-Nguyen VK, Junaid M, Simeon S, Ballester PJ. A practical guide to machine-learning scoring for structure-based virtual screening. Nat Protoc 2023; 18:3460-3511. [PMID: 37845361 DOI: 10.1038/s41596-023-00885-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Accepted: 07/03/2023] [Indexed: 10/18/2023]
Abstract
Structure-based virtual screening (SBVS) via docking has been used to discover active molecules for a range of therapeutic targets. Chemical and protein data sets that contain integrated bioactivity information have increased both in number and in size. Artificial intelligence and, more concretely, its machine-learning (ML) branch, including deep learning, have effectively exploited these data sets to build scoring functions (SFs) for SBVS against targets with an atomic-resolution 3D model (e.g., generated by X-ray crystallography or predicted by AlphaFold2). Often outperforming their generic and non-ML counterparts, target-specific ML-based SFs represent the state of the art for SBVS. Here, we present a comprehensive and user-friendly protocol to build and rigorously evaluate these new SFs for SBVS. This protocol is organized into four sections: (i) using a public benchmark of a given target to evaluate an existing generic SF; (ii) preparing experimental data for a target from public repositories; (iii) partitioning data into a training set and a test set for subsequent target-specific ML modeling; and (iv) generating and evaluating target-specific ML SFs by using the prepared training-test partitions. All necessary code and input/output data related to three example targets (acetylcholinesterase, HMG-CoA reductase, and peroxisome proliferator-activated receptor-α) are available at https://github.com/vktrannguyen/MLSF-protocol , can be run by using a single computer within 1 week and make use of easily accessible software/programs (e.g., Smina, CNN-Score, RF-Score-VS and DeepCoy) and web resources. Our aim is to provide practical guidance on how to augment training data to enhance SBVS performance, how to identify the most suitable supervised learning algorithm for a data set, and how to build an SF with the highest likelihood of discovering target-active molecules within a given compound library.
Affiliation(s)
| | - Muhammad Junaid
- Centre de Recherche en Cancérologie de Marseille, Marseille, France
| | - Saw Simeon
- Centre de Recherche en Cancérologie de Marseille, Marseille, France
| | | |
|
18
|
Beckers M, Sturm N, Sirockin F, Fechner N, Stiefl N. Prediction of Small-Molecule Developability Using Large-Scale In Silico ADMET Models. J Med Chem 2023; 66:14047-14060. [PMID: 37815201 DOI: 10.1021/acs.jmedchem.3c01083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/11/2023]
Abstract
Early in silico assessment of the potential of a series of compounds to deliver a drug is one of the major challenges in computer-assisted drug design. The goal is to identify the right chemical series of compounds out of a large chemical space to then subsequently prioritize the molecules with the highest potential to become a drug. Although multiple approaches to assess compounds have been developed over decades, the quality of these predictors is often not good enough and compounds that agree with the respective estimates are not necessarily druglike. Here, we report a novel deep learning approach that leverages large-scale predictions of ∼100 ADMET assays to assess the potential of a compound to become a relevant drug candidate. The resulting score, which we termed bPK score, substantially outperforms previous approaches and showed strong discriminative performance on data sets where previous approaches did not.
Affiliation(s)
- Maximilian Beckers
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
| | - Noé Sturm
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
| | - Finton Sirockin
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
| | - Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
| | - Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
| |
|
19
|
Li B, Lin M, Chen T, Wang L. FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Brief Bioinform 2023; 24:bbad398. [PMID: 37930026 DOI: 10.1093/bib/bbad398] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2023] [Revised: 09/25/2023] [Accepted: 10/14/2023] [Indexed: 11/07/2023] Open
Abstract
Artificial intelligence-based molecular property prediction plays a key role in molecular design such as bioactive molecules and functional materials. In this study, we propose a self-supervised pretraining deep learning (DL) framework, called functional group bidirectional encoder representations from transformers (FG-BERT), pretrained on ~1.45 million unlabeled drug-like molecules, to learn meaningful representation of molecules from functional groups. The pretrained FG-BERT framework can be fine-tuned to predict molecular properties. Compared to state-of-the-art (SOTA) machine learning and DL methods, we demonstrate the high performance of FG-BERT in evaluating molecular properties in tasks involving physical chemistry, biophysics and physiology across 44 benchmark datasets. In addition, FG-BERT utilizes attention mechanisms to focus on FG features that are critical to the target properties, thereby providing excellent interpretability for downstream training tasks. Collectively, FG-BERT does not require any artificially crafted features as input and has excellent interpretability, providing an out-of-the-box framework for developing SOTA models for a variety of molecule (especially for drug) discovery tasks.
Affiliation(s)
- Biaoshun Li
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Mujie Lin
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| | - Tiegen Chen
- Zhongshan Institute for Drug Discovery, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Room 109, Building C, SSIP Healthcare and Medicine Demonstration Zone, Zhongshan Tsuihang New District, Zhongshan, Guangdong, 528400, China
| | - Ling Wang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Ministry of Education, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
| |
|
20
|
Wojtuch A, Danel T, Podlewska S, Maziarka Ł. Extended study on atomic featurization in graph neural networks for molecular property prediction. J Cheminform 2023; 15:81. [PMID: 37726841 PMCID: PMC10507875 DOI: 10.1186/s13321-023-00751-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Accepted: 08/23/2023] [Indexed: 09/21/2023] Open
Abstract
Graph neural networks have recently become a standard method for analyzing chemical compounds. In the field of molecular property prediction, the emphasis is now on designing new model architectures, and the importance of atom featurization is oftentimes belittled. When contrasting two graph neural networks, the use of different representations possibly leads to incorrect attribution of the results solely to the network architecture. To better understand this issue, we compare multiple atom representations by evaluating them on the prediction of free energy, solubility, and metabolic stability using graph convolutional networks. We discover that the choice of atom representation has a significant impact on model performance and that the optimal subset of features is task-specific. Additional experiments involving more sophisticated architectures, including graph transformers, support these findings. Moreover, we demonstrate that some commonly used atom features, such as the number of neighbors or the number of hydrogens, can be easily predicted using only information about bonds and atom type, yet their explicit inclusion in the representation has a positive impact on model performance. Finally, we explain the predictions of the best-performing models to better understand how they utilize the available atomic features.
Affiliation(s)
- Agnieszka Wojtuch
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland.
| | - Tomasz Danel
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland
| | - Sabina Podlewska
- Maj Institute of Pharmacology, Polish Academy of Sciences, Smętna 12, 31-343, Kraków, Poland
| | - Łukasz Maziarka
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Kraków, Poland
| |
|
21
|
Wu Y, Ni X, Wang Z, Feng W. Enhancing drug property prediction with dual-channel transfer learning based on molecular fragment. BMC Bioinformatics 2023; 24:293. [PMID: 37479969 PMCID: PMC10360281 DOI: 10.1186/s12859-023-05413-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2023] [Accepted: 07/13/2023] [Indexed: 07/23/2023] Open
Abstract
BACKGROUND: Accurate prediction of molecular property holds significance in contemporary drug discovery and medical research. Recent advances in AI-driven molecular property prediction have shown promising results. Due to the costly annotation of in vitro and in vivo experiments, transfer learning paradigm has been gaining momentum in extracting general self-supervised information to facilitate neural network learning. However, prior pretraining strategies have overlooked the necessity of explicitly incorporating domain knowledge, especially the molecular fragments, into model design, resulting in the under-exploration of the molecular semantic space. RESULTS: We propose an effective model with FRagment-based dual-channEL pretraining (FREL). Equipped with molecular fragments, FREL comprehensively employs masked autoencoder and contrastive learning to learn intra- and inter-molecule agreement, respectively. We further conduct extensive experiments on ten public datasets to demonstrate its superiority over state-of-the-art models. Further investigations and interpretations manifest the underlying relationship between molecular representations and molecular properties. CONCLUSIONS: Our proposed model FREL achieves state-of-the-art performance on the benchmark datasets, emphasizing the importance of incorporating molecular fragments into model design. The expressiveness of learned molecular representations is also investigated by visualization and correlation analysis. Case studies indicate that the learned molecular representations better capture the drug property variation and fragment semantics.
Affiliation(s)
- Yue Wu
- College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine, Jinan, China
| | - Xinran Ni
- College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, China
| | - Zhihao Wang
- College of Intelligence and Information Engineering, Shandong University of Traditional Chinese Medicine, Jinan, China
| | - Weike Feng
- College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine, Jinan, China.
| |
|
22
|
Moshkov N, Becker T, Yang K, Horvath P, Dancik V, Wagner BK, Clemons PA, Singh S, Carpenter AE, Caicedo JC. Predicting compound activity from phenotypic profiles and chemical structures. Nat Commun 2023; 14:1967. [PMID: 37031208 PMCID: PMC10082762 DOI: 10.1038/s41467-023-37570-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Accepted: 03/23/2023] [Indexed: 04/10/2023] Open
Abstract
Predicting assay results for compounds virtually using chemical structures and phenotypic profiles has the potential to reduce the time and resources of screens for drug discovery. Here, we evaluate the relative strength of three high-throughput data sources-chemical structures, imaging (Cell Painting), and gene-expression profiles (L1000)-to predict compound bioactivity using a historical collection of 16,170 compounds tested in 270 assays for a total of 585,439 readouts. All three data modalities can predict compound activity for 6-10% of assays, and in combination they predict 21% of assays with high accuracy, which is a 2 to 3 times higher success rate than using a single modality alone. In practice, the accuracy of predictors could be lower and still be useful, increasing the assays that can be predicted from 37% with chemical structures alone up to 64% when combined with phenotypic data. Our study shows that unbiased phenotypic profiling can be leveraged to enhance compound bioactivity prediction to accelerate the early stages of the drug-discovery process.
Affiliation(s)
- Nikita Moshkov
- Broad Institute of MIT and Harvard, Cambridge, USA
- Biological Research Centre, Szeged, Hungary
| | - Tim Becker
- Broad Institute of MIT and Harvard, Cambridge, USA
| | | | | | - Vlado Dancik
- Broad Institute of MIT and Harvard, Cambridge, USA
| | | | | | | | | | | |
|
23
|
Jung S, Vatheuer H, Czodrowski P. VSFlow: an open-source ligand-based virtual screening tool. J Cheminform 2023; 15:40. [PMID: 37004101 PMCID: PMC10064649 DOI: 10.1186/s13321-023-00703-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2022] [Accepted: 02/18/2023] [Indexed: 04/03/2023] Open
Abstract
Ligand-based virtual screening is a widespread method in modern drug design. It allows for a rapid screening of large compound databases in order to identify similar structures. Here we report an open-source command line tool which includes a substructure-, fingerprint- and shape-based virtual screening. Most of the implemented features fully rely on the RDKit cheminformatics framework. VSFlow accepts a wide range of input file formats and is highly customizable. Additionally, a quick visualization of the screening results as pdf and/or pymol file is supported.
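As a rough illustration of the fingerprint-based screening mode mentioned in this abstract, the following minimal sketch ranks library compounds by Tanimoto similarity to a query. It is not VSFlow's code and does not call RDKit: the fingerprints are toy sets of on-bit indices, and the names `tanimoto`, `screen`, the 0.5 cutoff, and the compound labels are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def screen(
    query: Set[int], library: Dict[str, Set[int]], cutoff: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs at or above the cutoff, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, sim) for name, sim in scored if sim >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data (hypothetical): each set lists the bits switched on in a fingerprint.
query = {1, 4, 7, 9}
library = {
    "cpd_A": {1, 4, 7, 9},   # identical to the query
    "cpd_B": {1, 4, 7, 12},  # 3 shared bits out of 5 distinct bits
    "cpd_C": {2, 3, 5},      # no shared bits
}

results = screen(query, library)
for name, sim in results:
    print(f"{name}\t{sim:.2f}")
```

In a real workflow the bit sets would come from a cheminformatics toolkit rather than being typed in by hand, and the cutoff would be tuned to the fingerprint type.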
Affiliation(s)
- Sascha Jung
- Department of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, 44227 Dortmund, Germany (GRID: grid.5675.1; ISNI: 0000 0001 0416 9637)
| | - Helge Vatheuer
- Department of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, 44227 Dortmund, Germany (GRID: grid.5675.1; ISNI: 0000 0001 0416 9637)
| | - Paul Czodrowski
- Department of Chemistry, Johannes Gutenberg University Mainz, Duesbergweg 10-14, 55128 Mainz, Germany (GRID: grid.5802.f; ISNI: 0000 0001 1941 7111)
| |
|
24
|
Koutroumpa NM, Papavasileiou KD, Papadiamantis AG, Melagraki G, Afantitis A. A Systematic Review of Deep Learning Methodologies Used in the Drug Discovery Process with Emphasis on In Vivo Validation. Int J Mol Sci 2023; 24:6573. [PMID: 37047543 PMCID: PMC10095548 DOI: 10.3390/ijms24076573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2022] [Revised: 03/24/2023] [Accepted: 03/28/2023] [Indexed: 04/05/2023] Open
Abstract
The discovery and development of new drugs are extremely long and costly processes. Recent progress in artificial intelligence has made a positive impact on the drug development pipeline. Numerous challenges have been addressed with the growing exploitation of drug-related data and the advancement of deep learning technology. Several model frameworks have been proposed to enhance the performance of deep learning algorithms in molecular design. However, only a few have had an immediate impact on drug development since computational results may not be confirmed experimentally. This systematic review aims to summarize the different deep learning architectures used in the drug discovery process and are validated with further in vivo experiments. For each presented study, the proposed molecule or peptide that has been generated or identified by the deep learning model has been biologically evaluated in animal models. These state-of-the-art studies highlight that even if artificial intelligence in drug discovery is still in its infancy, it has great potential to accelerate the drug discovery cycle, reduce the required costs, and contribute to the integration of the 3R (Replacement, Reduction, Refinement) principles. Out of all the reviewed scientific articles, seven algorithms were identified: recurrent neural networks, specifically, long short-term memory (LSTM-RNNs), Autoencoders (AEs) and their Wasserstein Autoencoders (WAEs) and Variational Autoencoders (VAEs) variants; Convolutional Neural Networks (CNNs); Direct Message Passing Neural Networks (D-MPNNs); and Multitask Deep Neural Networks (MTDNNs). LSTM-RNNs were the most used architectures with molecules or peptide sequences as inputs.
Affiliation(s)
- Nikoletta-Maria Koutroumpa
- Department of ChemoInformatics, NovaMechanics Ltd., Nicosia 1070, Cyprus
- School of Chemical Engineering, National Technical University of Athens, 157 80 Athens, Greece
- Division of Data Driven Innovation, Entelos Institute, Larnaca 6059, Cyprus
| | - Konstantinos D. Papavasileiou
- Department of ChemoInformatics, NovaMechanics Ltd., Nicosia 1070, Cyprus
- Division of Data Driven Innovation, Entelos Institute, Larnaca 6059, Cyprus
- Department of ChemoInformatics, NovaMechanics MIKE., 185 45 Piraeus, Greece
| | - Anastasios G. Papadiamantis
- Department of ChemoInformatics, NovaMechanics Ltd., Nicosia 1070, Cyprus
- Division of Data Driven Innovation, Entelos Institute, Larnaca 6059, Cyprus
| | - Georgia Melagraki
- Division of Physical Sciences & Applications, Hellenic Military Academy, 166 73 Vari, Greece
| | - Antreas Afantitis
- Department of ChemoInformatics, NovaMechanics Ltd., Nicosia 1070, Cyprus
- Division of Data Driven Innovation, Entelos Institute, Larnaca 6059, Cyprus
- Department of ChemoInformatics, NovaMechanics MIKE., 185 45 Piraeus, Greece
| |
|
25
|
Ju W, Liu Z, Qin Y, Feng B, Wang C, Guo Z, Luo X, Zhang M. Few-shot Molecular Property Prediction via Hierarchically Structured Learning on Relation Graphs. Neural Netw 2023; 163:122-131. [PMID: 37037059 DOI: 10.1016/j.neunet.2023.03.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2022] [Revised: 01/25/2023] [Accepted: 03/22/2023] [Indexed: 04/12/2023]
Abstract
This paper studies few-shot molecular property prediction, which is a fundamental problem in cheminformatics and drug discovery. More recently, graph neural network based model has gradually become the theme of molecular property prediction. However, there is a natural deficiency for existing methods, that is, the scarcity of molecules with desired properties, which makes it hard to build an effective predictive model. In this paper, we propose a novel framework called Hierarchically Structured Learning on Relation Graphs (HSL-RG) for molecular property prediction, which explores the structural semantics of a molecule from both global-level and local-level granularities. Technically, we first leverage graph kernels to construct relation graphs to globally communicate molecular structural knowledge from neighboring molecules and then design self-supervised learning signals of structure optimization to locally learn transformation-invariant representations from molecules themselves. Moreover, we propose a task-adaptive meta-learning algorithm to provide meta knowledge customization for different tasks in few-shot scenarios. Experiments on multiple real-life benchmark datasets show that HSL-RG is superior to existing state-of-the-art approaches.
Affiliation(s)
- Wei Ju
- National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, 100871, China
- Zequn Liu
- National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, 100871, China
- Yifang Qin
- School of EECS, Peking University, Beijing, 100871, China
- Bin Feng
- National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, 100871, China
- Chen Wang
- College of Chemistry, Nankai University, Tianjin, 300071, China
- Zhihui Guo
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Xiao Luo
- Department of Computer Science, University of California, Los Angeles, 90024, USA.
- Ming Zhang
- National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, 100871, China.
|
26
|
Kwon Y, Park S, Lee J, Kang J, Lee HJ, Kim W. BEAR: A Novel Virtual Screening Method Based on Large-Scale Bioactivity Data. J Chem Inf Model 2023; 63:1429-1437. [PMID: 36821004 DOI: 10.1021/acs.jcim.2c01300] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/24/2023]
Abstract
Data-driven drug discovery exploits a comprehensive set of big data to provide an efficient path for the development of new drugs. Currently, publicly available bioassay data sets provide extensive information regarding the bioactivity profiles of millions of compounds. Using these large-scale drug screening data sets, we developed a novel in silico method to virtually screen hit compounds against protein targets, named BEAR (Bioactive compound Enrichment by Assay Repositioning). The underlying idea of BEAR is to reuse bioassay data for predicting hit compounds for targets other than their originally intended purposes, i.e., "assay repositioning". The BEAR approach differs from conventional virtual screening methods in that (1) it relies solely on bioactivity data and requires no physicochemical features of either the target or ligand. (2) Accordingly, structurally diverse candidates are predicted, allowing for scaffold hopping. (3) BEAR shows stable performance across diverse target classes, suggesting its general applicability. Large-scale cross-validation of more than a thousand targets showed that BEAR accurately predicted known ligands (median area under the curve = 0.87), proving that BEAR maintained a robust performance even in the validation set with additional constraints. In addition, a comparative analysis demonstrated that BEAR outperformed other machine learning models, including a recent deep learning model for ABC transporter family targets. We predicted P-gp and BCRP dual inhibitors using the BEAR approach and validated the predicted candidates using in vitro assays. The intracellular accumulation effects of mitoxantrone, a well-known P-gp/BCRP dual substrate for cancer treatment, confirmed nine out of 72 dual inhibitor candidates preselected by primary cytotoxicity screening. 
Consequently, these nine hits are novel and potent dual inhibitors for both P-gp and BCRP, solely predicted by bioactivity profiles without relying on any structural information of targets or ligands.
Affiliation(s)
- Sera Park
- KaiPharm, Seoul 03760, Republic of Korea
- Jaeok Lee
- College of Pharmacy, Research Institute of Pharmaceutical Science, Ewha Womans University, Seoul 03760, Republic of Korea
- Jiyeon Kang
- College of Pharmacy and Graduate School of Pharmaceutical Sciences, Ewha Womans University, Seoul 03760, Republic of Korea
- Hwa Jeong Lee
- College of Pharmacy and Graduate School of Pharmaceutical Sciences, Ewha Womans University, Seoul 03760, Republic of Korea
- Wankyu Kim
- KaiPharm, Seoul 03760, Republic of Korea; Department of Life Sciences, College of Natural Science, Ewha Womans University, Seoul 03760, Republic of Korea
|
27
|
Mensa S, Sahin E, Tacchino F, Kl Barkoutsos P, Tavernelli I. Quantum machine learning framework for virtual screening in drug discovery: a prospective quantum advantage. MACHINE LEARNING: SCIENCE AND TECHNOLOGY 2023. [DOI: 10.1088/2632-2153/acb900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/19/2023] Open
Abstract
Machine learning for ligand-based virtual screening (LB-VS) is an important in silico tool for discovering new drugs in a faster and cost-effective manner, especially for emerging diseases such as COVID-19. In this paper, we propose a general-purpose framework combining a classical Support Vector Classifier algorithm with quantum kernel estimation for LB-VS on real-world databases, and we argue in favor of its prospective quantum advantage. Indeed, we heuristically prove that our quantum integrated workflow can, at least in some relevant instances, provide a tangible advantage compared to state-of-the-art classical algorithms operating on the same datasets, showing strong dependence on target and feature selection method. Finally, we test our algorithm on IBM Quantum processors using ADRB2 and COVID-19 datasets, showing that hardware simulations provide results in line with the predicted performances and can surpass classical equivalents.
|
28
|
Bhadwal AS, Kumar K, Kumar N. GenSMILES: An enhanced validity conscious representation for inverse design of molecules. Knowl Based Syst 2023. [DOI: 10.1016/j.knosys.2023.110429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/16/2023]
|
29
|
Kanakala G, Aggarwal R, Nayar D, Priyakumar UD. Latent Biases in Machine Learning Models for Predicting Binding Affinities Using Popular Data Sets. ACS OMEGA 2023; 8:2389-2397. [PMID: 36687059 PMCID: PMC9850481 DOI: 10.1021/acsomega.2c06781] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/20/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
Drug design involves the process of identifying and designing molecules that bind well to a given receptor. A vital computational component of this process is the protein-ligand interaction scoring functions that evaluate the binding ability of various molecules or ligands with a given protein receptor binding pocket reasonably accurately. With the publicly available protein-ligand binding affinity data sets in both sequential and structural forms, machine learning methods have gained traction as a top choice for developing such scoring functions. While the performance shown by these models is optimistic, there are several hidden biases present in these data sets themselves that affect the utility of such models for practical purposes such as virtual screening. In this work, we use published methods to systematically investigate several such factors or biases present in these data sets. In our analysis, we highlight the importance of considering sequence, protein-ligand interaction, and pocket structure similarity while constructing data splits and provide an explanation for good protein-only and ligand-only performances in some data sets. Through this study, we provide to the community several pointers for the design of binding affinity predictors and data sets for reliable applicability.
Affiliation(s)
- Rishal Aggarwal
- International Institute of Information Technology, Hyderabad 500 032, India
- Divya Nayar
- Department of Materials Science and Engineering, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India
- U. Deva Priyakumar
- International Institute of Information Technology, Hyderabad 500 032, India
|
30
|
Abstract
The discovery of new hits through ligand-based virtual screening in drug discovery is essentially a low-data problem, as data acquisition is both difficult and expensive. The requirement for large amounts of training data hinders the application of conventional machine learning techniques to this problem domain. This work explores few-shot machine learning for hit discovery and lead optimization. We build on the state-of-the-art and introduce two new metric-based meta-learning techniques, Prototypical and Relation Networks, to this problem domain. We also explore using different embeddings, namely, extended-connectivity fingerprints (ECFP) and embeddings generated through graph convolutional networks (GCN), as inputs to neural networks for classification. This study shows that learned embeddings through GCNs consistently perform better than extended-connectivity fingerprints for toxicity and LBVS experiments. We conclude that the effectiveness of few-shot learning is highly dependent on the nature of the data. Few-shot learning models struggle to perform consistently on MUV and DUD-E data, in which the active compounds are structurally distinct. However, on Tox21 data, the few-shot models perform well, and we find that Prototypical Networks outperform the state-of-the-art, which is based on the Matching Networks architecture. Additionally, training these networks is substantially faster (up to 190%) and therefore takes a fraction of the time to train for comparable, or better, results.
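The metric-based idea behind Prototypical Networks mentioned in this abstract can be illustrated with a minimal sketch. It assumes the embeddings are already computed (in the paper they come from ECFP or GCN inputs passed through a trained network); the toy two-dimensional vectors, the class labels, and the function names `prototype` and `classify` are hypothetical and only show the nearest-prototype decision rule, not the authors' implementation.

```python
import math


def prototype(embeddings):
    """Component-wise mean of a list of equal-length embedding vectors."""
    count = len(embeddings)
    return [sum(column) / count for column in zip(*embeddings)]


def classify(query, support):
    """Return the class whose prototype is nearest (Euclidean) to `query`."""
    prototypes = {label: prototype(vectors) for label, vectors in support.items()}
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Hypothetical few-shot support set: two embedded molecules per class.
support_set = {
    "active": [[0.9, 0.8], [1.1, 1.0]],
    "inactive": [[-1.0, -0.9], [-0.8, -1.1]],
}

prediction = classify([0.7, 0.6], support_set)
print(prediction)
```

In a trained model the embedding function is learned so that this distance-to-prototype rule separates the classes; the sketch keeps the embeddings fixed to isolate the classification step.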
Affiliation(s)
- Daniel Vella
- Department of Artificial Intelligence, University of Malta, Msida MSD 2080, Malta
- Jean-Paul Ebejer
- Department of Artificial Intelligence, University of Malta, Msida MSD 2080, Malta; Centre for Molecular Medicine and Biobanking, University of Malta, Msida MSD 2080, Malta
|
31
|
Béquignon OJM, Bongers BJ, Jespers W, IJzerman AP, van der Water B, van Westen GJP. Papyrus: a large-scale curated dataset aimed at bioactivity predictions. J Cheminform 2023; 15:3. [PMID: 36609528 PMCID: PMC9824924 DOI: 10.1186/s13321-022-00672-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2022] [Accepted: 12/17/2022] [Indexed: 01/07/2023] Open
Abstract
With the ongoing rapid growth of publicly available ligand-protein bioactivity data, there is a trove of valuable data that can be used to train a plethora of machine-learning algorithms. However, not all data is equal in terms of size and quality and a significant portion of researchers' time is needed to adapt the data to their needs. On top of that, finding the right data for a research question can often be a challenge on its own. To meet these challenges, we have constructed the Papyrus dataset. Papyrus is comprised of around 60 million data points. This dataset contains multiple large publicly available datasets such as ChEMBL and ExCAPE-DB combined with several smaller datasets containing high-quality data. The aggregated data has been standardised and normalised in a manner that is suitable for machine learning. We show how data can be filtered in a variety of ways and also perform some examples of quantitative structure-activity relationship analyses and proteochemometric modelling. Our ambition is that this pruned data collection constitutes a benchmark set that can be used for constructing predictive models, while also providing an accessible data source for research.
Affiliation(s)
- O. J. M. Béquignon
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- B. J. Bongers
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- W. Jespers
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- A. P. IJzerman
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- B. van der Water
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
- G. J. P. van Westen
- Division of Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Leiden University, Leiden, The Netherlands
|
32
|
de Souza LP, Fernie AR. Databases and Tools to Investigate Protein-Metabolite Interactions. Methods Mol Biol 2023; 2554:231-249. [PMID: 36178629 DOI: 10.1007/978-1-0716-2624-5_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Protein-metabolite interactions (PMIs) are directly responsible for the regulation of numerous processes. From the direct regulation of enzymes to complex developmental processes intermediated by hormones, PMIs are central to understanding the molecular mechanisms of important physiological phenomena. Still, proving such interactions experimentally has proven an arduous task. We discuss here some of the current technologies that help expand our knowledge of PMIs, with particular emphasis on platforms and databases to explore the highly heterogeneous nature of characterized PMIs, which is likely to be an essential resource for the development of new computational approaches to predict and validate interactions based on large-scale PMI screenings.
Affiliation(s)
- Alisdair R Fernie
- Max-Planck-Institute of Molecular Plant Physiology, Potsdam-Golm, Germany.
|
33
|
Ogawa K, Sakamoto D, Hosoki R. Computer Science Technology in Natural Products Research: A Review of Its Applications and Implications. Chem Pharm Bull (Tokyo) 2023; 71:486-494. [PMID: 37394596 DOI: 10.1248/cpb.c23-00039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/04/2023]
Abstract
Computational approaches to drug development are rapidly growing in popularity and have been used to produce significant results. Recent developments in information science have expanded databases and chemical informatics knowledge relating to natural products. Natural products have long been well-studied, and a large number of unique structures and remarkable active substances have been reported. Analyzing accumulated natural product knowledge using emerging computational science techniques is expected to yield more new discoveries. In this article, we discuss the current state of natural product research using machine learning. The basic concepts and frameworks of machine learning are summarized. Natural product research that utilizes machine learning is described in terms of the exploration of active compounds, automatic compound design, and application to spectral data. In addition, efforts to develop drugs for intractable diseases will be addressed. Lastly, we discuss key considerations for applying machine learning in this field. This paper aims to promote progress in natural product research by presenting the current state of computational science and chemoinformatics approaches in terms of its applications, strengths, limitations, and implications for the field.
Affiliation(s)
- Keiko Ogawa
- Laboratory of Regulatory Science, College of Pharmaceutical Sciences, Ritsumeikan University
- Daiki Sakamoto
- Laboratory of Regulatory Science, College of Pharmaceutical Sciences, Ritsumeikan University
- Rumiko Hosoki
- Laboratory of Regulatory Science, College of Pharmaceutical Sciences, Ritsumeikan University
|
34
|
Meyenburg C, Dolfus U, Briem H, Rarey M. Galileo: Three-dimensional searching in large combinatorial fragment spaces on the example of pharmacophores. J Comput Aided Mol Des 2023; 37:1-16. [PMID: 36418668 PMCID: PMC10032335 DOI: 10.1007/s10822-022-00485-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2022] [Accepted: 10/17/2022] [Indexed: 11/25/2022]
Abstract
Fragment spaces are an efficient way to model large chemical spaces using a handful of small fragments and a few connection rules. The development of Enamine's REAL Space has shown that large spaces of readily available compounds may be created this way. These are several orders of magnitude larger than previous libraries. So far, searching and navigating these spaces is mostly limited to topological approaches. A way to overcome this limitation is optimization via metaheuristics which can be combined with arbitrary scoring functions. Here we present Galileo, a novel Genetic Algorithm to sample fragment spaces. We showcase Galileo in combination with a novel pharmacophore mapping approach, called Phariety, enabling 3D searches in fragment spaces. We estimate the effectiveness of the approach with a small fragment space. Furthermore, we apply Galileo to two pharmacophore searches in the REAL Space, detecting hundreds of compounds fulfilling a HSP90 and a FXIa pharmacophore.
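As a rough illustration of sampling a fragment space with a genetic algorithm and an arbitrary scoring function, the following sketch evolves tuples of fragment indices. Everything in it is an assumption made for illustration: the three-slot toy space `FRAGMENTS`, the `score` function (a stand-in for a pharmacophore match), and the selection, crossover, and mutation choices. It is not Galileo's algorithm or the Phariety scoring scheme.

```python
import random

# Hypothetical fragment space: three connection slots, ten fragments each.
FRAGMENTS = [list(range(10)), list(range(10)), list(range(10))]


def score(compound):
    """Toy scoring function (higher is better), standing in for a 3D match."""
    target = (7, 2, 5)
    return -sum(abs(a - b) for a, b in zip(compound, target))


def evolve(generations=30, population_size=20, mutation_rate=0.3, seed=0):
    """Run a simple elitist genetic algorithm; return best compound and history."""
    rng = random.Random(seed)
    population = [
        tuple(rng.choice(slot) for slot in FRAGMENTS) for _ in range(population_size)
    ]
    history = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        history.append(score(population[0]))
        parents = population[: population_size // 2]  # keep the better half
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each slot is inherited from either parent.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            if rng.random() < mutation_rate:
                slot = rng.randrange(len(FRAGMENTS))
                child[slot] = rng.choice(FRAGMENTS[slot])  # swap one fragment
            children.append(tuple(child))
        population = parents + children
    population.sort(key=score, reverse=True)
    history.append(score(population[0]))
    return population[0], history


best, history = evolve()
print(best, history[-1])
```

Because the better half of each generation is carried over unchanged, the best score in `history` never decreases; a real fragment space would additionally need connection rules that decide which fragments may be combined.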
Affiliation(s)
- Christian Meyenburg
- Universität Hamburg, ZBH - Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146, Hamburg, Germany
- Uschi Dolfus
- Universität Hamburg, ZBH - Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146, Hamburg, Germany
- Hans Briem
- Research & Development, Pharmaceuticals, Computational Molecular Design Berlin, Bayer AG, Building S110, 711, 13342, Berlin, Germany
- Matthias Rarey
- Universität Hamburg, ZBH - Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146, Hamburg, Germany.
|
35
|
Blanes-Mira C, Fernández-Aguado P, de Andrés-López J, Fernández-Carvajal A, Ferrer-Montiel A, Fernández-Ballester G. Comprehensive Survey of Consensus Docking for High-Throughput Virtual Screening. Molecules 2022; 28:molecules28010175. [PMID: 36615367 PMCID: PMC9821981 DOI: 10.3390/molecules28010175] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2022] [Revised: 12/19/2022] [Accepted: 12/21/2022] [Indexed: 12/28/2022] Open
Abstract
The rapid advances of 3D techniques for the structural determination of proteins and the development of numerous computational methods and strategies have led to identifying highly active compounds in computer drug design. Molecular docking is a method widely used in high-throughput virtual screening campaigns to filter potential ligands targeted to proteins. A great variety of docking programs are currently available, which differ in the algorithms and approaches used to predict the binding mode and the affinity of the ligand. All programs heavily rely on scoring functions to accurately predict ligand binding affinity, and despite differences in performance, none of these docking programs is preferable to the others. To overcome this problem, consensus scoring methods improve the outcome of virtual screening by averaging the rank or score of individual molecules obtained from different docking programs. The successful application of consensus docking in high-throughput virtual screening highlights the need to optimize the predictive power of molecular docking methods.
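The rank-averaging form of consensus scoring described in this abstract can be sketched as follows. The program names, ligand identifiers, and scores are hypothetical toy values, and the helper names `rank_by_score` and `consensus_rank` are introduced only for illustration; the sketch assumes that lower (more negative) docking scores are better and that every program scored the same ligands.

```python
from statistics import mean


def rank_by_score(scores, lower_is_better=True):
    """Map each ligand to its 1-based rank within one docking program."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {ligand: position for position, ligand in enumerate(ordered, start=1)}


def consensus_rank(per_program_scores, lower_is_better=True):
    """Average each ligand's rank across programs; return ligands best-first."""
    per_program_ranks = [
        rank_by_score(scores, lower_is_better)
        for scores in per_program_scores.values()
    ]
    # Only ligands scored by every program can be averaged fairly.
    ligands = set.intersection(*(set(ranks) for ranks in per_program_ranks))
    mean_rank = {
        ligand: mean(ranks[ligand] for ranks in per_program_ranks)
        for ligand in ligands
    }
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))


# Hypothetical docking scores from three programs (more negative = better).
toy_scores = {
    "program_a": {"lig1": -9.1, "lig2": -7.4, "lig3": -8.0},
    "program_b": {"lig1": -6.5, "lig2": -8.8, "lig3": -7.0},
    "program_c": {"lig1": -10.2, "lig2": -6.1, "lig3": -9.5},
}

consensus = consensus_rank(toy_scores)
print(consensus)
```

Averaging ranks rather than raw scores sidesteps the fact that different programs report scores on different scales; averaging normalized scores is the other variant the abstract mentions.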
|
36
|
Chang Y, Hawkins BA, Du JJ, Groundwater PW, Hibbs DE, Lai F. A Guide to In Silico Drug Design. Pharmaceutics 2022; 15:pharmaceutics15010049. [PMID: 36678678 PMCID: PMC9867171 DOI: 10.3390/pharmaceutics15010049] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2022] [Revised: 12/16/2022] [Accepted: 12/17/2022] [Indexed: 12/28/2022] Open
Abstract
The drug discovery process is a rocky path that is full of challenges, with the result that very few candidates progress from hit compound to a commercially available product, often due to factors, such as poor binding affinity, off-target effects, or physicochemical properties, such as solubility or stability. This process is further complicated by high research and development costs and time requirements. It is thus important to optimise every step of the process in order to maximise the chances of success. As a result of the recent advancements in computer power and technology, computer-aided drug design (CADD) has become an integral part of modern drug discovery to guide and accelerate the process. In this review, we present an overview of the important CADD methods and applications, such as in silico structure prediction, refinement, modelling and target validation, that are commonly used in this area.
Affiliation(s)
- Yiqun Chang
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Bryson A. Hawkins
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Jonathan J. Du
- Department of Biochemistry, Emory University School of Medicine, Atlanta, GA 30322, USA
- Paul W. Groundwater
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- David E. Hibbs
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
- Felcia Lai
- Sydney Pharmacy School, Faculty of Medicine and Health, The University of Sydney, Camperdown, NSW 2006, Australia
|
37
|
Pan D, Quan L, Jin Z, Chen T, Wang X, Xie J, Wu T, Lyu Q. Multisource Attention-Mechanism-Based Encoder-Decoder Model for Predicting Drug-Drug Interaction Events. J Chem Inf Model 2022; 62:6258-6270. [PMID: 36449561 DOI: 10.1021/acs.jcim.2c01112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022]
Abstract
Many computational methods have been proposed to predict drug-drug interactions (DDIs), which can occur when combining drugs to treat various diseases, but most mainly utilize single-source features of drugs, which is inadequate for drug representation. To fill this gap, we propose two attention-mechanism-based encoder-decoder models that incorporate multisource information: one is MAEDDI, which can predict DDIs, and the other is MAEDDIE, which can make further DDI-associated event predictions for drug pairs with DDIs. To better express the drug feature, we used three encoding methods to encode the drugs, integrating the self-attention mechanism, cross-attention mechanism, and graph attention network to construct a multisource feature fusion network. Experiments showed that both MAEDDI and MAEDDIE performed better than some state-of-the-art methods in various validation attempts at different experimental tasks. The visualization analysis showed that the semantic features of drug pairs learned from our models had a good drug representation. In practice, MAEDDIE successfully screened 43 DDI events on favipiravir, an influenza antiviral drug, with a success rate of nearly 50%. Our model achieved competitive results, mainly owing to the design of sequence-based, structural, biochemical, and statistical multisource features. Moreover, different encoders constructed based on different features learn the interrelationship information between drug pairs, and the different representations of these drug pairs are incorporated to predict the target problem. All of these encoders were designed to better characterize the complex DDI relationships, allowing us to achieve high generalization in DDI and DDI-associated event predictions.
Affiliation(s)
- Deng Pan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Lijun Quan
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China; Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Zhi Jin
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Taoning Chen
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Xuejiao Wang
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Jingxin Xie
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China
- Tingfang Wu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China; Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Qiang Lyu
- School of Computer Science and Technology, Soochow University, Suzhou 215006, China; Province Key Lab for Information Processing Technologies, Soochow University, Suzhou 215006, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
|
38
|
Zhou D, Liu F, Zheng Y, Hu L, Huang T, Huang YS. Deffini: A family-specific deep neural network model for structure-based virtual screening. Comput Biol Med 2022; 151:106323. [PMID: 36436482 DOI: 10.1016/j.compbiomed.2022.106323] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2022] [Revised: 10/31/2022] [Accepted: 11/14/2022] [Indexed: 11/18/2022]
Abstract
Deep learning-based virtual screening methods have been shown to significantly improve the accuracy of traditional docking-based virtual screening methods. In this paper, we developed Deffini, a structure-based virtual screening neural network model. During training, Deffini learns protein-ligand docking poses to distinguish actives and decoys and then to predict whether a new ligand will bind to the protein target. Deffini outperformed Smina with an average AUC ROC of 0.92 and AUC PRC of 0.44 in 3-fold cross-validation on the benchmark dataset DUD-E. However, when tested on the maximum unbiased validation (MUV) dataset, Deffini achieved poor results with an average AUC ROC of 0.517. We used the family-specific training approach to train the model to improve the model performance and concluded that family-specific models performed better than the pan-family models. To explore the limits of the predictive power of the family-specific models, we constructed Kernie, a new protein kinase dataset consisting of 358 kinases. Deffini trained with the Kernie dataset outperformed all recent benchmarks on the MUV kinases, with an average AUC ROC of 0.745, which highlights the importance of quality datasets in improving the performance of deep neural network models and the importance of using family-specific models.
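The AUC ROC values quoted in this abstract measure how often a model scores an active above a decoy. A minimal sketch of that interpretation follows, computing the area as the fraction of active/decoy pairs ranked correctly (ties count one half). The labels and scores are hypothetical toy values, and the function name `roc_auc` is introduced here for illustration; it is not taken from the paper or from any library.

```python
def roc_auc(labels, scores):
    """AUC ROC as the probability that an active outscores a decoy."""
    actives = [s for label, s in zip(labels, scores) if label == 1]
    decoys = [s for label, s in zip(labels, scores) if label == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for active_score in actives:
        for decoy_score in decoys:
            if active_score > decoy_score:
                wins += 1.0
            elif active_score == decoy_score:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(actives) * len(decoys))


# Hypothetical screening output: 1 = active, 0 = decoy.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]

auc = roc_auc(toy_labels, toy_scores)
print(round(auc, 3))
```

A value near 0.5, such as the 0.517 reported on MUV, corresponds to ordering active/decoy pairs little better than chance. The pairwise loop is quadratic in the number of compounds, so a sort-based computation would be preferable for large libraries.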
Affiliation(s)
- Dixin Zhou
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China; Shenzhen Zhiyao Information Technology Co. Ltd., Shenzhen, Guangdong, China
- Fei Liu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China; University of Chinese Academy of Sciences, Beijing, 100049, China
- Yiwen Zheng
- Department of Statistics, Donghua University, 2999 North Renmin Road, Shanghai, 201620, China
- Liangjian Hu
- Department of Statistics, Donghua University, 2999 North Renmin Road, Shanghai, 201620, China
- Tao Huang
- Shenzhen Zhiyao Information Technology Co. Ltd., Shenzhen, Guangdong, China.
- Yu S Huang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China; Genecast Biotechnology Co. Ltd., Wuxi, China; University of Chinese Academy of Sciences, Beijing, 100049, China.
|
39
|
Morris CJ, Stern JA, Stark B, Christopherson M, Della Corte D. MILCDock: Machine Learning Enhanced Consensus Docking for Virtual Screening in Drug Discovery. J Chem Inf Model 2022; 62:5342-5350. [PMID: 36342217 DOI: 10.1021/acs.jcim.2c00705] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Molecular docking tools are regularly used to computationally identify new molecules in virtual screening for drug discovery. However, docking tools suffer from inaccurate scoring functions with widely varying performance on different proteins. To enable more accurate ranking of active over inactive ligands in virtual screening, we created a machine learning consensus docking tool, MILCDock, that uses predictions from five traditional molecular docking tools to predict the probability a ligand binds to a protein. MILCDock was trained and tested on data from both the DUD-E and LIT-PCBA docking datasets and shows improved performance over traditional molecular docking tools and other consensus docking methods on the DUD-E dataset. LIT-PCBA targets proved to be difficult for all methods tested. We also find that DUD-E data, although biased, can be effective in training machine learning tools if care is taken to avoid DUD-E's biases during training.
Affiliation(s)
- Connor J Morris
- Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Jacob A Stern
- Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States; Department of Computer Science, Brigham Young University, Provo, Utah 84602, United States
- Brenden Stark
- Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Max Christopherson
- Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
- Dennis Della Corte
- Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602, United States
|
40
|
Cai H, Zhang H, Zhao D, Wu J, Wang L. FP-GNN: a versatile deep learning architecture for enhanced molecular property prediction. Brief Bioinform 2022; 23:6702671. [PMID: 36124766 DOI: 10.1093/bib/bbac408] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2022] [Revised: 07/28/2022] [Accepted: 08/22/2022] [Indexed: 12/14/2022] Open
Abstract
Accurate prediction of molecular properties, such as physicochemical and bioactive properties, as well as ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties, remains a fundamental challenge for molecular design, especially for drug design and discovery. In this study, we advanced a novel deep learning architecture, termed FP-GNN (fingerprints and graph neural networks), which combined and simultaneously learned information from molecular graphs and fingerprints for molecular property prediction. To evaluate the FP-GNN model, we conducted experiments on 13 public datasets, an unbiased LIT-PCBA dataset and 14 phenotypic screening datasets for breast cell lines. Extensive evaluation results showed that compared to advanced deep learning and conventional machine learning algorithms, the FP-GNN algorithm achieved state-of-the-art performance on these datasets. In addition, we analyzed the influence of different molecular fingerprints, and the effects of molecular graphs and molecular fingerprints on the performance of the FP-GNN model. Analysis of the anti-noise ability and interpretation ability also indicated that FP-GNN was competitive in real-world situations. Collectively, FP-GNN algorithm can assist chemists, biologists and pharmacists in predicting and discovering better molecules with desired functions or properties.
Affiliation(s)
- Hanxuan Cai
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Huimin Zhang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Duancheng Zhao
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Jingxing Wu
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
- Ling Wang
- Guangdong Provincial Key Laboratory of Fermentation and Enzyme Engineering, Joint International Research Laboratory of Synthetic Biology and Medicine, Guangdong Provincial Engineering and Technology Research Center of Biopharmaceuticals, School of Biology and Biological Engineering, South China University of Technology, Guangzhou 510006, China
|
41
|
Hönig SMN, Lemmen C, Rarey M. Small molecule superposition: A comprehensive overview on pose scoring of the latest methods. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Sophia M. N. Hönig
- ZBH - Center for Bioinformatics, Universität Hamburg, Hamburg, Germany
- BioSolveIT, Sankt Augustin, Germany
- Matthias Rarey
- ZBH - Center for Bioinformatics, Universität Hamburg, Hamburg, Germany
|
42
|
DrugRep: an automatic virtual screening server for drug repurposing. Acta Pharmacol Sin 2022; 44:888-896. [PMID: 36216900 PMCID: PMC9549438 DOI: 10.1038/s41401-022-00996-2] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 06/02/2022] [Accepted: 09/02/2022] [Indexed: 12/01/2022] Open
Abstract
Computationally identifying new targets for existing drugs has drawn much attention in drug repurposing due to its advantages over de novo drugs, including low risk, low costs, and rapid pace. To facilitate the drug repurposing computation, we constructed an automated and parameter-free virtual screening server, namely DrugRep, which performed molecular 3D structure construction, binding pocket prediction, docking, similarity comparison and binding affinity screening in a fully automatic manner. DrugRep repurposed drugs not only by receptor-based screening but also by ligand-based screening. The former automatically detected possible binding pockets of the receptor with our cavity detection approach, and then performed batch docking over drugs with a widespread docking program, AutoDock Vina. The latter explored drugs using seven well-established similarity measuring tools, including our recently developed ligand-similarity-based methods LigMate and FitDock. DrugRep utilized easy-to-use graphic interfaces for the user operation, and offered interactive predictions with state-of-the-art accuracy. We expect that this freely available online drug repurposing tool could be beneficial to the drug discovery community. The web site is http://cao.labshare.cn/drugrep/.
|
43
|
Parastar H, Tauler R. Big (Bio)Chemical Data Mining Using Chemometric Methods: A Need for Chemists. Angew Chem Int Ed Engl 2022. [DOI: 10.1002/ange.201801134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Hadi Parastar
- Department of Chemistry, Sharif University of Technology, Tehran, Iran
- Roma Tauler
- Department of Environmental Chemistry, IDAEA-CSIC, 08034 Barcelona, Spain
|
44
|
Krasoulis A, Antonopoulos N, Pitsikalis V, Theodorakis S. DENVIS: Scalable and High-Throughput Virtual Screening Using Graph Neural Networks with Atomic and Surface Protein Pocket Features. J Chem Inf Model 2022; 62:4642-4659. [PMID: 36154119 DOI: 10.1021/acs.jcim.2c01057] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
Computational methods for virtual screening can dramatically accelerate early-stage drug discovery by identifying potential hits for a specified target. Docking algorithms traditionally use physics-based simulations to address this challenge by estimating the binding orientation of a query protein-ligand pair and a corresponding binding affinity score. Over the recent years, classical and modern machine learning architectures have shown potential for outperforming traditional docking algorithms. Nevertheless, most learning-based algorithms still rely on the availability of the protein-ligand complex binding pose, typically estimated via docking simulations, which leads to a severe slowdown of the overall virtual screening process. A family of algorithms processing target information at the amino acid sequence level avoid this requirement, however, at the cost of processing protein data at a higher representation level. We introduce deep neural virtual screening (DENVIS), an end-to-end pipeline for virtual screening using graph neural networks (GNNs). By performing experiments on two benchmark databases, we show that our method performs competitively to several docking-based, machine learning-based, and hybrid docking/machine learning-based algorithms. By avoiding the intermediate docking step, DENVIS exhibits several orders of magnitude faster screening times (i.e., higher throughput) than both docking-based and hybrid models. When compared to an amino acid sequence-based machine learning model with comparable screening times, DENVIS achieves dramatically better performance. Some key elements of our approach include protein pocket modeling using a combination of atomic and surface features, the use of model ensembles, and data augmentation via artificial negative sampling during model training. 
In summary, DENVIS achieves competitive to state-of-the-art virtual screening performance, while offering the potential to scale to billions of molecules using minimal computational resources.
|
45
|
A high quality, industrial data set for binding affinity prediction: performance comparison in different early drug discovery scenarios. J Comput Aided Mol Des 2022; 36:753-765. [PMID: 36153472 DOI: 10.1007/s10822-022-00478-x] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/24/2022] [Accepted: 09/15/2022] [Indexed: 10/14/2022]
Abstract
We release a new, high quality data set of 1162 PDE10A inhibitors with experimentally determined binding affinities together with 77 PDE10A X-ray co-crystal structures from a Roche legacy project. This data set is used to compare the performance of different 2D- and 3D-machine learning (ML) as well as empirical scoring functions for predicting binding affinities with high throughput. We simulate use cases that are relevant in the lead optimization phase of early drug discovery. ML methods perform well at interpolation, but poorly in extrapolation scenarios, which are most relevant to a real-world application. Moreover, we find that investing into the docking workflow for binding pose generation using multi-template docking is rewarded with an improved scoring performance. A combination of 2D-ML and 3D scoring using a modified piecewise linear potential shows best overall performance, combining information on the protein environment with learning from existing SAR data.
|
46
|
Yaseen A, Amin I, Akhter N, Ben-Hur A, Minhas F. Insights into performance evaluation of compound-protein interaction prediction methods. Bioinformatics 2022; 38:ii75-ii81. [PMID: 36124806 DOI: 10.1093/bioinformatics/btac496] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION: Machine-learning-based prediction of compound-protein interactions (CPIs) is important for drug design, screening and repurposing. Despite numerous recent publications with increasing methodological sophistication claiming consistent improvements in predictive accuracy, we have observed a number of fundamental issues in experiment design that produce overoptimistic estimates of model performance.
RESULTS: We systematically analyze the impact of several factors affecting generalization performance of CPI predictors that are overlooked in existing work: (i) similarity between training and test examples in cross-validation; (ii) synthesizing negative examples in absence of experimentally verified negative examples and (iii) alignment of evaluation protocol and performance metrics with real-world use of CPI predictors in screening large compound libraries. Using both state-of-the-art approaches by other researchers as well as a simple kernel-based baseline, we have found that effective assessment of generalization performance of CPI predictors requires careful control over similarity between training and test examples. We show that, under stringent performance assessment protocols, a simple kernel-based approach can exceed the predictive performance of existing state-of-the-art methods. We also show that random pairing for generating synthetic negative examples for training and performance evaluation results in models with better generalization in comparison to more sophisticated strategies used in existing studies. Our analyses indicate that using proposed experiment design strategies can offer significant improvements for CPI prediction leading to effective target compound screening for drug repurposing and discovery of putative chemical ligands of SARS-CoV-2-Spike and Human-ACE2 proteins.
AVAILABILITY AND IMPLEMENTATION: Code and supplementary material available at https://github.com/adibayaseen/HKRCPI.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Adiba Yaseen
- Department of Computer and Information Sciences (DCIS), Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad 45650, Pakistan
- Imran Amin
- National Institute for Biotechnology and Genetic Engineering, Faisalabad 38000, Pakistan
- Naeem Akhter
- Department of Computer and Information Sciences (DCIS), Pakistan Institute of Engineering and Applied Sciences (PIEAS), Islamabad 45650, Pakistan
- Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
- Fayyaz Minhas
- Department of Computer Science, University of Warwick, Coventry CV4 7AL, UK
|
47
|
Lim S, Lee S, Piao Y, Choi M, Bang D, Gu J, Kim S. On modeling and utilizing chemical compound information with deep learning technologies: A task-oriented approach. Comput Struct Biotechnol J 2022; 20:4288-4304. [PMID: 36051875 PMCID: PMC9399946 DOI: 10.1016/j.csbj.2022.07.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 11/22/2022] Open
Abstract
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, currently known compounds, though large, represent only a fraction of possible compounds, which is known as chemical space. Many of these compounds in the databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in the chemical compounds databases. We first compile what kind of tasks are trying to be accomplished by machine learning methods. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug related tasks. Next, we survey deep learning techniques to address the insufficiency issue of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug related tasks, thus we survey what kind of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
Affiliation(s)
- Sangsoo Lim
- Bioinformatics Institute, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sangseon Lee
- Institute of Computer Technology, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Yinhua Piao
- Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- MinGyu Choi
- Department of Chemistry, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Dongmin Bang
- Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Jeonghyeon Gu
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Sun Kim
- Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- MOGAM Institute for Biomedical Research, Yong-in 16924, South Korea
- AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
|
48
|
Ash JR, Hughes-Oliver JM. Confidence bands and hypothesis tests for hit enrichment curves. J Cheminform 2022; 14:50. [PMID: 35902962 PMCID: PMC9334420 DOI: 10.1186/s13321-022-00629-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Accepted: 06/28/2022] [Indexed: 11/24/2022] Open
Abstract
In virtual screening for drug discovery, hit enrichment curves are widely used to assess the performance of ranking algorithms with regard to their ability to identify early enrichment. Unfortunately, researchers almost never consider the uncertainty associated with estimating such curves before declaring differences between performance of competing algorithms. Uncertainty is often large because the testing fractions of interest to researchers are small. Appropriate inference is complicated by two sources of correlation that are often overlooked: correlation across different testing fractions within a single algorithm, and correlation between competing algorithms. Additionally, researchers are often interested in making comparisons along the entire curve, not only at a few testing fractions. We develop inferential procedures to address both the needs of those interested in a few testing fractions, as well as those interested in the entire curve. For the former, four hypothesis testing and (pointwise) confidence intervals are investigated, and a newly developed EmProc approach is found to be most effective. For inference along entire curves, EmProc-based confidence bands are recommended for simultaneous coverage and minimal width. While we focus on the hit enrichment curve, this work is also appropriate for lift curves that are used throughout the machine learning community. Our inferential procedures trivially extend to enrichment factors, as well.
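For orientation, the quantities this abstract refers to can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `hit_enrichment_curve` and `enrichment_factor` are hypothetical, higher scores are assumed to rank first, and the top-k cutoff is rounded to at least one compound. It returns point estimates only and does not implement the EmProc confidence bands or hypothesis tests described above.

```python
from typing import List, Sequence


def hit_enrichment_curve(
    scores: Sequence[float], labels: Sequence[int], fractions: Sequence[float]
) -> List[float]:
    """Fraction of all actives recovered within each top-scoring testing fraction."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Rank compounds from highest to lowest score (assumed: higher = better).
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    curve = []
    for f in fractions:
        k = min(n, max(1, round(f * n)))  # number of compounds tested
        curve.append(sum(ranked[:k]) / total_actives)
    return curve


def enrichment_factor(
    scores: Sequence[float], labels: Sequence[int], fraction: float
) -> float:
    """Hit rate in the top fraction divided by the hit rate of the whole library."""
    n = len(scores)
    k = min(n, max(1, round(fraction * n)))
    recall = hit_enrichment_curve(scores, labels, [fraction])[0]
    return recall * n / k
```

With ten compounds of which two are active, a testing fraction of 0.1 inspects a single compound, which shows why estimates at small fractions carry the large uncertainty the abstract warns about.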
Affiliation(s)
- Jeremy R Ash
- Department of Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, USA
- JMP Division, SAS Institute, Cary, USA
|
49
|
Yang C, Chen EA, Zhang Y. Protein-Ligand Docking in the Machine-Learning Era. Molecules 2022; 27:4568. [PMID: 35889440 PMCID: PMC9323102 DOI: 10.3390/molecules27144568] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/03/2022] [Accepted: 07/14/2022] [Indexed: 11/16/2022] Open
Abstract
Molecular docking plays a significant role in early-stage drug discovery, from structure-based virtual screening (VS) to hit-to-lead optimization, and its capability and predictive power is critically dependent on the protein-ligand scoring function. In this review, we give a broad overview of recent scoring function development, as well as the docking-based applications in drug discovery. We outline the strategies and resources available for structure-based VS and discuss the assessment and development of classical and machine learning protein-ligand scoring functions. In particular, we highlight the recent progress of machine learning scoring function ranging from descriptor-based models to deep learning approaches. We also discuss the general workflow and docking protocols of structure-based VS, such as structure preparation, binding site detection, docking strategies, and post-docking filter/re-scoring, as well as a case study on the large-scale docking-based VS test on the LIT-PCBA data set.
Affiliation(s)
- Chao Yang
- Department of Chemistry, New York University, New York, NY 10003, USA
- Eric Anthony Chen
- Department of Chemistry, New York University, New York, NY 10003, USA
- Yingkai Zhang
- Department of Chemistry, New York University, New York, NY 10003, USA
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
|
50
|
Ligand-Enhanced Negative Images Optimized for Docking Rescoring. Int J Mol Sci 2022; 23:7871. [PMID: 35887220 PMCID: PMC9323918 DOI: 10.3390/ijms23147871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2022] [Revised: 07/14/2022] [Accepted: 07/15/2022] [Indexed: 12/04/2022] Open
Abstract
Despite the pivotal role of molecular docking in modern drug discovery, the default docking scoring functions often fail to recognize active ligands in virtual screening campaigns. Negative image-based rescoring improves docking enrichment by comparing the shape/electrostatic potential (ESP) of the flexible docking poses against the target protein’s inverted cavity volume. By optimizing these negative image-based (NIB) models using a greedy search, the docking rescoring yield can be improved massively and consistently. Here, a fundamental modification is implemented to this shape-focused pharmacophore modelling approach—actual ligand 3D coordinates are incorporated into the NIB models for the optimization. This hybrid approach, labelled as ligand-enhanced brute-force negative image-based optimization (LBR-NiB), takes the best from both worlds, i.e., the all-roundedness of the NIB models and the difficult to emulate atomic arrangements of actual protein-bound small-molecule ligands. Thorough benchmarking, focused on proinflammatory targets, shows that the LBR-NiB routinely improves the docking enrichment over prior iterations of the R-NiB methodology. This boost can be massive, if the added ligand information provides truly essential binding information that was lacking or completely missing from the cavity-based NIB model. On a practical level, the results indicate that the LBR-NiB typically works well when the added ligand 3D data originates from a high-quality source, such as X-ray crystallography, and, yet, the NIB model compositions can also sometimes be improved by fusing into them, for example, with flexibly docked solvent molecules. In short, the study demonstrates that the protein-bound ligands can be used to improve the shape/ESP features of the negative images for effective docking rescoring use in virtual screening.
|