1
Tian T, Li S, Zhang Z, Chen L, Zou Z, Zhao D, Zeng J. Benchmarking compound activity prediction for real-world drug discovery applications. Commun Chem 2024; 7:127. [PMID: 38834746 DOI: 10.1038/s42004-024-01204-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 05/16/2024] [Indexed: 06/06/2024] Open
Abstract
Identifying active compounds for target proteins is fundamental in early drug discovery. Recently, data-driven computational methods have demonstrated promising potential in predicting compound activities. However, a well-designed benchmark that comprehensively evaluates these methods from a practical perspective is still lacking. To fill this gap, we propose a Compound Activity benchmark for Real-world Applications (CARA). By carefully distinguishing assay types, designing train-test splitting schemes and selecting evaluation metrics, CARA accounts for the biased distribution of current real-world compound activity data and avoids overestimation of model performance. We observed that although current models can make successful predictions for certain proportions of assays, their performance varied across different assays. In addition, an evaluation of several few-shot training strategies showed that their performance depends on the task type. Overall, we provide a high-quality dataset for developing and evaluating compound activity prediction models, and the analyses in this work may inspire better applications of data-driven models in drug discovery.
Affiliation(s)
- Tingzhong Tian
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Shuya Li
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Ziting Zhang
  - Department of Automation, Tsinghua University, Beijing, China
  - MOE Key Laboratory of Bioinformatics, Tsinghua University, Beijing, China
- Lin Chen
  - Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
- Ziheng Zou
  - Silexon AI Technology Co., Ltd., Nanjing, Jiangsu Province, China
- Dan Zhao
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China
  - School of Engineering, Westlake University, Hangzhou, Zhejiang Province, China
2
Bajorath J. Chemical language models for molecular design. Mol Inform 2024; 43:e202300288. [PMID: 38010610 DOI: 10.1002/minf.202300288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/22/2023] [Accepted: 11/23/2023] [Indexed: 11/29/2023]
Abstract
In drug discovery, chemical language models (CLMs) originating from natural language processing offer new opportunities for molecular design. CLMs have been developed using recurrent neural network (RNN) or transformer architectures. For the predictive performance of RNN-based encoder-decoder frameworks and transformers, attention mechanisms play a central role. Among others, emerging application areas for CLMs include constrained generative modeling and the prediction of chemical reactions or drug-target interactions. Since CLMs are applicable to any compound or target data that can be presented in a sequential format and tokenized, mappings of different types of sequences can be learned. For example, active compounds can be predicted from protein sequence motifs. Novel off-the-beaten-path applications can also be considered. For example, analogue series from medicinal chemistry can be perceived and represented as chemical sequences and extended with new compounds using CLMs. Herein, methodological features of CLMs and different applications are discussed.
Affiliation(s)
- Jürgen Bajorath
  - Department of Life Science Informatics, Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115, Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115, Bonn, Germany
3
Djoumbou-Feunang Y, Wilmot J, Kinney J, Chanda P, Yu P, Sader A, Sharifi M, Smith S, Ou J, Hu J, Shipp E, Tomandl D, Kumpatla SP. Cheminformatics and artificial intelligence for accelerating agrochemical discovery. Front Chem 2023; 11:1292027. [PMID: 38093816 PMCID: PMC10716421 DOI: 10.3389/fchem.2023.1292027] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/10/2023] [Accepted: 11/09/2023] [Indexed: 10/17/2024] Open
Abstract
The global cost-benefit analysis of pesticide use during the last 30 years has been characterized by a significant increase during the period from 1990 to 2007 followed by a decline. This observation can be attributed to several factors including, but not limited to, pest resistance, lack of novelty with respect to modes of action or classes of chemistry, and regulatory action. Due to current and projected increases of the global population, it is evident that the demand for food, and consequently, the usage of pesticides to improve yields will increase. Addressing these challenges and needs while promoting new crop protection agents through an increasingly stringent regulatory landscape requires the development and integration of infrastructures for innovative, cost- and time-effective discovery and development of novel and sustainable molecules. Significant advances in artificial intelligence (AI) and cheminformatics over the last two decades have improved the decision-making power of research scientists in the discovery of bioactive molecules. AI- and cheminformatics-driven molecule discovery offers the opportunity of moving experiments from the greenhouse to a virtual environment where thousands to billions of molecules can be investigated at a rapid pace, providing unbiased hypotheses for lead generation, optimization, and effective suggestions for compound synthesis and testing. To date, this is illustrated to a far lesser extent in the publicly available agrochemical research literature compared to drug discovery. In this review, we provide an overview of the crop protection discovery pipeline and how traditional, cheminformatics, and AI technologies can help to address the needs and challenges of agrochemical discovery towards rapidly developing novel and more sustainable products.
Affiliation(s)
- Jeremy Wilmot
  - Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
- John Kinney
  - Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
- Pritam Chanda
  - Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
- Pulan Yu
  - Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
- Avery Sader
  - Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
- Max Sharifi
  - Corteva Agriscience, Regulatory and Stewardship, Indianapolis, IN, United States
- Scott Smith
  - Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
- Junjun Ou
  - Corteva Agriscience, Crop Protection Discovery and Development, Indianapolis, IN, United States
- Jie Hu
  - Corteva Agriscience, Farming Solutions and Digital, Indianapolis, IN, United States
- Elizabeth Shipp
  - Corteva Agriscience UK Limited, Regulation Innovation Center, Abingdon, United Kingdom
4
Niazi SK, Mariam Z. Recent Advances in Machine-Learning-Based Chemoinformatics: A Comprehensive Review. Int J Mol Sci 2023; 24:11488. [PMID: 37511247 PMCID: PMC10380192 DOI: 10.3390/ijms241411488] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 06/09/2023] [Revised: 06/30/2023] [Accepted: 07/12/2023] [Indexed: 07/30/2023] Open
Abstract
In modern drug discovery, the combination of chemoinformatics and quantitative structure-activity relationship (QSAR) modeling has emerged as a formidable alliance, enabling researchers to harness the vast potential of machine learning (ML) techniques for predictive molecular design and analysis. This review delves into the fundamental aspects of chemoinformatics, elucidating the intricate nature of chemical data and the crucial role of molecular descriptors in unveiling the underlying molecular properties. Molecular descriptors, including 2D fingerprints and topological indices, in conjunction with the structure-activity relationships (SARs), are pivotal in unlocking the pathway to small-molecule drug discovery. Technical intricacies of developing robust ML-QSAR models, including feature selection, model validation, and performance evaluation, are discussed herein. Various ML algorithms, such as regression analysis and support vector machines, are showcased in the text for their ability to predict and comprehend the relationships between molecular structures and biological activities. This review serves as a comprehensive guide for researchers, providing an understanding of the synergy between chemoinformatics, QSAR, and ML. By embracing these cutting-edge technologies, predictive molecular analysis holds promise for expediting the discovery of novel therapeutic agents in the pharmaceutical sciences.
Affiliation(s)
- Sarfaraz K Niazi
  - College of Pharmacy, University of Illinois, Chicago, IL 61820, USA
- Zamara Mariam
  - School of Interdisciplinary Engineering & Sciences (SINES), National University of Sciences & Technology (NUST), Islamabad 24090, Pakistan
5
6
Yoshimori A, Bajorath J. Computational analysis, alignment and extension of analogue series from medicinal chemistry. Future Sci OA 2022; 8:FSO804. [PMID: 36248066 PMCID: PMC9540237 DOI: 10.2144/fsoa-2022-0033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Accepted: 06/10/2022] [Indexed: 11/23/2022] Open
Affiliation(s)
- Atsushi Yoshimori
  - Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-chome, Fujisawa, Kanagawa, 251-0012, Japan
- Jürgen Bajorath
  - Department of Life Science Informatics & Data Science, B-IT, LIMES Program Unit Chemical Biology & Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn, D 53115, Germany
7
Naveja JJ, Vogt M. Automatic Identification of Analogue Series from Large Compound Data Sets: Methods and Applications. Molecules 2021; 26:5291. [PMID: 34500724 PMCID: PMC8433811 DOI: 10.3390/molecules26175291] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 08/06/2021] [Revised: 08/27/2021] [Accepted: 08/28/2021] [Indexed: 01/21/2023] Open
Abstract
Analogue series play a key role in drug discovery. They arise naturally in lead optimization efforts where analogues are explored based on one or a few core structures. However, it is much harder to accurately identify and extract pairs or series of analogue molecules in large compound databases with no predefined core structures. This methodological review outlines the most common and recent methodological developments to automatically identify analogue series in large libraries. Initial approaches focused on using predefined rules to extract scaffold structures, such as the popular Bemis-Murcko scaffold. Later on, the matched molecular pair concept led to efficient algorithms to identify similar compounds sharing a common core structure by exploring many putative scaffolds for each compound. Further developments of these ideas yielded, on the one hand, approaches for hierarchical scaffold decomposition and, on the other hand, algorithms for the extraction of analogue series based on single-site modifications (so-called matched molecular series) by exploring potential scaffold structures based on systematic molecule fragmentation. Eventually, further development of these approaches resulted in methods for extracting analogue series defined by a single core structure with several substitution sites that allow convenient representations, such as R-group tables. These methods enable the efficient analysis of large data sets with hundreds of thousands or even millions of compounds and have spawned many related methodological developments.
Affiliation(s)
- José J. Naveja
  - Instituto de Química, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Martin Vogt
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5-6, 53115 Bonn, Germany
8
Systematic assessment of structure-promiscuity relationships between different types of kinase inhibitors. Bioorg Med Chem 2021; 41:116226. [PMID: 34082305 DOI: 10.1016/j.bmc.2021.116226] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/01/2021] [Revised: 05/17/2021] [Accepted: 05/18/2021] [Indexed: 12/29/2022]
Abstract
Given the increasing quest for selective kinase inhibitors, we have systematically investigated structural and structure-promiscuity relationships between promiscuous kinase inhibitors and other types with increasing potential for selective kinase inhibition. To this end, inhibitors with different modes of action were extracted from X-ray structures of kinase complexes. For more than 18,000 promiscuous kinase inhibitors and 1253 type I1/2, II, and allosteric inhibitors with structurally confirmed mechanisms, analogue space was systematically charted. These inhibitors were active against a total of 426 human kinases. While nearly 80% of the promiscuous inhibitors formed related analogue series, only ~30% of other types of inhibitors were involved in such structural relationships and many of these inhibitors also had multi-kinase activity. Thus, most of the investigated type I1/2, II, and allosteric inhibitors with reported single-kinase activity were distinguished from promiscuous inhibitors, indicating potential for kinase selectivity. Structural relationships between promiscuous inhibitors and the subset of other inhibitors were organized in a matrix format including kinase activity profiles, revealing structure-promiscuity relationships for follow-up investigations.
9
Yoshimori A, Hu H, Bajorath J. Adapting the DeepSARM approach for dual-target ligand design. J Comput Aided Mol Des 2021; 35:587-600. [PMID: 33712972 PMCID: PMC8131309 DOI: 10.1007/s10822-021-00379-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/31/2021] [Accepted: 02/24/2021] [Indexed: 11/29/2022]
Abstract
The structure–activity relationship (SAR) matrix (SARM) methodology and data structure was originally developed to extract structurally related compound series from data sets of any composition, organize these series in matrices reminiscent of R-group tables, and visualize SAR patterns. The SARM approach combines the identification of structural relationships between series of active compounds with analog design, which is facilitated by systematically exploring combinations of core structures and substituents that have not been synthesized. The SARM methodology was extended through the introduction of DeepSARM, which added deep learning and generative modeling to target-based analog design by taking compound information from related targets into account to further increase structural novelty. Herein, we present the foundations of the SARM methodology and discuss how DeepSARM modeling can be adapted for the design of compounds with dual-target activity. Generating dual-target compounds represents an equally attractive and challenging task for polypharmacology-oriented drug discovery. The DeepSARM-based approach is illustrated using a computational proof-of-concept application focusing on the design of candidate inhibitors for two prominent anti-cancer targets.
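As a rough illustration of the matrix idea described in this abstract, the following toy sketch arranges known compounds in a core-by-substituent grid and lists the unfilled cells as virtual candidates. It assumes the compounds have already been decomposed into (core, substituent) labels; the actual SARM method derives these by fragmenting compound structures, which is not reproduced here. The function name `sar_matrix` and the labels `coreA`, `coreB`, `F`, `Cl`, and `OMe` are hypothetical.

```python
from itertools import product


def sar_matrix(compounds):
    """Build a toy core-by-substituent grid from (core, substituent) pairs.

    Returns a nested dict marking which combinations are known, plus a list
    of combinations that are absent from the input (virtual candidates).
    """
    known = set(compounds)
    cores = sorted({core for core, _ in known})
    subs = sorted({sub for _, sub in known})
    # Each cell is True if that core/substituent combination was observed.
    matrix = {
        core: {sub: (core, sub) in known for sub in subs} for core in cores
    }
    # Empty cells correspond to combinations not present in the data set.
    virtual = [pair for pair in product(cores, subs) if pair not in known]
    return matrix, virtual


# Hypothetical input: two cores, three substituents, four known compounds.
known_compounds = [
    ("coreA", "F"),
    ("coreA", "Cl"),
    ("coreB", "F"),
    ("coreB", "OMe"),
]
matrix, virtual = sar_matrix(known_compounds)
print(virtual)
```

Each row of `matrix` plays the role of one analogue series, and `virtual` holds the grid positions a chemist might consider next; a real workflow would operate on chemical structures rather than string labels.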
Affiliation(s)
- Atsushi Yoshimori
  - Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-chome, Fujisawa, Kanagawa, 251-0012, Japan
- Huabin Hu
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
10
Yoshimori A, Bajorath J. The SAR Matrix Method and an Artificially Intelligent Variant for the Identification and Structural Organization of Analog Series, SAR Analysis, and Compound Design. Mol Inform 2020; 39:e2000045. [PMID: 32271994 PMCID: PMC7816269 DOI: 10.1002/minf.202000045] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/16/2020] [Accepted: 04/09/2020] [Indexed: 11/26/2022]
Abstract
The SAR Matrix (SARM) approach was originally conceived for the systematic identification of analog series, their structural organization, and graphical structure-activity relationship (SAR) analysis. For structurally related series, SARMs also produce virtual candidate compounds. Hence, SARM represents a unique computational approach establishing a direct link between SAR visualization and compound design. The SARM data structure is reminiscent of R-group tables and hence easily accessible from a chemical perspective, although the underlying algorithmic basis is complex. The SARM concept has been extended in different ways to further increase its analytical and design capacity. While the efforts were largely driven from a research perspective, they have also increased the utility for practical applications. Among others, extensions include approaches for SARM-based compound activity prediction, the generation of a large SARM database for analog searching, and the design of a deep learning architecture for advanced analog design taking chemical space information for target families into account. Herein, the SARM approach and its extensions are discussed within their scientific context.
Affiliation(s)
- Atsushi Yoshimori
  - Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-chome, Fujisawa, Kanagawa, 251-0012, Japan
- Jürgen Bajorath
  - Department of Life Science Informatics, Bonn-Aachen International Center for Information Technology, Rheinische Friedrich-Wilhelms-Universität Bonn, Endenicher Allee 19c, D-53115 Bonn, Germany