1. Abarbanel OD, Hutchison GR. QupKake: Integrating Machine Learning and Quantum Chemistry for Micro-pKa Predictions. J Chem Theory Comput 2024; 20:6946-6956. [PMID: 38832803 PMCID: PMC11325546 DOI: 10.1021/acs.jctc.4c00328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2024]
Abstract
Accurate prediction of micro-pKa values is crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. This work introduces QupKake, a novel method that combines graph neural network models with semiempirical quantum mechanical (QM) features to achieve exceptional accuracy and generalization in micro-pKa prediction. QupKake outperforms state-of-the-art models on a variety of benchmark data sets, with root-mean-square errors between 0.5 and 0.8 pKa units on five external test sets. Feature importance analysis reveals the crucial role of QM features in both the reaction site enumeration and micro-pKa prediction models. QupKake represents a significant advancement in micro-pKa prediction, offering a powerful tool for various applications in chemistry and beyond.
Affiliation(s)
- Omri D Abarbanel
  - Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, Pennsylvania 15260, United States
- Geoffrey R Hutchison
  - Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, Pennsylvania 15260, United States
  - Department of Chemical and Petroleum Engineering, University of Pittsburgh, 3700 O'Hara Street, Pittsburgh, Pennsylvania 15261, United States

2. Mansouri K, Moreira-Filho JT, Lowe CN, Charest N, Martin T, Tkachenko V, Judson R, Conway M, Kleinstreuer NC, Williams AJ. Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling. J Cheminform 2024; 16:19. [PMID: 38378618 PMCID: PMC10880251 DOI: 10.1186/s13321-024-00814-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/10/2024] [Indexed: 02/22/2024] [Open Access]
Abstract
The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, the common concern is the quality of both the chemical structure information and associated experimental data. This is especially true when those data are collected from multiple sources as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read, and then the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative modeling QSAR projects to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready structures" to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry. Both QSAR and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as docker container resources for the scientific community.
Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes molecular descriptors' accuracy and reliability. The freely available resources in KNIME, GitHub, and docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
Affiliation(s)
- Kamel Mansouri
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- José T Moreira-Filho
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Charles N Lowe
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Nathaniel Charest
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Todd Martin
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Richard Judson
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Mike Conway
  - National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Nicole C Kleinstreuer
  - National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Antony J Williams
  - Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA

3. Choung OH, Vianello R, Segler M, Stiefl N, Jiménez-Luna J. Extracting medicinal chemistry intuition via preference machine learning. Nat Commun 2023; 14:6651. [PMID: 37907461 PMCID: PMC10618272 DOI: 10.1038/s41467-023-42242-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 09/21/2023] [Indexed: 11/02/2023] [Open Access]
Abstract
The lead optimization process in drug discovery campaigns is an arduous endeavour where the input of many medicinal chemists is weighed in order to reach a desired molecular property profile. Building the expertise to successfully drive such projects collaboratively is a very time-consuming process that typically spans many years within a chemist's career. In this work we aim to replicate this process by applying artificial intelligence learning-to-rank techniques on feedback that was obtained from 35 chemists at Novartis over the course of several months. We exemplify the usefulness of the learned proxies in routine tasks such as compound prioritization, motif rationalization, and biased de novo drug design. Annotated response data is provided, and developed models and code made available through a permissive open-source license.
Affiliation(s)
- Oh-Hyeon Choung
  - Novartis Institutes for Biomedical Research, 4002, Basel, Switzerland
- Riccardo Vianello
  - Novartis Institutes for Biomedical Research, 4002, Basel, Switzerland
- Marwin Segler
  - Microsoft Research AI4Science, CB1 2FB, Cambridge, UK
- Nikolaus Stiefl
  - Novartis Institutes for Biomedical Research, 4002, Basel, Switzerland

4. Kappler MA, Lowden CT, Culberson J. BioChemUDM: a unified data model for compounds and assays. Pure Appl Chem 2022. [DOI: 10.1515/pac-2021-1004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
We present a simple, biochemistry data model (BioChemUDM) to represent compounds and assays for the purpose of capturing, reporting, and sharing data, both biological and chemical. We describe an approach to register a compound based solely on a stereo-enhanced sketch, thereby replacing the need for additional user-specified “flags” at the time of compound registration. We describe a convention for string-based labels that enables inter-organizational compound and assay data sharing. By co-adopting the BioChemUDM, we have successfully enabled same-day exchange and utilization of chemical and biological information with various stakeholders.
Affiliation(s)
- Michael A. Kappler
  - IDEAYA Biosciences Inc, 7000 Shoreline Blvd Ste 350, South San Francisco, CA 94080, USA
- J. Chris Culberson
  - Workflow Informatics Corp, 9316 Bramden Ct, Wake Forest, NC 27587, USA

5. Ertl P, Gerebtzoff G, Lewis RA, Muenkler H, Schneider N, Sirockin F, Stiefl N, Tosco P. Chemical reactivity prediction: current methods and different application areas. Mol Inform 2021; 41:e2100277. [PMID: 34964302 DOI: 10.1002/minf.202100277] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/14/2021] [Accepted: 12/28/2021] [Indexed: 11/10/2022]
Abstract
The ability to predict chemical reactivity of a molecule is highly desirable in drug discovery, both ex vivo (synthetic route planning, formulation, stability) and in vivo: metabolic reactions determine pharmacodynamics, pharmacokinetics and potential toxic effects, and early assessment of liabilities is vital to reduce attrition rates in later stages of development. Quantum mechanics offer a precise description of the interactions between electrons and orbitals in the breaking and forming of new bonds. Modern algorithms and faster computers have allowed the study of more complex systems in a punctual and accurate fashion, and answers for chemical questions around stability and reactivity can now be provided. Through machine learning, predictive models can be built out of descriptors derived from quantum mechanics and cheminformatics, even in the absence of experimental data to train on. In this article, current progress on computational reactivity prediction is reviewed: applications to problems in drug design, such as modelling of metabolism and covalent inhibition, are highlighted and unmet challenges are posed.
Affiliation(s)
- Richard A Lewis
  - Computer-Aided Drug Design, Eli Lilly and Company Limited, Windlesham, Switzerland
- Hagen Muenkler
  - Novartis Institutes for BioMedical Research Inc, Switzerland
- Paolo Tosco
  - Novartis Institutes for BioMedical Research Inc, Switzerland