1
Abarbanel OD, Hutchison GR. QupKake: Integrating Machine Learning and Quantum Chemistry for Micro-pKa Predictions. J Chem Theory Comput 2024. [PMID: 38832803 DOI: 10.1021/acs.jctc.4c00328] [Indexed: 06/05/2024]
Abstract
Accurate prediction of micro-pKa values is crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. This work introduces QupKake, a novel method that combines graph neural network models with semiempirical quantum mechanical (QM) features to achieve exceptional accuracy and generalization in micro-pKa prediction. QupKake outperforms state-of-the-art models on a variety of benchmark data sets, with root-mean-square errors between 0.5 and 0.8 pKa units on five external test sets. Feature importance analysis reveals the crucial role of QM features in both the reaction site enumeration and micro-pKa prediction models. QupKake represents a significant advancement in micro-pKa prediction, offering a powerful tool for various applications in chemistry and beyond.
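The root-mean-square error quoted in this abstract can be illustrated with a minimal Python sketch. The function name `rmse` and the sample values are assumptions made for illustration; they are not taken from the paper or from the QupKake code.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between two equal-length sequences of pKa values."""
    if len(predicted) != len(experimental) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, for illustration only (not data from the paper).
predicted = [4.2, 9.8, 7.1]
experimental = [4.0, 10.5, 7.4]
print(f"RMSE = {rmse(predicted, experimental):.2f} pKa units")
```

Because the errors are squared before averaging, a single large outlier raises the RMSE more than it would raise a mean absolute error.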
Affiliation(s)
- Omri D Abarbanel
- Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, Pennsylvania 15260, United States
- Geoffrey R Hutchison
- Department of Chemistry, University of Pittsburgh, 219 Parkman Avenue, Pittsburgh, Pennsylvania 15260, United States
- Department of Chemical and Petroleum Engineering, University of Pittsburgh, 3700 O'Hara Street, Pittsburgh, Pennsylvania 15261, United States
2
An H, Liu X, Cai W, Shao X. Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C-H Acids. J Chem Inf Model 2024; 64:2383-2392. [PMID: 37706462 DOI: 10.1021/acs.jcim.3c00958] [Indexed: 09/15/2023]
Abstract
The pKa of C-H acids is an important parameter in organic synthesis, drug discovery, and materials science. However, predicting pKa remains a great challenge because experimental data are limited and chemical insight is lacking. Here, a new model for predicting the pKa values of C-H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract the topological and target-related information from the molecular graph data, and a readout layer was used to retrieve the information on the C atom at the ionization site. The retrieved information was then used to predict pKa with a fully connected network. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change in predicted pKa values when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C-H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
Affiliation(s)
- Hongle An
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xuyang Liu
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Wensheng Cai
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xueguang Shao
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
3
Kołodziejczyk A, Wróblewska A, Pietrzak M, Pyrcz P, Błaziak K, Szmigielski R. Dissociation constants of relevant secondary organic aerosol components in the atmosphere. Chemosphere 2024; 351:141166. [PMID: 38224752 DOI: 10.1016/j.chemosphere.2024.141166] [Received: 09/07/2023] [Revised: 01/07/2024] [Accepted: 01/08/2024] [Indexed: 01/17/2024]
Abstract
The presented studies focus on determining the acidity constants (pKa) of relevant secondary organic aerosol components. For our research, we selected important oxidation products (mainly carboxylic acids) of the most abundant terpene compounds, such as α-pinene, β-pinene, β-caryophyllene, and δ-3-carene. The research covered the synthesis of the selected compounds and the determination of their acidity constants. We used three methods to measure the acidity constant: 1H NMR titration, pH-metric titration, and the Bates-Schwarzenbach spectrophotometric method. Moreover, the pKa values were calculated with Marvin 21.17.0 software to compare the experimentally derived values with those calculated from the chemical structure. pKa values measured with 1H NMR titration ranged from 3.51 ± 0.01 for terebic acid to 5.18 ± 0.06 for β-norcaryophyllonic acid. The data determined by the 1H NMR method correlated well with the data obtained with the commonly used potentiometric and UV-spectroscopic methods (R² = 0.92). In contrast, the comparison with the in silico results showed a relatively low correlation (R² = 0.66 for Marvin). We found that most of the values calculated with the Marvin program are lower than the experimental values obtained with pH-metric titration, with an average difference of 0.44 pKa units. For di- and tricarboxylic acids, we obtained two and three pKa values, respectively. Good agreement with literature values was observed; for example, Howell and Fisher (1958) used pH-metric titration and measured pKa1 and pKa2 to be 4.48 and 5.48, while our results are 4.24 ± 0.10 and 5.40 ± 0.02, respectively.
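The two comparison statistics mentioned in this abstract, a coefficient of determination and an average difference between experimental and calculated values, can be sketched in Python as follows. The abstract does not say how R² was computed, so the squared Pearson correlation used here is an assumption, as are the function names and the sample values.

```python
def mean_signed_difference(experimental, calculated):
    """Average of (experimental - calculated); positive when calculations run low."""
    if len(experimental) != len(calculated) or len(experimental) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(e - c for e, c in zip(experimental, calculated)) / len(experimental)


def r_squared(x, y):
    """Squared Pearson correlation coefficient (assumed definition of R²)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy ** 2 / (s_xx * s_yy)


# Hypothetical pKa values, for illustration only (not data from the paper).
experimental = [3.5, 4.3, 4.8, 5.2]
calculated = [3.2, 3.9, 4.1, 4.9]
print(f"mean difference = {mean_signed_difference(experimental, calculated):.2f}")
print(f"R^2 = {r_squared(experimental, calculated):.2f}")
```

A positive mean signed difference corresponds to the situation described in the abstract, where the calculated values tend to fall below the measured ones.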
Affiliation(s)
- Agata Kołodziejczyk
- Institute of Physical Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, 01-224, Warsaw, Poland
- Aleksandra Wróblewska
- Institute of Physical Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, 01-224, Warsaw, Poland
- Mariusz Pietrzak
- Institute of Physical Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, 01-224, Warsaw, Poland
- Patryk Pyrcz
- Institute of Physical Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, 01-224, Warsaw, Poland
- Kacper Błaziak
- Faculty of Chemistry, University of Warsaw, ul. Pasteura 1, 01-224, Warsaw, Poland; Biological and Chemical Research Center, University of Warsaw, ul. Żwirki i Wigury 101, 01-224, Warsaw, Poland
- Rafał Szmigielski
- Institute of Physical Chemistry, Polish Academy of Sciences, ul. Kasprzaka 44/52, 01-224, Warsaw, Poland
4
Sanchez AJ, Maier S, Raghavachari K. Leveraging DFT and Molecular Fragmentation for Chemically Accurate pKa Prediction Using Machine Learning. J Chem Inf Model 2024; 64:712-723. [PMID: 38301279 DOI: 10.1021/acs.jcim.3c01923] [Indexed: 02/03/2024]
Abstract
We present a quantum mechanical/machine learning (ML) framework based on random forest to accurately predict the pKa values of complex organic molecules using inexpensive density functional theory (DFT) calculations. By including physics-based features from low-level DFT calculations and structural features from our connectivity-based hierarchy (CBH) fragmentation protocol, we can correct the systematic error associated with DFT. The generalizability and performance of our model are evaluated on two benchmark sets (SAMPL6 and Novartis). We believe the carefully curated input of physics-based features lessens the model's data dependence and its need for complex deep learning architectures, without compromising accuracy on the test sets. As a point of novelty, our work extends the applicability of CBH, employing it to generate viable molecular descriptors for ML.
Affiliation(s)
- Alec J Sanchez
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
- Sarah Maier
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
- Krishnan Raghavachari
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
5
An Accurate Approach for Computational pKa Determination of Phenolic Compounds. Molecules (Basel, Switzerland) 2022; 27:8590. [PMID: 36500683 PMCID: PMC9736058 DOI: 10.3390/molecules27238590] [Received: 11/15/2022] [Revised: 11/30/2022] [Accepted: 12/01/2022] [Indexed: 12/12/2022]
Abstract
Computational chemistry is a valuable tool, as it allows for in silico prediction of key parameters of novel compounds, such as pKa. For computational pKa determination, the literature offers several approaches based on different levels of theory, functionals, and continuum solvation models. However, correction factors are often needed to obtain models that predict pKa reliably. In this work, an accurate protocol based on a direct approach is proposed for computing the pKa of phenols. Importantly, this methodology does not require correction factors or mathematical fitting, making it highly practical, easy to use, and fast. Above all, DFT calculations performed in the presence of two explicit water molecules using the CAM-B3LYP functional with the 6-311G+dp basis set and a solvation model based on density (SMD) led to accurate pKa values. In particular, calculations performed on a series of 13 differently substituted phenols provided reliable results, with a mean absolute error of 0.3. Furthermore, the model achieves accurate results with -CN and -NO2 substituents, which are usually excluded from computational pKa studies, enabling easy and reliable pKa determination for a wide range of phenols.
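In a direct approach, the pKa is typically obtained from the computed free energy of deprotonation in solution through the standard thermodynamic relation pKa = ΔG / (RT ln 10). The abstract does not spell out this formula or report any free energies, so the following Python sketch rests on that assumption; the function name and the input value are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def pka_from_delta_g(delta_g_kcal_mol, temperature_k=298.15):
    """Convert an aqueous deprotonation free energy (kcal/mol) into a pKa.

    Uses pKa = dG / (R * T * ln 10). How dG is assembled (explicit waters,
    proton reference, standard-state terms) is left to the caller.
    """
    return delta_g_kcal_mol / (R_KCAL * temperature_k * math.log(10))


# Hypothetical free energy, for illustration only (not a value from the paper).
print(f"pKa = {pka_from_delta_g(13.642):.2f}")
```

At 298.15 K the factor RT ln 10 is about 1.36 kcal/mol, so an error of that size in the computed free energy shifts the predicted pKa by roughly one unit.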