1
An H, Liu X, Cai W, Shao X. Explainable Graph Neural Networks with Data Augmentation for Predicting pKa of C-H Acids. J Chem Inf Model 2024; 64:2383-2392. [PMID: 37706462 DOI: 10.1021/acs.jcim.3c00958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2023]
Abstract
The pKa of C-H acids is an important parameter in organic synthesis, drug discovery, and materials science. However, predicting pKa remains a great challenge because experimental data are limited and chemical insight is lacking. Here, a new model for predicting the pKa values of C-H acids is proposed on the basis of graph neural networks (GNNs) and data augmentation. A message passing unit (MPU) was used to extract the topological and target-related information from the molecular graph data, and a readout layer was used to retrieve the information on the C atom at the ionization site. The retrieved information was then used to predict pKa with a fully connected network. Furthermore, to increase the diversity of the training data, a knowledge-infused data augmentation technique was established by replacing the H atoms in a molecule with substituents exhibiting different electronic effects. The MPU was pretrained with the augmented data. The efficacy of data augmentation was confirmed by visualizing the distribution of compounds with different substituents and by classifying compounds. The explainability of the model was studied by examining the change in the predicted pKa value when a specific atom was masked. This explainability was used to identify the key substituents for pKa. The model was evaluated on two data sets from the iBonD database. Dataset1 includes the experimental pKa values of C-H acids measured in DMSO, while dataset2 comprises the pKa values measured in water. The results show that the knowledge-infused data augmentation technique greatly improves the predictive accuracy of the model, especially when the number of samples is small.
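The masking-based explanation mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `toy_pka_model` is a hypothetical stand-in for a trained predictor, the `withdrawing` feature and its values are invented, and zeroing a feature is just one possible way to "mask" an atom. It is not the authors' implementation.

```python
from typing import Callable, Dict, List

Atom = Dict[str, float]


def toy_pka_model(atoms: List[Atom]) -> float:
    """Hypothetical predictor: a fixed baseline lowered by each atom's
    electron-withdrawing strength (invented feature, not from the paper)."""
    return 40.0 - sum(a["withdrawing"] for a in atoms)


def masking_attribution(
    predict: Callable[[List[Atom]], float], atoms: List[Atom]
) -> List[float]:
    """Return, per atom, the change in prediction when that atom is masked."""
    baseline = predict(atoms)
    scores = []
    for i in range(len(atoms)):
        # Mask atom i by zeroing its feature; all other atoms stay unchanged.
        masked = [
            dict(a, withdrawing=0.0) if j == i else a
            for j, a in enumerate(atoms)
        ]
        scores.append(predict(masked) - baseline)
    return scores


# Invented three-atom example: index 0 plays the ionizable carbon.
atoms = [{"withdrawing": 0.0}, {"withdrawing": 12.0}, {"withdrawing": 3.0}]
scores = masking_attribution(toy_pka_model, atoms)
key_atom = max(range(len(scores)), key=lambda i: abs(scores[i]))
```

In this toy setting, the atom whose masking shifts the prediction most (`key_atom`) is the one flagged as the key substituent, which mirrors the idea of ranking atoms by the pKa change they cause.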
Affiliation(s)
- Hongle An
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xuyang Liu
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Wensheng Cai
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xueguang Shao
- Research Center for Analytical Sciences, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, College of Chemistry, Nankai University, Tianjin 300071, China
- Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
2
Wang D, Li W, Dong X, Li H, Hu L. TFRegNCI: Interpretable Noncovalent Interaction Correction Multimodal Based on Transformer Encoder Fusion. J Chem Inf Model 2023; 63:782-793. [PMID: 36652718 DOI: 10.1021/acs.jcim.2c01283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/20/2023]
Abstract
Interpretability is an important issue for end-to-end learning models. Motivated by computer vision algorithms, an interpretable multimodal model for noncovalent interaction (NCI) correction, TFRegNCI, is proposed for NCI prediction. TFRegNCI is based on RegNet feature extraction and a transformer encoder fusion strategy. RegNet is a network design paradigm that mainly focuses on local features. Meanwhile, the Vision Transformer is also leveraged for feature extraction, because it can capture global features better than RegNet while lowering the computational cost. Using a transformer encoder as the fusion strategy rather than a multilayer perceptron can enhance model performance, owing to its emphasis on important features with fewer parameters. Therefore, the proposed TFRegNCI achieved highly accurate predictions (mean absolute error of ∼0.1 kcal/mol) compared with the coupled cluster singles and doubles with perturbative triples (CCSD(T)) benchmark. To further improve model efficiency, TFRegNCI uses two-dimensional (2D) inputs transformed from three-dimensional (3D) electron density cubes, which saves time (30%) while maintaining model accuracy. To improve model interpretability, a visualization module, Gradient-weighted Regression Activation Mapping (Grad-RAM), has been embedded. Grad-RAM is adapted from the classification algorithm Gradient-weighted Class Activation Mapping to perform feature visualization for the regression task. With Grad-RAM, the visual location map for features in deep learning models can be displayed. The feature map visualizations suggest that the 2D model performs similarly to the 3D model, because both extract features from the electron density equally effectively. Moreover, the valid feature region on the location map produced by the 3D model is consistent with the NCIPLOT NCI isosurface, confirming that the model does extract significant features related to NCIs. The interpretability analysis is carried out through the molecular orbital contributions to the effective features. The proposed model is therefore likely to be a promising tool for revealing essential information on NCIs at the electronic level.
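As a rough illustration of a gradient-weighted activation map for a regression output, the sketch below follows the general Grad-CAM recipe: each feature channel is weighted by the spatial mean of the gradient of the predicted scalar with respect to that channel, the weighted channels are summed, and negative values are clipped. The function name, the toy 2×2 feature maps, the hand-written gradients, and the final clipping step are all assumptions; the abstract does not specify how Grad-RAM handles these details.

```python
from typing import List

Grid = List[List[float]]


def grad_weighted_map(feature_maps: List[Grid], grads: List[Grid]) -> Grid:
    """Grad-CAM-style map for a scalar (regression) output.

    feature_maps[k] is channel k of a convolutional layer; grads[k] holds the
    gradient of the predicted value with respect to that channel.
    """
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weight: spatial average of the gradient for that channel.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in grads]
    heat = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, feature_maps):
        for i in range(h):
            for j in range(w):
                heat[i][j] += alpha * fmap[i][j]
    # Clip negatives (assumed here, as in Grad-CAM) to keep supporting regions.
    return [[max(0.0, v) for v in row] for row in heat]


# Invented two-channel example with constant gradients per channel.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-0.25, -0.25], [-0.25, -0.25]]]
heatmap = grad_weighted_map(feature_maps, grads)
```

In a real model the gradients would come from automatic differentiation rather than being written by hand; the point of the sketch is only the weighting and summation step.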
Affiliation(s)
- Donghan Wang
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Wenze Li
- College of Computer and Information Engineering, Henan Normal University, Henan, Xinxiang 453007, China
- Xu Dong
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Hongzhi Li
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- LiHong Hu
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
3
Wei GW, Soares TA, Wahab H, Zhu F. Computational Chemistry in Asia. J Chem Inf Model 2022; 62:5035-5037. [DOI: 10.1021/acs.jcim.2c01050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
4
Duan C, Liu X, Cai W, Shao X. Spectral Encoder to Extract the Features of Near-Infrared Spectra for Multivariate Calibration. J Chem Inf Model 2022; 62:3695-3703. [PMID: 35916486 DOI: 10.1021/acs.jcim.2c00786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
An autoencoder architecture was adopted for near-infrared (NIR) spectral analysis by extracting the common features in the spectra. Three autoencoder-based networks with different purposes were constructed. First, a spectral encoder was established by training the network with a set of spectra as the input. The features of the spectra can be encoded by the nodes in the bottleneck layer, which in turn can be used to build a sparse and robust model. Second, taking the spectra of one instrument as the input and those of another instrument as the reference output, the common features in both sets of spectra can be obtained in the bottleneck layer. Therefore, in the prediction step, the spectral features of the second instrument can be predicted by taking the reverse of the decoder as the encoder. Furthermore, transfer learning was used to build the model for the spectra of more instruments by fine-tuning the trained network. NIR datasets of plant, wheat, and pharmaceutical tablet samples measured on multiple instruments were used to test the method. The multi-linear regression (MLR) model with the encoded features was found to have similar or slightly better prediction performance than the partial least-squares (PLS) model.
Affiliation(s)
- Chaoshu Duan
- Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300071, China; Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xuyang Liu
- Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300071, China; Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Wensheng Cai
- Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300071, China; Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
- Xueguang Shao
- Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Tianjin Key Laboratory of Biosensing and Molecular Recognition, State Key Laboratory of Medicinal Chemical Biology, Nankai University, Tianjin 300071, China; Haihe Laboratory of Sustainable Chemical Transformations, Tianjin 300192, China
5
da Costa CHS, de Freitas CAB, Alves CN, Lameira J. Assessment of mutations on RBD in the Spike protein of SARS-CoV-2 Alpha, Delta and Omicron variants. Sci Rep 2022; 12:8540. [PMID: 35595778 PMCID: PMC9121086 DOI: 10.1038/s41598-022-12479-9] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 02/27/2022] [Accepted: 05/03/2022] [Indexed: 12/15/2022]
Abstract
The severe acute respiratory syndrome (SARS) coronavirus 2 (CoV-2) variant Omicron spreads more rapidly than the other variants of the SARS-CoV-2 virus. Mutations on the Spike (S) protein receptor-binding domain (RBD) are critical for the antibody resistance and infectivity of the SARS-CoV-2 variants. In this study, we have used accelerated molecular dynamics (aMD) simulations and free energy calculations to present a systematic analysis of the affinity and conformational dynamics, along with the interactions that drive the binding between the Spike protein RBD and the human angiotensin-converting enzyme 2 (ACE2) receptor. We evaluate the impact of the key mutations that occur in the RBDs of Omicron and other variants on binding to the human ACE2 receptor. The results show that the Omicron S protein binds more strongly to ACE2 than those of the other variants. The evaluation of the decomposition energy per residue shows that the mutations N440K, T478K, Q493R and Q498R observed in the Spike protein of SARS-CoV-2 provide a stabilizing effect on the interaction between the SARS-CoV-2 RBD and ACE2. Overall, the results demonstrate that the faster spreading of SARS-CoV-2 Omicron may be correlated with the binding affinity of the S protein RBD to ACE2 and with mutations of uncharged residues to positively charged residues such as Lys and Arg at key positions in the RBD.
Affiliation(s)
- Clauber Henrique Souza da Costa
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Universidade Federal do Pará, Rua Augusto Correa S/N, Belém, PA, Brazil
- Camila Auad Beltrão de Freitas
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Universidade Federal do Pará, Rua Augusto Correa S/N, Belém, PA, Brazil
- Cláudio Nahum Alves
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Universidade Federal do Pará, Rua Augusto Correa S/N, Belém, PA, Brazil
- Jerônimo Lameira
- Laboratório de Planejamento e Desenvolvimento de Fármacos, Universidade Federal do Pará, Rua Augusto Correa S/N, Belém, PA, Brazil