1
|
Sushko I, Novotarskyi S, Körner R, Pandey AK, Rupp M, Teetz W, Brandmaier S, Abdelaziz A, Prokopenko VV, Tanchuk VY, Todeschini R, Varnek A, Marcou G, Ertl P, Potemkin V, Grishina M, Gasteiger J, Schwab C, Baskin II, Palyulin VA, Radchenko EV, Welsh WJ, Kholodovych V, Chekmarev D, Cherkasov A, Aires-de-Sousa J, Zhang QY, Bender A, Nigsch F, Patiny L, Williams A, Tkachenko V, Tetko IV. Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information. J Comput Aided Mol Des 2011; 25:533-54. [PMID: 21660515 PMCID: PMC3131510 DOI: 10.1007/s10822-011-9440-2] [Citation(s) in RCA: 389] [Impact Index Per Article: 27.8] [Received: 02/22/2011] [Accepted: 05/24/2011] [Indexed: 11/25/2022]
Abstract
The Online Chemical Modeling Environment (OCHEM) is a web-based platform that aims to automate and simplify the typical steps required for QSAR modeling. The platform consists of two major subsystems: the database of experimental measurements and the modeling framework. The user-contributed database contains a set of tools for easy input, search and modification of thousands of records. The OCHEM database is based on the wiki principle and focuses primarily on the quality and verifiability of the data. The database is tightly integrated with the modeling framework, which supports all the steps required to create a predictive model: data search, calculation and selection of a vast variety of molecular descriptors, application of machine learning methods, validation, analysis of the model and assessment of the applicability domain. In contrast to other similar systems, OCHEM is not intended to re-implement existing tools or models but rather to invite the original authors to contribute their results, make them publicly available, share them with other users and become members of the growing research community. Our intention is to make OCHEM a widely used platform for performing QSPR/QSAR studies online and sharing them with other users on the Web. The ultimate goal of OCHEM is to collect all possible chemoinformatics tools within one simple, reliable and user-friendly resource. OCHEM is free for web users and is available online at http://www.ochem.eu.
|
Research Support, Non-U.S. Gov't |
14 |
389 |
2
|
Pereira F, Xiao K, Latino DARS, Wu C, Zhang Q, Aires-de-Sousa J. Machine Learning Methods to Predict Density Functional Theory B3LYP Energies of HOMO and LUMO Orbitals. J Chem Inf Model 2016; 57:11-21. [PMID: 28033004 DOI: 10.1021/acs.jcim.6b00340] [Citation(s) in RCA: 87] [Impact Index Per Article: 9.7] [Indexed: 11/30/2022]
Abstract
Machine learning algorithms were explored for the fast estimation of HOMO and LUMO orbital energies calculated by DFT B3LYP, on the basis of molecular descriptors exclusively based on connectivity. The whole project involved the retrieval and generation of molecular structures, quantum chemical calculations for a database with >111 000 structures, development of new molecular descriptors, and training/validation of machine learning models. Several machine learning algorithms were screened, and an applicability domain was defined based on Euclidean distances to the training set. Random forest models predicted an external test set of 9989 compounds with mean absolute errors (MAE) as low as 0.15 and 0.16 eV for the HOMO and LUMO orbitals, respectively. The impact of the quantum chemical calculation protocol was assessed with a subset of compounds. Inclusion of the orbital energy calculated by PM7 as an additional descriptor significantly improved the quality of estimations (reducing the MAE by >30%).
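The applicability domain in this abstract is defined from Euclidean distances to the training set, but the exact rule is not stated here. A minimal sketch of the general idea follows, assuming a nearest-neighbour distance and a fixed cutoff; the function names, the toy descriptor vectors and the threshold value are illustrative assumptions, not the authors' protocol.

```python
import math

def nearest_train_distance(train, query):
    """Euclidean distance from each query vector to its closest training vector."""
    return [min(math.dist(q, t) for t in train) for q in query]

def in_domain(train, query, threshold):
    """Flag query vectors whose nearest training neighbour lies within `threshold`.

    The single-neighbour rule and the cutoff are assumptions made for illustration.
    """
    return [d <= threshold for d in nearest_train_distance(train, query)]

# Toy 2D "descriptor" vectors (hypothetical values)
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
query = [(0.1, 0.1), (5.0, 5.0)]

mask = in_domain(train, query, threshold=0.5)
print(mask)  # the first query is close to the training set, the second is far away
```

In practice the cutoff would be tuned on validation data, and predictions outside the domain would be reported with lower confidence.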
|
Research Support, Non-U.S. Gov't |
9 |
87 |
3
|
Pereira F, Aires-de-Sousa J. Computational Methodologies in the Exploration of Marine Natural Product Leads. Mar Drugs 2018; 16:md16070236. [PMID: 30011882 PMCID: PMC6070892 DOI: 10.3390/md16070236] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Received: 06/14/2018] [Revised: 07/02/2018] [Accepted: 07/06/2018] [Indexed: 12/18/2022] Open
Abstract
Computational methodologies are assisting the exploration of marine natural products (MNPs) to make the discovery of new leads more efficient, to repurpose known MNPs, to target new metabolites on the basis of genome analysis, to reveal mechanisms of action, and to optimize leads. In silico efforts in drug discovery of NPs have mainly focused on two tasks: dereplication and prediction of bioactivities. The exploration of new chemical spaces and the application of predicted spectral data must be included in new approaches to select species, extracts, and growth conditions with maximum probabilities of medicinal chemistry novelty. In this review, the most relevant current computational dereplication methodologies are highlighted. Structure-based (SB) and ligand-based (LB) chemoinformatics approaches have become essential tools for the virtual screening of NPs either in small datasets of isolated compounds or in large-scale databases. The most common LB techniques include Quantitative Structure–Activity Relationships (QSAR), estimation of drug likeness, prediction of absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties, similarity searching, and pharmacophore identification. Analogously, molecular dynamics, docking and binding cavity analysis have been used in SB approaches. Their significance and achievements are the main focus of this review.
|
Review |
7 |
57 |
4
|
Qu X, Latino DA, Aires-de-Sousa J. A big data approach to the ultra-fast prediction of DFT-calculated bond energies. J Cheminform 2013; 5:34. [PMID: 23849655 PMCID: PMC3720218 DOI: 10.1186/1758-2946-5-34] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 04/18/2013] [Accepted: 07/08/2013] [Indexed: 11/26/2022] Open
Abstract
Background: Rapid access to intrinsic physicochemical properties of molecules is highly desired for large-scale chemical data mining explorations such as mass spectrum prediction in metabolomics, toxicity risk assessment and drug discovery. Large volumes of data are being produced by quantum chemistry calculations, which provide increasingly accurate estimations of several properties, e.g. by Density Functional Theory (DFT), but are still too computationally expensive for those large-scale uses. This work explores the possibility of using large amounts of data generated by DFT methods for thousands of molecular structures, extracting relevant molecular properties and applying machine learning (ML) algorithms to learn from the data. Once trained, these ML models can be applied to new structures to produce ultra-fast predictions. An approach is presented for homolytic bond dissociation energy (BDE).
Results: Machine learning models were trained with a data set of >12,000 BDEs calculated by B3LYP/6-311++G(d,p)//DFTB. Descriptors were designed to encode atom types and connectivity in the 2D topological environment of the bonds. The best model, an Associative Neural Network (ASNN) based on 85 bond descriptors, was able to predict the BDE of 887 bonds in an independent test set (covering a range of 17.67–202.30 kcal/mol) with RMSD of 5.29 kcal/mol, mean absolute deviation of 3.35 kcal/mol, and R² = 0.953. The predictions were compared with semi-empirical PM6 calculations, and were found to be superior for all types of bonds in the data set, except for O-H, N-H, and N-N bonds. The B3LYP/6-311++G(d,p)//DFTB calculations can approach the higher-level calculations B3LYP/6-311++G(3df,2p)//B3LYP/6-31G(d,p) with an RMSD of 3.04 kcal/mol, which is less than the RMSD of ASNN (against both DFT methods). An experimental web service for on-line prediction of BDEs is available at http://joao.airesdesousa.com/bde.
Conclusion: Knowledge could be automatically extracted by machine learning techniques from a data set of calculated BDEs, providing ultra-fast access to accurate estimations of DFT-calculated BDEs. This demonstrates how to extract value from the large volumes of data currently being produced by quantum chemistry calculations at an increasing speed, mostly without human intervention. In this way, high-level theoretical quantum calculations can be used in large-scale applications that otherwise could not afford the intrinsic computational cost.
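The test-set figures quoted in this abstract (RMSD, mean absolute deviation and R²) can be computed from paired reference and predicted values. The sketch below shows one way to do so; the toy numbers are hypothetical, and treating R² as the squared Pearson correlation is an assumption, since the abstract does not say which definition was used.

```python
import math

def rmsd(y_true, y_pred):
    """Root mean square deviation between reference and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mean_abs_dev(y_true, y_pred):
    """Mean absolute deviation between reference and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Squared Pearson correlation (one common meaning of R²; assumed here)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Hypothetical BDE values in kcal/mol, for illustration only
y_true = [100.0, 110.0, 120.0, 130.0]
y_pred = [102.0, 108.0, 123.0, 129.0]

print(rmsd(y_true, y_pred), mean_abs_dev(y_true, y_pred), r_squared(y_true, y_pred))
```

RMSD penalises large individual errors more strongly than the mean absolute deviation, which is why it is the larger of the two numbers in the abstract.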
|
Journal Article |
12 |
34 |
5
|
Aires-de-Sousa J, Gasteiger J. New description of molecular chirality and its application to the prediction of the preferred enantiomer in stereoselective reactions. J Chem Inf Comput Sci 2001; 41:369-75. [PMID: 11277725 DOI: 10.1021/ci000125n] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
A new representation of molecular chirality as a fixed-length code is introduced. This code describes chiral carbon atoms using atomic properties and geometrical features independent of conformation and is able to distinguish between enantiomers. It was used as input to counterpropagation (CPG) neural networks in two different applications. In the case of a catalytic enantioselective reaction, the CPG network established a correlation between the chirality codes of the catalysts and the major enantiomer obtained by the reaction. In the second application, the enantioselective reduction of ketones by DIP-chloride, the series of major and minor enantiomers produced from different substrates were clustered by the CPG neural network into separate regions, one characteristic of the minor products and the other characteristic of the major products.
|
|
24 |
32 |
6
|
Caetano S, Aires-de-Sousa J, Daszykowski M, Heyden YV. Prediction of enantioselectivity using chirality codes and Classification and Regression Trees. Anal Chim Acta 2005. [DOI: 10.1016/j.aca.2004.12.012] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.6] [Indexed: 11/29/2022]
|
|
20 |
32 |
7
|
Carrera GV, Branco LC, Aires-de-Sousa J, Afonso CA. Exploration of quantitative structure–property relationships (QSPR) for the design of new guanidinium ionic liquids. Tetrahedron 2008. [DOI: 10.1016/j.tet.2007.12.021] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 10/22/2022]
|
|
17 |
25 |
8
|
Pereira F, Latino DARS, Aires-de-Sousa J. Estimation of Mayr electrophilicity with a quantitative structure-property relationship approach using empirical and DFT descriptors. J Org Chem 2011; 76:9312-9. [PMID: 21970444 DOI: 10.1021/jo201562f] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Indexed: 11/30/2022]
Abstract
Quantitative structure-property relationships (QSPRs) were investigated for the estimation of the Mayr electrophilicity parameter using a data set of 64 compounds, all currently available uncharged electrophiles in Mayr's Database of Reactivity Parameters. Three collections of empirical descriptors were employed, from Dragon, Adriana.Code, and CDK. Models were built with multilinear regressions, k nearest neighbors, model trees, random forests, support vector machines (SVMs), associative neural networks, and counterpropagation neural networks. Quantum chemical descriptors were calculated with density functional theory (DFT) methods and incorporated in QSPR models. The best results were achieved with SVM using seven empirical and DFT descriptors; an R² of 0.92 was obtained for the test set (21 compounds). The final seven descriptors were the Parr electrophilicity index, ε(LUMO), hardness, and four CDK descriptors (FNSA-3, ATSc5, Kier2, and nAtomLAC). Screening of correlations between individual descriptors and Mayr electrophilicity revealed the highest absolute value of correlation for DFT ε(LUMO) (R = -0.82) and comparable correlations for some empirical descriptors, e.g., Dragon's folding degree index (R = -0.80), Kier flexibility index (R = -0.78), and Kier S2K index (R = -0.78). High correlations were observed in the training set between reactivity descriptors calculated by the PM6 semiempirical and DFT methods (R = 0.96 for ε(LUMO) and 0.94 for the electrophilicity index).
|
Journal Article |
14 |
18 |
9
|
Fokoue HH, Marques JV, Correia MV, Yamaguchi LF, Qu X, Aires-de-Sousa J, Scotti MT, Lopes NP, Kato MJ. Fragmentation pattern of amides by EI and HRESI: study of protonation sites using DFT-3LYP data. RSC Adv 2018; 8:21407-21413. [PMID: 35539943 PMCID: PMC9080946 DOI: 10.1039/c7ra00408g] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 03/20/2018] [Accepted: 06/04/2018] [Indexed: 12/19/2022] Open
Abstract
Amides are important natural products which occur in a few plant families. Piplartine and piperine, major amides in Piper tuberculatum and P. nigrum, respectively, have shown a typical N–CO cleavage when analyzed by EI-MS or HRESI-MS. In this study several synthetic analogs of piplartine and piperine were subjected to both types of mass spectrometric analysis in order to identify structural features influencing fragmentation. Most of the amides showed an intense signal of the protonated molecule [M + H]+ when subjected to both HRESI-MS and EI-MS conditions, with a common outcome being the cleavage of the amide bond (N–CO). This results in the loss of the neutral amine or lactam and the formation of aryl acylium cations. The mechanism of N–CO bond cleavage persists in α,β-unsaturated amides because of the stability caused by extended conjugation. Computational methods determined that the protonation of the piperamides and their derivatives takes place preferentially at the amide nitrogen, supporting the dominant N–CO bond cleavage. The N–CO cleavage of α,β-unsaturated piperamides under EI and ESI is supported by computational studies.
|
|
7 |
9 |
10
|
Chen M, Wu T, Xiao K, Zhao T, Zhou Y, Zhang Q, Aires-de-Sousa J. Machine learning to predict the specific optical rotations of chiral fluorinated molecules. Spectrochim Acta A Mol Biomol Spectrosc 2019; 223:117289. [PMID: 31255865 DOI: 10.1016/j.saa.2019.117289] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 03/07/2019] [Revised: 05/31/2019] [Accepted: 06/17/2019] [Indexed: 06/09/2023]
Abstract
A chemoinformatics method was applied to the assignment of absolute configurations and to the quantitative prediction of specific optical rotations using a data set of 88 chiral fluorinated molecules (44 pairs of enantiomers). Counterpropagation neural networks were explored for the classification of enantiomers as dextrorotatory or levorotatory. Regression models were trained using multilayer perceptrons (MLP), random forests (RF) or multilinear regressions (MLR), on the basis of physicochemical atomic stereo (PAS) descriptors. New descriptors were also derived considering the common structural features of the data set (cPAS descriptors), which enabled RF models to predict the whole data set with R = 0.964, mean absolute error (MAE) of 9.8° and root mean square error (RMSE) of 12.5° in leave-one-pair-out cross-validation experiments. The predictions for the 30 compounds measured in chloroform were obtained with R = 0.971, MAE = 9.1° and RMSE = 12.5°, which compares favorably with quantum chemistry calculations reported in the literature.
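The leave-one-pair-out cross-validation named in this abstract holds out both enantiomers of a pair together, so that a model is never tested on a molecule whose mirror image it was trained on. A minimal sketch of such a splitter is given below; the function name and the `pair_ids` labelling scheme are assumptions made for illustration, not the authors' code.

```python
def leave_one_pair_out(pair_ids):
    """Yield (train_indices, test_indices), holding out one enantiomer pair per fold.

    `pair_ids[i]` is a hypothetical label identifying the pair that sample i belongs to.
    """
    for held_out in sorted(set(pair_ids)):
        test = [i for i, p in enumerate(pair_ids) if p == held_out]
        train = [i for i, p in enumerate(pair_ids) if p != held_out]
        yield train, test

# Six molecules forming three enantiomer pairs (toy example)
pair_ids = [0, 0, 1, 1, 2, 2]
folds = list(leave_one_pair_out(pair_ids))

for train, test in folds:
    print("train:", train, "test:", test)
```

With the 88 molecules (44 pairs) described in the abstract, this scheme would produce 44 folds, each with two test molecules.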
|
|
6 |
5 |
11
|
Mamede R, de-Almeida BS, Chen M, Zhang Q, Aires-de-Sousa J. Machine Learning Classification of One-Chiral-Center Organic Molecules According to Optical Rotation. J Chem Inf Model 2020; 61:67-75. [PMID: 33350814 DOI: 10.1021/acs.jcim.0c00876] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
In this study, machine learning algorithms were investigated for the classification of organic molecules with one carbon chiral center according to the sign of optical rotation. Diverse heterogeneous data sets comprising up to 13,080 compounds and their corresponding optical rotation were retrieved from Reaxys and processed independently for three solvents: dichloromethane, chloroform, and methanol. The molecular structures were represented by chiral descriptors based on the physicochemical and topological properties of ligands attached to the chiral center. The sign of optical rotation was predicted by random forests (RF) and artificial neural networks for independent test sets with an accuracy of up to 75% for dichloromethane, 82% for chloroform, and 82% for methanol. RF probabilities and the availability of structures in the training set with the same spheres of atom types around the chiral center defined applicability domains in which the accuracy is higher.
|
Research Support, Non-U.S. Gov't |
5 |
3 |
12
|
Soares A, Estevao MS, Marques MMB, Kovalishyn V, Latino DARS, Aires-de-Sousa J, Ramos J, Viveiros M, Martins F. Synthesis and Biological Evaluation of Hybrid 1,5- and 2,5-Disubstituted Indoles as Potentially New Antitubercular Agents. Med Chem 2017; 13:439-447. [PMID: 28185538 DOI: 10.2174/1573406413666170209144003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/15/2016] [Revised: 02/03/2017] [Accepted: 02/03/2017] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Tuberculosis (TB) is the second leading cause of mortality worldwide, being a highly contagious and insidious illness caused by Mycobacterium tuberculosis (Mtb). Additionally, the emergence of multidrug-resistant and extensively drug-resistant strains of Mtb, together with significant levels of co-infection with HIV and TB (HIV/TB), makes the search for new antitubercular drugs urgent and challenging.
METHODS: This work was based on the hypothesis that an active compound could be obtained if substituents present in some other active compounds were attached to a core of an important structure, in this case the indole scaffold, thus generating a hybrid compound. A QSAR-oriented design based on classification and regression models, along with the estimation of physicochemical and biological properties, was also used to assist in the selection of compounds. Chosen compounds were synthesized using various synthetic procedures and evaluated against the M. tuberculosis H37Rv strain.
RESULTS: Selected compounds possess substituents at positions C5, C2 and N1 of the indole ring. The substituents involve p-halophenyl, pyridyl, benzyloxy and benzylamine groups. Four compounds were synthesised using suitable synthetic procedures to attain the desired substitution at the indole core. Of these, three compounds are new and have been fully characterized and tested in vitro against the H37Rv ATCC27294T Mtb strain, using isoniazid as a control. One of them, compound 2, with the pyridyl group at N1, has an experimental log (1/MIC) very close to 5 and can be considered (weakly) active. In fact, it is more active than 64% of all indole molecules in our data sets of experimental results from the literature. The most active indole in these data sets has log (1/MIC) = 5.93, with only 6 compounds having log (1/MIC) above 5.5.
CONCLUSION: Despite the lower activity found for the tested compounds when compared to other reported indole derivatives, these structures, which rely on a hybrid design concept, may constitute interesting scaffolds to prepare a new family of TB inhibitors with improved activity.
|
|
8 |
2 |
13
|
Gómez-Carracedo M, Ballabio D, Andrade J, Aires-de-Sousa J, Consonni V. Comparing roadsoils pollution patterns extracted by MOLMAP and classical three-way decomposition methods. Anal Chim Acta 2010; 677:64-71. [DOI: 10.1016/j.aca.2010.07.044] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/27/2010] [Revised: 07/20/2010] [Accepted: 07/28/2010] [Indexed: 10/19/2022]
|
|
15 |
1 |
14
|
Gao X, Baimacheva N, Aires-de-Sousa J. Exploring Molecular Heteroencoders with Latent Space Arithmetic: Atomic Descriptors and Molecular Operators. Molecules 2024; 29:3969. [PMID: 39203047 PMCID: PMC11357237 DOI: 10.3390/molecules29163969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 08/04/2024] [Accepted: 08/06/2024] [Indexed: 09/03/2024] Open
Abstract
A variational heteroencoder based on recurrent neural networks, trained with SMILES linear notations of molecular structures, was used to derive the following atomic descriptors: delta latent space vectors (DLSVs) obtained from the original SMILES of the whole molecule and the SMILES of the same molecule with the target atom replaced. Different replacements were explored, namely, changing the atomic element, replacement with a character of the model vocabulary not used in the training set, or the removal of the target atom from the SMILES. Unsupervised mapping of the DLSV descriptors with t-distributed stochastic neighbor embedding (t-SNE) revealed a remarkable clustering according to the atomic element, hybridization, atomic type, and aromaticity. Atomic DLSV descriptors were used to train machine learning (ML) models to predict 19F NMR chemical shifts. An R² of up to 0.89 and mean absolute errors of up to 5.5 ppm were obtained for an independent test set of 1046 molecules with random forests or a gradient-boosting regressor. Intermediate representations from a Transformer model yielded comparable results. Furthermore, DLSVs were applied as molecular operators in the latent space: the DLSV of a halogenation (H→F substitution) was added to the LSVs of 4135 new molecules with no fluorine atom and decoded into SMILES, yielding 99% valid SMILES, with 75% of the SMILES incorporating fluorine and 56% of the structures incorporating fluorine with no other structural change.
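The latent space arithmetic described in this abstract can be illustrated with plain vectors. In the sketch below, a DLSV is taken as the difference between two latent space vectors and is then added to the vector of another molecule. The three-component vectors are hypothetical stand-ins for encoder outputs, the sign convention (modified minus original) is an assumption, and no real encoder or decoder is involved.

```python
def delta_lsv(lsv_modified, lsv_original):
    """Delta latent space vector: component-wise difference of two latent vectors.

    The direction (modified minus original) is an assumption for this sketch.
    """
    return [m - o for m, o in zip(lsv_modified, lsv_original)]

def apply_operator(lsv, dlsv):
    """Apply a DLSV as a molecular operator by adding it to another latent vector."""
    return [x + d for x, d in zip(lsv, dlsv)]

# Hypothetical latent vectors for a molecule before and after an H→F substitution
lsv_h = [0.25, -1.0, 0.5]
lsv_f = [0.75, -0.5, 0.0]
fluorination = delta_lsv(lsv_f, lsv_h)

# Hypothetical latent vector of a new, fluorine-free molecule
lsv_new = [1.0, 0.0, 2.0]
lsv_shifted = apply_operator(lsv_new, fluorination)
print(lsv_shifted)
```

In the work summarised above, the shifted vector would then be passed to the trained decoder to obtain a SMILES string; that step depends on the model and is not reproduced here.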
|
research-article |
1 |
|
15
|
Aires-de-Sousa J. GUIDEMOL: A Python graphical user interface for molecular descriptors based on RDKit. Mol Inform 2024; 43:e202300190. [PMID: 37885368 DOI: 10.1002/minf.202300190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Revised: 10/24/2023] [Accepted: 10/26/2023] [Indexed: 10/28/2023]
Abstract
GUIDEMOL is a Python computer program based on the RDKit software to process molecular structures and calculate molecular descriptors with a graphical user interface using the tkinter package. It can calculate descriptors already implemented in RDKit as well as grid representations of 3D molecular structures using the electrostatic potential or voxels. The GUIDEMOL app provides easy access to RDKit tools for chemoinformatics users with no programming skills and can be adapted to calculate other descriptors or to trigger other procedures. A command line interface (CLI) is also provided for the calculation of grid representations. The source code is available at https://github.com/jairesdesousa/guidemol.
|
|
1 |
|