1
Banerjee S, Dumawat S, Jha T, Lanka G, Adhikari N, Ghosh B. Fragment-based structural exploration and chemico-biological interaction study of HDAC3 inhibitors through non-linear pattern recognition, chemical space, and binding mode of interaction analysis. J Biomol Struct Dyn 2024; 42:8831-8853. [PMID: 37608752 DOI: 10.1080/07391102.2023.2248509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Accepted: 08/10/2023] [Indexed: 08/24/2023]
Abstract
HDAC3 is an emerging target for the identification and discovery of novel drug candidates against several disease conditions, including cancer. Here, a fragment-based non-linear machine learning (ML) method, along with chemical space exploration followed by a structure-based binding mode of interaction analysis, was carried out on some HDAC3 inhibitors to obtain the key structural features modulating HDAC3 inhibition. Both the ML and chemical space analyses identified several physicochemical and structural properties favorable for higher HDAC3 inhibition, namely lipophilicity, polar and relative polar surface area, the arylcarboxamide moiety, a bulky fused aromatic group, n-alkyl and cinnamoyl moieties, a higher number of oxygen atoms, and π-electrons for the substituted tetrahydrofuronaphthodioxolone moiety. Moreover, hydrogen-bond-forming capability, the length and substitution position of the linker moiety, the phenyl ring in the linker motif, and heterocyclic cap moieties contribute to effective inhibitor binding at the HDAC3 catalytic site, which correspondingly affects HDAC3 inhibitory potency. In contrast, a macrocyclic ring structure and a cyclohexyl cap moiety are responsible for lower HDAC3 inhibition. The MD simulation study of selected compounds explained strong binding patterns at the HDAC3 active site, as evidenced by the lower RMSD and RMSF values. It also explained the importance of the crucial structural fragments derived from the fragment-based analysis during ligand-enzyme interactions. Therefore, the outcomes of this structural analysis will be a useful tool for fragment-based drug discovery of effective HDAC3 inhibitors for clinical therapeutics in the future. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Suvankar Banerjee
  - Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- Shraddha Dumawat
  - Epigenetic Research Laboratory, Department of Pharmacy, Birla Institute of Technology and Science-Pilani, Hyderabad Campus, Shamirpet, Hyderabad, India
- Tarun Jha
  - Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- Goverdhan Lanka
  - Epigenetic Research Laboratory, Department of Pharmacy, Birla Institute of Technology and Science-Pilani, Hyderabad Campus, Shamirpet, Hyderabad, India
- Nilanjan Adhikari
  - Natural Science Laboratory, Division of Medicinal and Pharmaceutical Chemistry, Department of Pharmaceutical Technology, Jadavpur University, Kolkata, India
- Balaram Ghosh
  - Epigenetic Research Laboratory, Department of Pharmacy, Birla Institute of Technology and Science-Pilani, Hyderabad Campus, Shamirpet, Hyderabad, India
2
Bonilla-Caraballo G, Rodriguez-Martinez M. Deep Learning Methods to Help Predict Properties of Molecules from SMILES. PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON INTELLIGENT COMPUTING AND NETWORKING 2024 : (ISICN 2024). INTERNATIONAL SYMPOSIUM ON INTELLIGENT COMPUTING AND NETWORKING (1ST : 2024 : SAN JUAN, P.R.) 2024; 1094:119-138. [PMID: 39493535 PMCID: PMC11529754 DOI: 10.1007/978-3-031-67447-1_9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/05/2024]
Abstract
Machine learning methods have been proposed in lieu of simulations to predict chemical properties of molecules. The trade-off here is paying for the training time once, in exchange for instant predictions on the input data. However, many of these methods rely heavily on feature engineering to prepare the data for these models. Moreover, the use of molecular structural information has been limited, despite having such information encoded in the Simplified Molecular Input Line Entry System (SMILES) format. In this paper we present a framework that relies on SMILES data to predict molecular properties. Our methods are based on 1-D Convolutional Networks and do not require complex feature engineering. Our methods can be applied to learn molecular properties from base data, thus making them accessible to a wider audience. Our experiments show that this method can predict the molecular weight and XLogP properties without any encoding of complex chemical rules.
3
Matsumoto Y, Gotoh H. Compound Classification and Consideration of Correlation with Chemical Descriptors from Articles on Antioxidant Capacity Using Natural Language Processing. J Chem Inf Model 2024; 64:119-127. [PMID: 38118462 DOI: 10.1021/acs.jcim.3c01826] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/22/2023]
Abstract
In recent times, there has been a substantial increase in the number of articles focusing on antioxidants. However, the development of a comprehensive estimator for antioxidant capacity remains elusive due to the challenge of integrating information from these articles. Furthermore, the complexity of the antioxidant mechanism, which involves a multitude of factors, makes it difficult to establish a simple equation or correlation. Hence, there is a pressing need for a model that can effectively interpret the collective knowledge from these articles, especially from a chemistry perspective. In this research, we employed natural language processing techniques, specifically Word2Vec, to analyze articles related to antioxidant capacity. We extracted representation vectors of compound names from these documents and organized them into 10 distinct clusters. In our investigation of two of these clusters, we unveiled that the majority of the compounds in question were flavonoids and flavonoid glycosides. To establish a link between the descriptors and clusters, we utilized kernel density estimation and generated scatter plots to visualize their similarity. These visualizations clearly indicated a strong relationship between the descriptors and clusters, affirming that a tangible connection exists between word vectors and compound descriptors through a document analysis conducted with natural language processing techniques. This study represents a pioneering approach that utilizes document analysis to shed light on the field of antioxidant capacity research, marking a significant advancement in this domain.
Affiliation(s)
- Yuto Matsumoto
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
- Hiroaki Gotoh
  - Department of Chemistry and Life Science, Yokohama National University, 79-5 Tokiwadai, Hodogaya-ku, Yokohama 240-8501, Japan
4
Eswaran SCD, Subramaniam S, Sanyal U, Rallo R, Zhang X. Molecular structural dataset of lignin macromolecule elucidating experimental structural compositions. Sci Data 2022; 9:647. [PMID: 36273011 PMCID: PMC9588021 DOI: 10.1038/s41597-022-01709-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/29/2022] [Accepted: 09/20/2022] [Indexed: 11/23/2022] Open
Abstract
Lignin is one of the most abundant biopolymers in nature and has great potential to be transformed into high-value chemicals. However, the limited availability of molecular structure data hinders its potential industrial applications. Herein, we present the Lignin Structural (LGS) Dataset that includes the molecular structure of milled wood lignin focusing on two major monomeric units (coniferyl and syringyl), and the six most common interunit linkages (phenylpropane β-aryl ether, resinol, phenylcoumaran, biphenyl, dibenzodioxocin, and diaryl ether). The dataset constitutes a unique resource that covers a part of lignin’s chemical space characterized by polymer chains with lengths in the range of 3 to 25 monomer units. Structural data were generated using a sequence-controlled polymer generation approach that was calibrated to match experimental lignin properties. The LGS dataset includes 60 K newly generated lignin structures that match with high accuracy (~90%) the experimentally determined structural compositions available in the literature. The LGS dataset is a valuable resource to advance lignin chemistry research, including computational simulation approaches and predictive modelling.
- Measurement(s): molecular structure
- Technology Type(s): Computer Modeling
- Factor Type(s): monomer ratio • bond frequency • degree of polymerization
- Sample Characteristic - Organism: coniferous (softwood) • deciduous (hardwood)
Affiliation(s)
- Sudha Cheranma Devi Eswaran
  - Bioproducts Sciences and Engineering Laboratory, Washington State University, 2710 Crimson Way, Richland, WA, 99354, USA
  - Voiland School of Chemical Engineering and Bioengineering, Washington State University, Richland, WA, 99354, USA
- Senthil Subramaniam
  - Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA, 99354, USA
- Udishnu Sanyal
  - Bioproducts Sciences and Engineering Laboratory, Washington State University, 2710 Crimson Way, Richland, WA, 99354, USA
  - Voiland School of Chemical Engineering and Bioengineering, Washington State University, Richland, WA, 99354, USA
- Robert Rallo
  - Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA, 99354, USA
- Xiao Zhang
  - Bioproducts Sciences and Engineering Laboratory, Washington State University, 2710 Crimson Way, Richland, WA, 99354, USA
  - Voiland School of Chemical Engineering and Bioengineering, Washington State University, Richland, WA, 99354, USA
  - Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA, 99354, USA
5
Ryu JY, Lee JH, Lee BH, Song JS, Ahn S, Oh KS. PredMS: a random forest model for predicting metabolic stability of drug candidates in human liver microsomes. Bioinformatics 2022; 38:364-368. [PMID: 34515778 DOI: 10.1093/bioinformatics/btab547] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 03/21/2021] [Revised: 07/22/2021] [Accepted: 09/08/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: Poor metabolic stability leads to drug development failure. Therefore, it is essential to evaluate the metabolic stability of small compounds for successful drug discovery and development. However, evaluating metabolic stability in vitro and in vivo is expensive, time-consuming and laborious. In addition, only a few free software programs are available for metabolic stability data and prediction. Therefore, in this study, we aimed to develop a prediction model that predicts the metabolic stability of small compounds. RESULTS: We developed a computational model, PredMS, which predicts the metabolic stability of small compounds as stable or unstable in human liver microsomes. PredMS is based on a random forest model using an in-house database of metabolic stability data of 1917 compounds. To validate the prediction performance of PredMS, we generated external test data of 61 compounds. PredMS achieved an accuracy of 0.74, Matthews correlation coefficient of 0.48, sensitivity of 0.70, specificity of 0.86, positive predictive value of 0.94 and negative predictive value of 0.46 on the external test dataset. PredMS will be a useful tool to predict the metabolic stability of small compounds in the early stages of drug discovery and development. AVAILABILITY AND IMPLEMENTATION: The source code for PredMS is available at https://bitbucket.org/krictai/predms, and the PredMS web server is available at https://predms.netlify.app. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jae Yong Ryu
  - Department of Biotechnology, Duksung Women's University, Seoul 01369, Republic of Korea
- Jeong Hyun Lee
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, 34114 Daejeon, Republic of Korea
- Byung Ho Lee
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, 34114 Daejeon, Republic of Korea
- Jin Sook Song
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, 34114 Daejeon, Republic of Korea
- Sunjoo Ahn
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, 34114 Daejeon, Republic of Korea
  - Department of Medicinal and Pharmaceutical Chemistry, University of Science and Technology, Daejeon 34129, Republic of Korea
- Kwang-Seok Oh
  - Data Convergence Drug Research Center, Korea Research Institute of Chemical Technology, 34114 Daejeon, Republic of Korea
  - Department of Medicinal and Pharmaceutical Chemistry, University of Science and Technology, Daejeon 34129, Republic of Korea
6
Bhosale H, Ramakrishnan V, Jayaraman VK. Support vector machine-based prediction of pore-forming toxins (PFT) using distributed representation of reduced alphabets. J Bioinform Comput Biol 2021; 19:2150028. [PMID: 34693886 DOI: 10.1142/s0219720021500281] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 12/20/2022]
Abstract
Bacterial virulence can be attributed to a wide variety of factors including toxins that harm the host. Pore-forming toxins are one class of toxins that confer virulence to the bacteria and are one of the promising targets for therapeutic intervention. In this work, we develop a sequence-based machine learning framework for the prediction of pore-forming toxins. For this, we have used distributed representation of the protein sequence encoded by reduced alphabet schemes based on conformational similarity and hydropathy index as input features to Support Vector Machines (SVMs). The choice of conformational similarity and hydropathy indices is based on the functional mechanism of pore-forming toxins. Our methodology achieves about 81% accuracy, indicating that conformational similarity, an indicator of the flexibility of amino acids, along with hydrophobic index can capture the intrinsic features of pore-forming toxins that distinguish them from other types of transporter proteins. Increased understanding of the mechanisms of pore-forming toxins can further contribute to the use of such "mechanism-informed" features that may increase the prediction accuracy further.
Affiliation(s)
- Hrushikesh Bhosale
  - Department of Computer Science, FLAME University, Pune, Maharashtra, India
- Vigneshwar Ramakrishnan
  - School of Chemical & Biotechnology, SASTRA Deemed-to-be University, Thanjavur, Tamilnadu, India
- Valadi K Jayaraman
  - Department of Computer Science, FLAME University, Pune, Maharashtra, India
7
Santiago Á, Guzmán-Ocampo DC, Aguayo-Ortiz R, Dominguez L. Characterizing the Chemical Space of γ-Secretase Inhibitors and Modulators. ACS Chem Neurosci 2021; 12:2765-2775. [PMID: 34291906 DOI: 10.1021/acschemneuro.1c00313] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/23/2022] Open
Abstract
γ-Secretase (GS) is one of the most attractive molecular targets for the treatment of Alzheimer's disease (AD). Its key role in the final step of amyloid-β peptides generation and its relationship in the cascade of events for disease development have caught the attention of many pharmaceutical groups. Over the past years, different inhibitors and modulators have been evaluated as promising therapeutics against AD. However, despite the great chemical diversity of the reported compounds, a global classification and visual representation of the chemical space for GS inhibitors and modulators remain unavailable. In the present work, we carried out a two-dimensional (2D) chemical space analysis from different classes and subclasses of GS inhibitors and modulators based on their structural similarity. Along with the novel structural information available for GS complexes, our analysis opens the possibility to identify compounds with high molecular similarity, critical to finding new chemical structures through the optimization of existing compounds and relating them with a potential binding site.
Affiliation(s)
- Ángel Santiago
  - Departamento de Fisicoquímica, Facultad de Química, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Dulce C. Guzmán-Ocampo
  - Departamento de Fisicoquímica, Facultad de Química, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Rodrigo Aguayo-Ortiz
  - Departamento de Farmacia, Facultad de Química, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
- Laura Dominguez
  - Departamento de Fisicoquímica, Facultad de Química, Universidad Nacional Autónoma de México, Mexico City 04510, Mexico
8
Vaškevičius M, Kapočiūtė-Dzikienė J, Šlepikas L. Prediction of Chromatography Conditions for Purification in Organic Synthesis Using Deep Learning. Molecules 2021; 26:2474. [PMID: 33922736 PMCID: PMC8123027 DOI: 10.3390/molecules26092474] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Revised: 04/15/2021] [Accepted: 04/22/2021] [Indexed: 01/27/2023] Open
Abstract
In this research, a process for developing normal-phase liquid chromatography solvent systems has been proposed. In contrast to the development of conditions via thin-layer chromatography (TLC), this process is based on the architecture of two hierarchically connected neural network-based components. Using a large database of reaction procedures allows those two components to perform an essential role in the machine-learning-based prediction of chromatographic purification conditions, i.e., solvents and the ratio between solvents. In our paper, we build two datasets and test various molecular vectorization approaches, such as extended-connectivity fingerprints, learned embedding, and auto-encoders along with different types of deep neural networks to demonstrate a novel method for modeling chromatographic solvent systems employing two neural networks in sequence. Afterward, we present our findings and provide insights on the most effective methods for solving prediction tasks. Our approach results in a system of two neural networks with long short-term memory (LSTM)-based auto-encoders, where the first predicts solvent labels (reaching a classification accuracy of 0.950 ± 0.001) and, in the case of two solvents, the second one predicts the ratio between the two solvents (R² metric equal to 0.982 ± 0.001). Our approach can be used as a guidance instrument in laboratories to accelerate scouting for suitable chromatography conditions.
Affiliation(s)
- Mantas Vaškevičius
  - Department of Applied Informatics, Vytautas Magnus University, LT-44404 Kaunas, Lithuania
  - JSC Synhet, Biržų Str. 6, LT-44139 Kaunas, Lithuania
9
Shibayama S, Funatsu K. Industrial Case Study: Identification of Important Substructures and Exploration of Monomers for the Rapid Design of Novel Network Polymers with Distributed Representation. BULLETIN OF THE CHEMICAL SOCIETY OF JAPAN 2021. [DOI: 10.1246/bcsj.20200220] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Affiliation(s)
- Shojiro Shibayama
  - Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- Kimito Funatsu
  - Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
10
Prihoda D, Maritz JM, Klempir O, Dzamba D, Woelk CH, Hazuda DJ, Bitton DA, Hannigan GD. The application potential of machine learning and genomics for understanding natural product diversity, chemistry, and therapeutic translatability. Nat Prod Rep 2021; 38:1100-1108. [PMID: 33245088 DOI: 10.1039/d0np00055h] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 12/28/2022]
Abstract
Covering: up to the end of 2020. The machine learning field can be defined as the study and application of algorithms that perform classification and prediction tasks through pattern recognition instead of explicitly defined rules. Among other areas, machine learning has excelled in natural language processing. As such methods have excelled at understanding written languages (e.g. English), they are also being applied to biological problems to better understand the "genomic language". In this review we focus on recent advances in applying machine learning to natural products and genomics, and how those advances are improving our understanding of natural product biology, chemistry, and drug discovery. We discuss machine learning applications in genome mining (identifying biosynthetic signatures in genomic data), predictions of what structures will be created from those genomic signatures, and the types of activity we might expect from those molecules. We further explore the application of these approaches to data derived from complex microbiomes, with a focus on the human microbiome. We also review challenges in leveraging machine learning approaches in the field, and how the availability of other "omics" data layers provides value. Finally, we provide insights into the challenges associated with interpreting machine learning models and the underlying biology and promises of applying machine learning to natural product drug discovery. We believe that the application of machine learning methods to natural product research is poised to accelerate the identification of new molecular entities that may be used to treat a variety of disease indications.
Affiliation(s)
- David Prihoda
  - R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic and Department of Informatics and Chemistry, Faculty of Chemical Technology, University of Chemistry and Technology, Prague, Czech Republic
- Julia M Maritz
  - Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA
- Ondrej Klempir
  - R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
- David Dzamba
  - R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
- Daria J Hazuda
  - Exploratory Science Center, Merck & Co., Inc., Cambridge, MA, USA
- Danny A Bitton
  - R&D Informatics Solutions, MSD Czech Republic s.r.o., Prague, Czech Republic
11
Chakravarti SK. Reason Vectors: Abstract Representation of Chemistry–Biology Interaction Outcomes, for Reasoning and Prediction. J Chem Inf Model 2020; 60:4614-4628. [DOI: 10.1021/acs.jcim.0c00601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Affiliation(s)
- Suman K. Chakravarti
  - MultiCASE Inc., 23811 Chagrin Blvd., Suite 305, Beachwood, Ohio 44122, United States
12
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]
13
Shibayama S, Marcou G, Horvath D, Baskin II, Funatsu K, Varnek A. Application of the mol2vec Technology to Large-size Data Visualization and Analysis. Mol Inform 2020; 39:e1900170. [PMID: 32090493 DOI: 10.1002/minf.201900170] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/05/2019] [Accepted: 02/11/2020] [Indexed: 11/09/2022]
Abstract
Generative Topographic Mapping (GTM) is a dimensionality reduction method that is widely used for both data visualization and structure-activity modeling. Large dimensionality of the initial data space may require significant computational resources and slow down the GTM construction. Therefore, it may be meaningful to reduce the number of descriptors used for encoding molecular structures. Principal Component Analysis (PCA), a standard preprocessing tool, suffers from information loss upon dimensionality reduction. As an alternative, we propose to use the substructure vector embedding provided by the mol2vec technique. In addition to the data dimensionality reduction, this technology also accounts for proximity of substructures in molecular graphs. In this study, the dimensionality of large descriptor spaces of ISIDA fragment descriptors or Morgan fingerprints was reduced using either the PCA or the mol2vec method. The latter significantly speeds up GTM training without compromising its predictive power in bioactivity classification tasks.
Affiliation(s)
- Shojiro Shibayama
  - Department of Chemical System Engineering, School of Engineering, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan
  - Department, Institution Laboratoire de Chemoinformatique, UMR7140 University of Strasbourg-CNRS, 4, rue Blaise Pascal, 67000, Strasbourg, France
- Gilles Marcou
  - Department, Institution Laboratoire de Chemoinformatique, UMR7140 University of Strasbourg-CNRS, 4, rue Blaise Pascal, 67000, Strasbourg, France
- Dragos Horvath
  - Department, Institution Laboratoire de Chemoinformatique, UMR7140 University of Strasbourg-CNRS, 4, rue Blaise Pascal, 67000, Strasbourg, France
- Igor I Baskin
  - Faculty of Physics, Lomonosov Moscow State University, Leninskie Gory, 119991, Moscow, Russian Federation
- Kimito Funatsu
  - Department of Chemical System Engineering, School of Engineering, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan
- Alexandre Varnek
  - Department, Institution Laboratoire de Chemoinformatique, UMR7140 University of Strasbourg-CNRS, 4, rue Blaise Pascal, 67000, Strasbourg, France
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, 001-0021, Sapporo, Japan
14
Cova TFGG, Pais AACC. Deep Learning for Deep Chemistry: Optimizing the Prediction of Chemical Patterns. Front Chem 2019; 7:809. [PMID: 32039134 PMCID: PMC6988795 DOI: 10.3389/fchem.2019.00809] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 07/16/2019] [Accepted: 11/11/2019] [Indexed: 12/14/2022] Open
Abstract
Computational Chemistry is currently a synergistic assembly between ab initio calculations, simulation, machine learning (ML) and optimization strategies for describing, solving and predicting chemical data and related phenomena. These include accelerated literature searches, analysis and prediction of physical and quantum chemical properties, transition states, chemical structures, chemical reactions, and also new catalysts and drug candidates. The generalization of scalability to larger chemical problems, rather than specialization, is now the main principle for transforming chemical tasks in multiple fronts, for which systematic and cost-effective solutions have benefited from ML approaches, including those based on deep learning (e.g. quantum chemistry, molecular screening, synthetic route design, catalysis, drug discovery). The latter class of ML algorithms is capable of combining raw input into layers of intermediate features, enabling bench-to-bytes designs with the potential to transform several chemical domains. In this review, the most exciting developments concerning the use of ML in a range of different chemical scenarios are described. A range of different chemical problems and respective rationalization, that have hitherto been inaccessible due to the lack of suitable analysis tools, is thus detailed, evidencing the breadth of potential applications of these emerging multidimensional approaches. Focus is given to the models, algorithms and methods proposed to facilitate research on compound design and synthesis, materials design, prediction of binding, molecular activity, and soft matter behavior. 
The information produced by pairing Chemistry and ML, through data-driven analyses, neural network predictions and monitoring of chemical systems, allows (i) prompting the ability to understand the complexity of chemical data, (ii) streamlining and designing experiments, (iii) discovering new molecular targets and materials, and also (iv) planning or rethinking forthcoming chemical challenges. In fact, optimization engulfs all these tasks directly.
Affiliation(s)
- Tânia F. G. G. Cova
  - Coimbra Chemistry Centre, CQC, Department of Chemistry, Faculty of Sciences and Technology, University of Coimbra, Coimbra, Portugal
- Alberto A. C. C. Pais
  - Coimbra Chemistry Centre, CQC, Department of Chemistry, Faculty of Sciences and Technology, University of Coimbra, Coimbra, Portugal
15
Chakravarti SK, Alla SRM. Descriptor Free QSAR Modeling Using Deep Learning With Long Short-Term Memory Neural Networks. Front Artif Intell 2019; 2:17. [PMID: 33733106 PMCID: PMC7861338 DOI: 10.3389/frai.2019.00017] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 04/28/2019] [Accepted: 08/22/2019] [Indexed: 12/15/2022] Open
Abstract
Current practice of building QSAR models usually involves computing a set of descriptors for the training set compounds, applying a descriptor selection algorithm and finally using a statistical fitting method to build the model. In this study, we explored the prospects of building good quality interpretable QSARs for big and diverse datasets, without using any pre-calculated descriptors. We have used different forms of Long Short-Term Memory (LSTM) neural networks to achieve this, trained directly using either traditional SMILES codes or a new linear molecular notation developed as part of this work. Three endpoints were modeled: Ames mutagenicity, inhibition of P. falciparum Dd2 and inhibition of Hepatitis C Virus, with training sets ranging from 7,866 to 31,919 compounds. To boost the interpretability of the prediction results, an attention-based machine learning mechanism, jointly with a bidirectional LSTM, was used to detect structural alerts for the mutagenicity data set. Traditional fragment descriptor-based models were used for comparison. As per the results of the external and cross-validation experiments, overall prediction accuracies of the LSTM models were close to those of the fragment-based models. However, LSTM models were superior in predicting test chemicals that are dissimilar to the training set compounds, a coveted quality of QSAR models in real world applications. In summary, it is possible to build QSAR models using LSTMs without using pre-computed traditional descriptors, and the models are far from being "black box." We hope that this study will be helpful in bringing large, descriptor-less QSARs to mainstream use.
|
16
|
Chakravarti SK, Saiakhov RD. Computing similarity between structural environments of mutagenicity alerts. Mutagenesis 2019; 34:55-65. [PMID: 30346583 DOI: 10.1093/mutage/gey032] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 06/19/2018] [Revised: 08/16/2018] [Accepted: 09/29/2018] [Indexed: 11/12/2022]
Abstract
This article describes a method to generate molecular fingerprints from the structural environments of mutagenicity alerts and to calculate the similarity between them. This approach was used to improve the classification accuracy of alerts and to search for structurally similar analogues of an alerting chemical. It builds fingerprints using molecular fragments from the vicinity of the alerts and automatically accounts for the activating and deactivating/mitigating features of alerts needed for accurate predictions. This study also demonstrates the usefulness of transfer learning, in which a distributed representation of chemical fragments was first trained on millions of unlabelled chemicals and then used for generating fingerprints and similarity search on smaller data sets labelled with Ames test outcomes. The distributed fingerprints gave better prediction performance and increased coverage compared to traditional binary fingerprints. The methodology was applied to four common mutagenic functionalities: primary aromatic amine, aromatic nitro, epoxide and alkyl chloride. The effects of various hyperparameters on prediction accuracy and test coverage for the k-nearest neighbours prediction method are also described, e.g. similarity thresholds, number of neighbours and size of the alert environment.
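To make the roles of the similarity threshold and the number of neighbours concrete, here is a minimal sketch of k-nearest-neighbour prediction over fingerprints. It is an illustration under stated assumptions, not the method of the cited article: it uses plain binary fingerprints (sets of on-bit indices) with Tanimoto similarity rather than the distributed fingerprints described above, and the names `tanimoto`, `knn_predict`, `k` and `min_similarity` are hypothetical.

```python
from collections import Counter


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def knn_predict(query, training, k=3, min_similarity=0.5):
    """Majority-vote label from up to k neighbours at or above min_similarity.

    `training` is a list of (fingerprint, label) pairs. Returns None when no
    training fingerprint passes the threshold, i.e. the query is out of coverage.
    """
    scored = [(tanimoto(query, fp), label) for fp, label in training]
    eligible = [pair for pair in scored if pair[0] >= min_similarity]
    # Sort by similarity only, so ties keep their original training order.
    neighbours = sorted(eligible, key=lambda pair: pair[0], reverse=True)[:k]
    if not neighbours:
        return None
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


training = [
    ({1, 2, 3, 4}, "positive"),
    ({1, 2, 3, 5}, "positive"),
    ({7, 8, 9}, "negative"),
]
print(knn_predict({1, 2, 3}, training))
print(knn_predict({20, 21}, training))
```

In this sketch, raising `min_similarity` makes predictions rest on closer analogues but leaves more queries without any prediction, which is the accuracy versus coverage trade-off the abstract refers to.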
|
17
|
Baylon JL, Cilfone NA, Gulcher JR, Chittenden TW. Enhancing Retrosynthetic Reaction Prediction with Deep Learning Using Multiscale Reaction Classification. J Chem Inf Model 2019; 59:673-688. [DOI: 10.1021/acs.jcim.8b00801] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Indexed: 01/18/2023]
Affiliation(s)
- Javier L. Baylon
  - Computational Statistics and Bioinformatics Group, Advanced Artificial Intelligence Research Laboratory, WuXi NextCODE, Cambridge, Massachusetts 02142, United States
  - Complex Biological Systems Alliance, Medford, Massachusetts 02155, United States
- Nicholas A. Cilfone
  - Computational Statistics and Bioinformatics Group, Advanced Artificial Intelligence Research Laboratory, WuXi NextCODE, Cambridge, Massachusetts 02142, United States
  - Complex Biological Systems Alliance, Medford, Massachusetts 02155, United States
- Jeffrey R. Gulcher
  - Computational Statistics and Bioinformatics Group, Advanced Artificial Intelligence Research Laboratory, WuXi NextCODE, Cambridge, Massachusetts 02142, United States
  - Cancer Genetics Group, WuXi NextCODE, Cambridge, Massachusetts 02142, United States
- Thomas W. Chittenden
  - Complex Biological Systems Alliance, Medford, Massachusetts 02155, United States
  - Computational Statistics and Bioinformatics Group, Advanced Artificial Intelligence Research Laboratory, WuXi NextCODE, Cambridge, Massachusetts 02142, United States
  - Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, Massachusetts 02215, United States
|