1
Tahıl G, Delorme F, Le Berre D, Monflier É, Sayede A, Tilloy S. Stereoisomers Are Not Machine Learning's Best Friends. J Chem Inf Model 2024. [PMID: 38949069 DOI: 10.1021/acs.jcim.4c00318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
This study addresses the challenge of accurately identifying stereoisomers in cheminformatics, which originates from our objective to apply machine learning to predict the association constant between cyclodextrin and a guest. Identifying stereoisomers is indeed crucial for machine learning applications. Current tools offer various molecular descriptors, including their textual representation as Isomeric SMILES that can distinguish stereoisomers. However, such representation is text-based and does not have a fixed size, so a conversion is needed to make it usable to machine learning approaches. Word embedding techniques can be used to solve this problem. Mol2vec, a word embedding approach for molecules, offers such a conversion. Unfortunately, it cannot distinguish between stereoisomers due to its inability to capture the spatial configuration of molecular structures. This study proposes several approaches that use word embedding techniques to handle molecular discrimination using stereochemical information on molecules or considering Isomeric SMILES notation as a text in Natural Language Processing. Our aim is to generate a distinct vector for each unique molecule, correctly identifying stereoisomer information in cheminformatics. The proposed approaches are then compared to our original machine learning task: predicting the association constant between cyclodextrin and a guest molecule.
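The point about Isomeric SMILES can be illustrated with a minimal sketch that treats the notation as text. The regular-expression tokenizer, the helper names `tokenize` and `strip_stereo`, and the alanine example strings are assumptions made for illustration; this is not the authors' method, and a real workflow would more likely rely on a cheminformatics toolkit.

```python
import re

# Hypothetical tokenizer (an assumption, not taken from the paper):
# bracket atoms (which carry the @/@@ chirality tags), two-letter halogens,
# two-digit ring closures, single-letter atoms, bond/branch symbols, digits.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|[=#/\\()+\-.:]|\d")


def tokenize(smiles):
    """Split a SMILES string into tokens, keeping stereo markers intact."""
    return TOKEN_RE.findall(smiles)


def strip_stereo(smiles):
    """Crudely drop stereo markers (@, /, \\); for illustration only."""
    return smiles.replace("@", "").replace("/", "").replace("\\", "")


# Two enantiomers of alanine written as Isomeric SMILES.
enantiomer_a = "C[C@H](N)C(=O)O"
enantiomer_b = "C[C@@H](N)C(=O)O"

tokens_a = tokenize(enantiomer_a)
tokens_b = tokenize(enantiomer_b)

print(tokens_a)
print("token sequences differ:", tokens_a != tokens_b)
print("identical without stereo markers:",
      strip_stereo(enantiomer_a) == strip_stereo(enantiomer_b))
```

The token sequences differ only in the chirality-tagged bracket atom, so any embedding built from these tokens can, in principle, assign distinct vectors to the two stereoisomers, whereas a representation that ignores the markers sees a single molecule.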
Affiliation(s)
- Gökhan Tahıl
  - Univ. Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
  - Univ. Artois, CNRS, Centrale Lille, Univ. Lille, UMR 8181, Unité de Catalyse et Chimie du Solide (UCCS), rue Jean Souvraz, SP 18, F-62307 Lens Cedex, France
- Fabien Delorme
  - Univ. Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
- Daniel Le Berre
  - Univ. Artois, CNRS, Centre de Recherche en Informatique de Lens (CRIL), F-62300 Lens, France
- Éric Monflier
  - Univ. Artois, CNRS, Centrale Lille, Univ. Lille, UMR 8181, Unité de Catalyse et Chimie du Solide (UCCS), rue Jean Souvraz, SP 18, F-62307 Lens Cedex, France
- Adlane Sayede
  - Univ. Artois, CNRS, Centrale Lille, Univ. Lille, UMR 8181, Unité de Catalyse et Chimie du Solide (UCCS), rue Jean Souvraz, SP 18, F-62307 Lens Cedex, France
- Sébastien Tilloy
  - Univ. Artois, CNRS, Centrale Lille, Univ. Lille, UMR 8181, Unité de Catalyse et Chimie du Solide (UCCS), rue Jean Souvraz, SP 18, F-62307 Lens Cedex, France
2
Lovrić M, Wang T, Staffe MR, Šunić I, Časni K, Lasky-Su J, Chawes B, Rasmussen MA. A Chemical Structure and Machine Learning Approach to Assess the Potential Bioactivity of Endogenous Metabolites and Their Association with Early Childhood Systemic Inflammation. Metabolites 2024; 14:278. [PMID: 38786755 PMCID: PMC11122766 DOI: 10.3390/metabo14050278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 04/29/2024] [Accepted: 05/08/2024] [Indexed: 05/25/2024] Open
Abstract
Metabolomics has gained much attention due to its potential to reveal molecular disease mechanisms and present viable biomarkers. This work uses a panel of untargeted serum metabolomes from 602 children from the COPSAC2010 mother-child cohort. The annotated part of the metabolome consists of 517 chemical compounds curated using automated procedures. We created a filtering method for the quantified metabolites using predicted quantitative structure-bioactivity relationships for the Tox21 database on nuclear receptors and stress response in cell lines. The metabolites measured in the children's serums are predicted to affect specific targeted models, known for their significance in inflammation, immune function, and health outcomes. The targets from Tox21 have been used as targets with quantitative structure-activity relationships (QSARs). They were trained for ~7000 structures, saved as models, and then applied to the annotated metabolites to predict their potential bioactivities. The models were selected based on strict accuracy criteria surpassing random effects. After application, 52 metabolites showed potential bioactivity based on structural similarity with known active compounds from the Tox21 set. The filtered compounds were subsequently used and weighted by their bioactive potential to show an association with early childhood hs-CRP levels at six months in a linear model supporting a physiological adverse effect on systemic low-grade inflammation.
Affiliation(s)
- Mario Lovrić
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
  - The Lisbon Council, 1040 Brussels, Belgium
- Tingting Wang
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
- Mads Rønnow Staffe
  - Department of Food Science, University of Copenhagen, 1958 Frederiksberg, Denmark
- Iva Šunić
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
- Jessica Lasky-Su
  - Department of Medicine, Boston, MA 02115, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Bo Chawes
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
  - Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2300 Copenhagen, Denmark
- Morten Arendt Rasmussen
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
  - Department of Food Science, University of Copenhagen, 1958 Frederiksberg, Denmark
3
He S, Ye X, Dou L, Sakurai T. FIAMol-AB: A feature fusion and attention-based deep learning method for enhanced antibiotic discovery. Comput Biol Med 2024; 168:107762. [PMID: 38056212 DOI: 10.1016/j.compbiomed.2023.107762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2023] [Revised: 10/31/2023] [Accepted: 11/21/2023] [Indexed: 12/08/2023]
Abstract
Antibiotic resistance continues to be a growing concern for global health, accentuating the need for novel antibiotic discoveries. Traditional methodologies in this field have relied heavily on extensive experimental screening, which is often time-consuming and costly. In contrast, computer-assisted drug screening offers rapid, cost-effective solutions. In this work, we propose FIAMol-AB, a deep learning model that combines graph neural networks, text convolutional networks and molecular fingerprint techniques. The method also uses an attention mechanism to fuse multiple forms of information within the model. The experiments show that FIAMol-AB may offer potential advantages in antibiotic discovery tasks over some existing methods. We conducted some analysis based on our model's results, which helps highlight the potential significance of certain features in the model's predictive performance. Compared to different models, ours demonstrates promising results, indicating potential robustness and versatility. This suggests that by integrating multi-view information and attention mechanisms, FIAMol-AB might better learn complex molecular structures, potentially improving the precision and efficiency of antibiotic discovery. We hope our FIAMol-AB can be used as a useful method in the ongoing fight against antibiotic resistance.
Affiliation(s)
- Shida He
  - Department of Computer Science, University of Tsukuba, Tsukuba, Ibaraki, 305-8577, Japan
- Xiucai Ye
  - Department of Computer Science, University of Tsukuba, Tsukuba, Ibaraki, 305-8577, Japan
- Lijun Dou
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland, OH, 44106, USA
- Tetsuya Sakurai
  - Department of Computer Science, University of Tsukuba, Tsukuba, Ibaraki, 305-8577, Japan
4
Moreira-Filho JT, Neves BJ, Cajas RA, Moraes JD, Andrade CH. Artificial intelligence-guided approach for efficient virtual screening of hits against Schistosoma mansoni. Future Med Chem 2023; 15:2033-2050. [PMID: 37937522 DOI: 10.4155/fmc-2023-0152] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Accepted: 10/06/2023] [Indexed: 11/09/2023] Open
Abstract
Background: The impact of schistosomiasis, which affects over 230 million people, emphasizes the urgency of developing new antischistosomal drugs. Artificial intelligence is vital in accelerating the drug discovery process. Methodology & results: We developed classification and regression machine learning models to predict the schistosomicidal activity of compounds not experimentally tested. The prioritized compounds were tested on schistosomula and adult stages of Schistosoma mansoni. Four compounds demonstrated significant activity against schistosomula, with 50% effective concentration values ranging from 9.8 to 32.5 μM, while exhibiting no toxicity in animal and human cell lines. Conclusion: These findings represent a significant step forward in the discovery of antischistosomal drugs. Further optimization of these active compounds can pave the way for their progression into preclinical studies.
Affiliation(s)
- José Teófilo Moreira-Filho
  - Laboratory of Molecular Modeling and Drug Design (LabMol), Faculdade de Farmácia, Universidade Federal de Goiás, Goiânia, 74605-170, Brazil
- Bruno Junior Neves
  - Laboratory of Molecular Modeling and Drug Design (LabMol), Faculdade de Farmácia, Universidade Federal de Goiás, Goiânia, 74605-170, Brazil
- Rayssa Araujo Cajas
  - Research Center on Neglected Diseases (NPDN), Universidade Guarulhos, Guarulhos, 07023-070, Brazil
- Josué de Moraes
  - Research Center on Neglected Diseases (NPDN), Universidade Guarulhos, Guarulhos, 07023-070, Brazil
- Carolina Horta Andrade
  - Laboratory of Molecular Modeling and Drug Design (LabMol), Faculdade de Farmácia, Universidade Federal de Goiás, Goiânia, 74605-170, Brazil
  - Center for the Research and Advancement in Fragments and molecular Targets (CRAFT), School of Pharmaceutical Sciences at Ribeirao Preto, University of São Paulo, Ribeirão Preto, SP, Brazil
5
Feature Selection for the Interpretation of Antioxidant Mechanisms in Plant Phenolics. Molecules 2023; 28:1454. [PMID: 36771125 PMCID: PMC9921549 DOI: 10.3390/molecules28031454] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/21/2022] [Revised: 01/16/2023] [Accepted: 01/31/2023] [Indexed: 02/05/2023] Open
Abstract
Antioxidants, represented by plant phenolics, protect living tissues by scavenging reactive oxygen species through diverse reaction mechanisms. Research on antioxidants is often individualized, for example, focusing on the evaluation of their activity against a single reactive oxygen species or examining the antioxidant properties of compounds with similar structures. In this study, multivariate analysis was used to comprehensively examine antioxidant properties. Eighteen features were selected to explain the results of the antioxidant capacity tests. These selected features were then evaluated by supervised learning, using the results of the antioxidant capacity assays. Dimension-reduction techniques were also used to represent the compound space with antioxidants as a two-dimensional distribution. A small amount of data obtained from several assays provided us with comprehensive information on the relationships between the structures and activities of antioxidants.
6
Hyper-Mol: Molecular Representation Learning via Fingerprint-Based Hypergraph. Computational Intelligence and Neuroscience 2023; 2023:3756102. [PMID: 36776618 PMCID: PMC9908364 DOI: 10.1155/2023/3756102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Revised: 10/14/2022] [Accepted: 10/15/2022] [Indexed: 02/04/2023]
Abstract
With the development of artificial intelligence (AI) in the field of drug design and discovery, learning informative representations of molecules is becoming crucial for those AI-driven tasks. In recent years, the graph neural networks (GNNs) have emerged as a preferred choice of deep learning architecture and have been successfully applied to molecular representation learning (MRL). Up-to-date MRL methods directly apply the message passing mechanism on the atom-level attributes (i.e., atoms and bonds) of molecules. However, they neglect latent yet significant hyperstructured knowledge, such as the information of pharmacophore or functional class. Hence, in this paper, we propose Hyper-Mol, a new MRL framework that applies GNNs to encode hypergraph structures of molecules via fingerprint-based features. Hyper-Mol explores the hyperstructured knowledge and the latent relationships of the fingerprint substructures from a hypergraph perspective. The molecular hypergraph generation algorithm is designed to depict the hyperstructured information with the physical and chemical characteristics of molecules. Thus, the fingerprint-level message passing process can encode both the intra-structured and inter-structured information of fingerprint substructures according to the molecular hypergraphs. We evaluate Hyper-Mol on molecular property prediction tasks, and the experimental results on real-world benchmarks show that Hyper-Mol can learn comprehensive hyperstructured knowledge of molecules and is superior to the state-of-the-art baselines.
7
Hu X, Du T, Dai S, Wei F, Chen X, Ma S. Identification of intrinsic hepatotoxic compounds in Polygonum multiflorum Thunb. using machine-learning methods. Journal of Ethnopharmacology 2022; 298:115620. [PMID: 35963419 DOI: 10.1016/j.jep.2022.115620] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 05/19/2022] [Revised: 08/01/2022] [Accepted: 08/05/2022] [Indexed: 05/02/2023]
Abstract
ETHNOPHARMACOLOGICAL RELEVANCE: Polygonum multiflorum Thunb. (PM) is a herb, extracts of which have been used as Chinese medicine for years. Although it is believed to be beneficial to the liver, heart, and kidneys, it causes idiosyncratic drug-induced liver injury (DILI). AIM OF THE STUDY: We propose that the intrinsic DILI caused by natural products in PM (NPPM) is an important complementary mechanism to PM-related herb-induced liver injury, and aim to identify the ingredients with high DILI potential by machine learning methods. MATERIALS AND METHODS: One hundred and ninety-seven NPPM were collected from the literature to identify the intrinsic hepatotoxic compounds. Additionally, a DILI-labeled dataset consisting of 2384 compounds was collected and randomly split into training and test sets. A diparametric optimization method was developed to tune the parameters of extended-connectivity fingerprints (ECFPs), RDKit, and atom-pair fingerprints as well as those of machine-learning (ML) algorithms. Subsequently, K-means clustering was employed to cluster the NPPM that were predicted to have a high DILI risk. An in vitro cell-viability assay was performed using HepaRG cells to validate the prediction results. RESULTS: ECFPs with the top 35% of features ranked by the F-value with support vector machine (SVM) yielded the best performance. The optimized SVM model achieved an accuracy of 0.761 and recall value of 0.834 on the test dataset. The in silico screening for NPPM resulted in 47 ingredients with high DILI potential, which were clustered into six groups based on the elbow method. A representative subgroup that contained 21 ingredients, of which two dianthrones exhibited the lowest IC50 value (0.7-0.9 μM) and anthraquinones showed moderate toxicity (15-25 μM), was constructed. CONCLUSION: Using ML methods and in vitro screening, two classes of compounds, dianthrones and anthraquinones, were predicted and validated to have a high risk of DILI.
The diparametric optimization method used in this study could provide a useful and powerful tool to screen toxicants for large datasets and is available at https://github.com/dreadlesss/Hepatotoxicity_predictor.
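For readers less familiar with the two test-set metrics quoted in the abstract, the following minimal sketch shows how accuracy and recall are computed for a binary classifier. The function names and the toy label vectors are hypothetical and chosen only for illustration; they are not data from the study and do not reproduce its reported values.

```python
def accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that the model labels positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical labels: 1 = high DILI risk, 0 = low DILI risk.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("recall:", recall(y_true, y_pred))
```

In a toxicity screen, recall is often the more important of the two because a missed toxicant (false negative) is usually costlier than a false alarm.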
Affiliation(s)
- Xiaowen Hu
  - National Institutes for Food and Drug Control, Institute for Control of Chinese Traditional Medicine and Ethnic Medicine, Beijing, 102629, China
- Tingting Du
  - Chinese Academy of Medical Science and Peking Union Medical College, Institute of Materia Medica, Beijing, 100006, China
- Shengyun Dai
  - National Institutes for Food and Drug Control, Institute for Control of Chinese Traditional Medicine and Ethnic Medicine, Beijing, 102629, China
- Feng Wei
  - National Institutes for Food and Drug Control, Institute for Control of Chinese Traditional Medicine and Ethnic Medicine, Beijing, 102629, China
- Xiaoguang Chen
  - Chinese Academy of Medical Science and Peking Union Medical College, Institute of Materia Medica, Beijing, 100006, China
- Shuangcheng Ma
  - National Institutes for Food and Drug Control, Institute for Control of Chinese Traditional Medicine and Ethnic Medicine, Beijing, 102629, China
8
Yang J, Cai Y, Zhao K, Xie H, Chen X. Concepts and applications of chemical fingerprint for hit and lead screening. Drug Discov Today 2022; 27:103356. [PMID: 36113834 DOI: 10.1016/j.drudis.2022.103356] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 12/07/2021] [Revised: 07/28/2022] [Accepted: 09/08/2022] [Indexed: 11/22/2022]
Abstract
Molecular fingerprints are used to represent chemical (structural, physicochemical, etc.) properties of large-scale chemical sets in a low computational cost way. They have a prominent role in transforming chemical data sets into consistent input formats (bit strings or numeric values) suitable for in silico approaches. In this review, we summarize and classify common and state-of-the-art fingerprints into eight different types (dictionary based, circular, topological, pharmacophore, protein-ligand interaction, shape based, reinforced, and multi). We also highlight applications of fingerprints in early drug research and development (R&D). Thus, this review provides a guide for the selection of appropriate fingerprints of compounds (or ligand-protein complexes) for use in drug R&D.
Affiliation(s)
- Jingbo Yang
  - Department of Pharmagenomics, College of Bioinformatics Science and Technology, Harbin Medical University, 150081 Harbin, Heilongjiang, China
- Yiyang Cai
  - Department of Pharmagenomics, College of Bioinformatics Science and Technology, Harbin Medical University, 150081 Harbin, Heilongjiang, China
- Kairui Zhao
  - Department of Pharmagenomics, College of Bioinformatics Science and Technology, Harbin Medical University, 150081 Harbin, Heilongjiang, China
- Hongbo Xie
  - Department of Pharmagenomics, College of Bioinformatics Science and Technology, Harbin Medical University, 150081 Harbin, Heilongjiang, China
- Xiujie Chen
  - Department of Pharmagenomics, College of Bioinformatics Science and Technology, Harbin Medical University, 150081 Harbin, Heilongjiang, China
9
Devillers J, Sartor V, Devillers H. Predicting mosquito repellents for clothing application from molecular fingerprint-based artificial neural network SAR models. SAR and QSAR in Environmental Research 2022; 33:729-751. [PMID: 36106833 DOI: 10.1080/1062936x.2022.2124014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2022] [Accepted: 09/06/2022] [Indexed: 06/15/2023]
Abstract
Spraying repellents on clothing limits toxicity and allergy problems that can occur when the repellents are directly applied to skin. This also allows the use of higher doses to ensure longer lasting effects. As the number of repellents available on the market is limited, it is necessary to propose new ones, especially by using in silico methods that reduce costs and time. In this context SAR models were built from a dataset of 2027 chemicals for which repellent activity on clothing was measured against Aedes aegypti. The interest of using either the ECFP or MACCS fingerprints as input neurons of a three-layer perceptron was evaluated. Transformation of MACCS bit strings into disjunctive tables led to interesting results. Models obtained with both types of fingerprints were compared to a model including physicochemical and topological descriptors.
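The abstract does not describe how the MACCS bit strings were turned into disjunctive tables. A common reading is complete disjunctive coding, in which every bit becomes two mutually exclusive indicator columns (present, absent). The sketch below assumes that interpretation; the function name `to_disjunctive` and the four-bit example are hypothetical.

```python
def to_disjunctive(bits):
    """Expand each bit of a fingerprint string into a (present, absent) pair.

    Assumption: complete disjunctive coding, so a fingerprint of n bits
    becomes a row of 2n indicator values with exactly n ones.
    """
    row = []
    for bit in bits:
        if bit not in "01":
            raise ValueError("fingerprint must contain only '0' and '1'")
        row.extend((1, 0) if bit == "1" else (0, 1))
    return row


# Hypothetical 4-bit fingerprint (real MACCS keys are much longer).
fingerprint = "1010"
encoded = to_disjunctive(fingerprint)
print(encoded)  # [1, 0, 0, 1, 1, 0, 0, 1]
```

With this coding the absence of a structural key is represented by an active input rather than a zero, which can make absent features easier for a perceptron to weight explicitly.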
Affiliation(s)
- V Sartor
  - Laboratoire des IMRCP, Université de Toulouse, CNRS UMR 5623, Université Toulouse III - Paul Sabatier, Toulouse, France
- H Devillers
  - SPO, Univ Montpellier, INRAE, Institut Agro, Montpellier, France
10
Zierep PF, Vita R, Blazeska N, Moumbock AFA, Greenbaum JA, Peters B, Günther S. Towards the prediction of non-peptidic epitopes. PLoS Comput Biol 2022; 18:e1009151. [PMID: 35180214 PMCID: PMC8893639 DOI: 10.1371/journal.pcbi.1009151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2021] [Revised: 03/03/2022] [Accepted: 01/23/2022] [Indexed: 11/19/2022] Open
Abstract
In-silico methods for the prediction of epitopes can support and improve workflows for vaccine design, antibody production, and disease therapy. So far, the scope of B cell and T cell epitope prediction has been directed exclusively towards peptidic antigens. Nevertheless, various non-peptidic molecular classes can be recognized by immune cells. These compounds have not been systematically studied yet, and prediction approaches are lacking. The ability to predict the epitope activity of non-peptidic compounds could have vast implications; for example, for immunogenic risk assessment of the vast number of drugs and other xenobiotics. Here we present the first general attempt to predict the epitope activity of non-peptidic compounds using the Immune Epitope Database (IEDB) as a source for positive samples. The molecules stored in the Chemical Entities of Biological Interest (ChEBI) database were chosen as background samples. The molecules were clustered into eight homogeneous molecular groups, and classifiers were built for each cluster with the aim of separating the epitopes from the background. Different molecular feature encoding schemes and machine learning models were compared against each other. For those models where a high performance could be achieved based on simple decision rules, the molecular features were then further investigated. Additionally, the findings were used to build a web server that allows for the immunogenic investigation of non-peptidic molecules (http://tools-staging.iedb.org/np_epitope_predictor). The prediction quality was tested with samples from independent evaluation datasets, and the implemented method achieved noteworthy Receiver Operating Characteristic-Area Under Curve (ROC-AUC) values, ranging from 0.69 to 0.96 depending on the molecule cluster.
Small molecules found in cosmetics, foodstuffs, dyes, and industrial materials, but also those produced by plants, bacteria, and animals can trigger strong reactions of the human immune system and can therefore be hazardous to health. In the present work, several thousand immune-reactive small molecules (so-called non-peptidic epitopes) were classified by molecular structure and studied with the aim of identifying specific parts of the molecules responsible for such immune responses. Using a machine-learning approach (random forests and neural networks), we identified some substructures that appear strikingly often in non-peptidic epitopes and which may be responsible for the hazardous immune response. Such knowledge may help to explain allergic reactions to chemicals and also to minimize the health risks of new chemicals in industrial production. To support this endeavor, we have implemented the method in a publicly available web application. This can be used for the prediction and identification of non-peptidic epitopes and their underlying substructures.
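The ROC-AUC values quoted in the abstract can be read as the probability that a randomly chosen epitope is scored higher than a randomly chosen background molecule. The sketch below computes the metric directly from that pairwise definition. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(n_pos * n_neg) and is meant only as a
    readable sketch, not an efficient implementation.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = epitope, 0 = background molecule.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print("ROC-AUC:", roc_auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the per-cluster range reported above.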
Affiliation(s)
- Paul F. Zierep
  - Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Randi Vita
  - Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, California, United States of America
- Nina Blazeska
  - Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, California, United States of America
- Aurélien F. A. Moumbock
  - Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Jason A. Greenbaum
  - Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, California, United States of America
- Bjoern Peters
  - Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, California, United States of America
- Stefan Günther
  - Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
11
Lovrić M, Đuričić T, Tran HTN, Hussain H, Lacić E, Rasmussen MA, Kern R. Should We Embed in Chemistry? A Comparison of Unsupervised Transfer Learning with PCA, UMAP, and VAE on Molecular Fingerprints. Pharmaceuticals (Basel) 2021; 14:758. [PMID: 34451855 PMCID: PMC8400160 DOI: 10.3390/ph14080758] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 06/30/2021] [Revised: 07/21/2021] [Accepted: 07/22/2021] [Indexed: 02/07/2023] Open
Abstract
Methods for dimensionality reduction are showing significant contributions to knowledge generation in high-dimensional modeling scenarios throughout many disciplines. By achieving a lower dimensional representation (also called embedding), fewer computing resources are needed in downstream machine learning tasks, thus leading to a faster training time, lower complexity, and statistical flexibility. In this work, we investigate the utility of three prominent unsupervised embedding techniques (principal component analysis-PCA, uniform manifold approximation and projection-UMAP, and variational autoencoders-VAEs) for solving classification tasks in the domain of toxicology. To this end, we compare these embedding techniques against a set of molecular fingerprint-based models that do not utilize additional preprocessing of features. Inspired by the success of transfer learning in several fields, we further study the performance of embedders when trained on an external dataset of chemical compounds. To gain a better understanding of their characteristics, we evaluate the embedders with different embedding dimensionalities, and with different sizes of the external dataset. Our findings show that the recently popularized UMAP approach can be utilized alongside known techniques such as PCA and VAE as a pre-compression technique in the toxicology domain. Nevertheless, the generative model of VAE shows an advantage in pre-compressing the data with respect to classification accuracy.
Affiliation(s)
- Mario Lovrić
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
- Tomislav Đuričić
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
  - Institute of Interactive Systems and Data Science, Graz University of Technology, Inffeldgasse 16C, 8010 Graz, Austria
- Han T. N. Tran
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
- Hussain Hussain
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
  - Institute of Interactive Systems and Data Science, Graz University of Technology, Inffeldgasse 16C, 8010 Graz, Austria
- Emanuel Lacić
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
- Morten A. Rasmussen
  - Copenhagen Studies on Asthma in Childhood, Herlev-Gentofte Hospital, University of Copenhagen, Ledreborg Alle 34, 2820 Gentofte, Denmark
  - Department of Food Science, University of Copenhagen, Rolighedsvej 26, 1958 Frederiksberg, Denmark
- Roman Kern
  - Know-Center, Inffeldgasse 13, 8010 Graz, Austria
  - Institute of Interactive Systems and Data Science, Graz University of Technology, Inffeldgasse 16C, 8010 Graz, Austria
12
McGill C, Forsuelo M, Guan Y, Green WH. Predicting Infrared Spectra with Message Passing Neural Networks. J Chem Inf Model 2021; 61:2594-2609. [PMID: 34048221 DOI: 10.1021/acs.jcim.1c00055] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Indexed: 11/30/2022]
Abstract
Infrared (IR) spectroscopy remains an important tool for chemical characterization and identification. Chemprop-IR has been developed as a software package for the prediction of IR spectra through the use of machine learning. This work serves the dual purpose of providing a trained general-purpose model for the prediction of IR spectra with ease and providing the Chemprop-IR software framework for the training of new models. In Chemprop-IR, molecules are encoded using a directed message passing neural network, allowing for molecule latent representations to be learned and optimized for the task of spectral predictions. Model training incorporates spectra metrics and normalization techniques that offer better performance with spectral predictions than standard practice in regression models. The model makes use of pretraining using quantum chemistry calculations and ensembling of multiple submodels to improve generalizability and performance. The spectral predictions that result are of high quality, showing capability to capture the extreme diversity of spectral forms over chemical space and represent complex peak structures.
Affiliation(s)
- Charles McGill
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Michael Forsuelo
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Yanfei Guan
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- William H Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
13
|
Lovrić M, Malev O, Klobučar G, Kern R, Liu JJ, Lučić B. Predictive Capability of QSAR Models Based on the CompTox Zebrafish Embryo Assays: An Imbalanced Classification Problem. Molecules 2021; 26:1617. [PMID: 33803931 PMCID: PMC7998177 DOI: 10.3390/molecules26061617] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/05/2021] [Revised: 03/03/2021] [Accepted: 03/11/2021] [Indexed: 02/06/2023] Open
Abstract
The CompTox Chemistry Dashboard (ToxCast) contains one of the largest public databases on Zebrafish (Danio rerio) developmental toxicity. The data consists of 19 toxicological endpoints on 1018 unique compounds measured in relatively low concentration ranges. The endpoints are related to developmental effects occurring in dechorionated zebrafish embryos for 120 hours post fertilization and monitored via gross malformations and mortality. We report the predictive capability of 209 quantitative structure-activity relationship (QSAR) models developed by machine learning methods using penalization techniques and diverse model quality metrics to cope with the imbalanced endpoints. All these QSAR models were generated to test how the imbalanced classification (toxic or non-toxic) endpoints could be predicted regardless of which of three algorithms is used: logistic regression, multi-layer perceptron, or random forests. Additionally, QSAR toxicity models are developed starting from sets of classical molecular descriptors, structural fingerprints and their combinations. Only 8 out of 209 models passed the 0.20 Matthews correlation coefficient value defined a priori as a threshold for acceptable model quality on the test sets. The best models were obtained for the endpoints mortality (MORT), ActivityScore and JAW (deformation). The low predictability of the QSAR model developed from the zebrafish embryotoxicity data in the database is mainly due to a higher sensitivity of 19 measurements of endpoints carried out on dechorionated embryos at low concentrations.
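To illustrate why the abstract above relies on the Matthews correlation coefficient (MCC) rather than plain accuracy for imbalanced endpoints, the following minimal sketch computes both from confusion-matrix counts. The counts are hypothetical and are not taken from the study; only the 0.20 threshold comes from the abstract, and the function name is chosen here for illustration.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional value when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced test set: 1000 compounds, 50 of them toxic.
tp, fn = 5, 45    # toxic compounds: correctly flagged vs. missed
fp, tn = 20, 930  # non-toxic compounds: falsely flagged vs. correctly cleared

accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)

# Threshold for acceptable model quality quoted in the abstract.
passes_threshold = mcc > 0.20

print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}, passes = {passes_threshold}")
```

In this made-up case the accuracy is high because most compounds are non-toxic, while the MCC stays well below the 0.20 threshold because most toxic compounds are missed.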
Affiliation(s)
- Mario Lovrić
- Know-Center, Inffeldgasse 13, 8010 Graz, Austria; (M.L.); (R.K.)
- Ruđer Bošković Institute, P.O. Box 180, 10002 Zagreb, Croatia
- Olga Malev
- Ruđer Bošković Institute, P.O. Box 180, 10002 Zagreb, Croatia
- Department of Biology, Faculty of Science, University of Zagreb, Rooseveltov Trg 6, 10000 Zagreb, Croatia
- Göran Klobučar
- Department of Biology, Faculty of Science, University of Zagreb, Rooseveltov Trg 6, 10000 Zagreb, Croatia
- Roman Kern
- Know-Center, Inffeldgasse 13, 8010 Graz, Austria; (M.L.); (R.K.)
- Institute of Interactive Systems and Data Science, TU Graz, Inffeldgasse 16c, 8010 Graz, Austria
- Jay J. Liu
- Department of Chemical Engineering, Pukyong National University, Busan 608-739, Korea
- Bono Lučić
- Ruđer Bošković Institute, P.O. Box 180, 10002 Zagreb, Croatia
14
|
Challa AP, Beam AL, Shen M, Peryea T, Lavieri RR, Lippmann ES, Aronoff DM. Machine learning on drug-specific data to predict small molecule teratogenicity. Reprod Toxicol 2020; 95:148-158. [PMID: 32428651 PMCID: PMC7577422 DOI: 10.1016/j.reprotox.2020.05.004] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 12/05/2019] [Revised: 05/04/2020] [Accepted: 05/06/2020] [Indexed: 12/23/2022]
Abstract
Pregnant women are an especially vulnerable population, given the sensitivity of a developing fetus to chemical exposures. However, prescribing behavior for the gravid patient is guided by limited human data and conflicting cases of adverse outcomes due to the exclusion of pregnant populations from randomized, controlled trials. These factors increase risk for adverse drug outcomes and reduce quality of care for pregnant populations. Herein, we propose the application of artificial intelligence to systematically predict the teratogenicity of a prescriptible small molecule from information inherent to the drug. Using unsupervised and supervised machine learning, our model probes all small molecules with known structure and teratogenicity data published in research-amenable formats to identify patterns among structural, meta-structural, and in vitro bioactivity data for each drug and its teratogenicity score. With this workflow, we discovered three chemical functionalities that predispose a drug towards increased teratogenicity and two moieties with potentially protective effects. Our models predict three clinically relevant classes of teratogenicity with AUC = 0.8 and nearly double the predictive accuracy of a blind control for the same task, suggesting successful modeling. We also present extensive barriers to translational research that restrict data-driven studies in pregnancy and therapeutically "orphan" pregnant populations. Collectively, this work represents a first-in-kind platform for the application of computing to study and predict teratogenicity.
Affiliation(s)
- Anup P Challa
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville 37203, TN, United States; Department of Biomedical Informatics, Harvard Medical School, Boston 02115, MA, United States; National Center for Advancing Translational Sciences, National Institutes of Health, Rockville 20850, MD, United States; Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville 37212, TN, United States
- Andrew L Beam
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston 02115, MA, United States; Department of Biomedical Informatics, Harvard Medical School, Boston 02115, MA, United States
- Min Shen
- National Center for Advancing Translational Sciences, National Institutes of Health, Rockville 20850, MD, United States
- Tyler Peryea
- National Center for Advancing Translational Sciences, National Institutes of Health, Rockville 20850, MD, United States
- Robert R Lavieri
- Vanderbilt Institute for Clinical and Translational Research, Vanderbilt University Medical Center, Nashville 37203, TN, United States
- Ethan S Lippmann
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville 37212, TN, United States
- David M Aronoff
- Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville 37203, TN, United States; Department of Obstetrics and Gynecology, Vanderbilt University Medical Center, Nashville 37203, TN, United States; Department of Pathology, Microbiology and Immunology, Vanderbilt University Medical Center, Nashville 37203, TN, United States
15
|
Revealing cytotoxic substructures in molecules using deep learning. J Comput Aided Mol Des 2020; 34:731-746. [PMID: 32297073 PMCID: PMC7292813 DOI: 10.1007/s10822-020-00310-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 02/04/2020] [Accepted: 03/16/2020] [Indexed: 01/22/2023]
Abstract
In drug development, late stage toxicity issues of a compound are the main cause of failure in clinical trials. In silico methods are therefore of high importance to guide the early design process to reduce time, costs and animal testing. Technical advances and the ever growing amount of available toxicity data enabled machine learning, especially neural networks, to impact the field of predictive toxicology. In this study, cytotoxicity prediction, one of the earliest handles in drug discovery, is investigated using a deep learning approach trained on a highly consistent in-house data set of over 34,000 compounds with a share of less than 5% of cytotoxic molecules. The model reached a balanced accuracy of over 70%, similar to previously reported studies using Random Forest. Albeit yielding good results, neural networks are often described as a black box lacking deeper mechanistic understanding of the underlying model. To overcome this absence of interpretability, a Deep Taylor Decomposition method is investigated to identify substructures that may be responsible for the cytotoxic effects, the so-called toxicophores. Furthermore, this study introduces cytotoxicity maps which provide a visual structural interpretation of the relevance of these substructures. Using this approach could be helpful in drug development to predict the potential toxicity of a compound as well as to generate new insights into the toxic mechanism. Moreover, it could also help to de-risk and optimize compounds.
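The abstract above reports balanced accuracy because fewer than 5% of the compounds are cytotoxic. As a minimal sketch of what that metric measures, the example below compares plain accuracy and balanced accuracy for a trivial classifier that labels every compound non-toxic. The counts are hypothetical and are not taken from the study's data set.

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical data set: 1000 compounds, 40 cytotoxic (4%).
# A trivial model that predicts "non-toxic" for everything:
tp, fn = 0, 40
fp, tn = 0, 960

plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
bal_accuracy = balanced_accuracy(tp, tn, fp, fn)

print(f"accuracy = {plain_accuracy:.2f}, balanced accuracy = {bal_accuracy:.2f}")
```

The trivial model scores 0.96 on plain accuracy but only 0.50 on balanced accuracy, which is why a balanced accuracy above 70% is informative on data this skewed.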
16
|
Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019; 18:463-477. [PMID: 30976107 DOI: 10.1038/s41573-019-0024-5] [Citation(s) in RCA: 931] [Impact Index Per Article: 186.2] [Indexed: 02/07/2023]
Abstract
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.
Affiliation(s)
- Jessica Vamathevan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Dominic Clark
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Ian Dunham
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Edgardo Ferran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- George Lee
- Bristol-Myers Squibb, Princeton, NJ, USA
- Bin Li
- Takeda Pharmaceuticals International Co., Cambridge, MA, USA
- Anant Madabhushi
- Case Western Reserve University, Cleveland, OH, USA; Louis Stokes Cleveland Veterans Affair Medical Center, Cleveland, OH, USA
- Michaela Spitzer
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Shanrong Zhao
- Pfizer Worldwide Research and Development, Cambridge, MA, USA
17
|
|
18
|
Vachery J, Ranu S. RISC: Rapid Inverted-Index Based Search of Chemical Fingerprints. J Chem Inf Model 2019; 59:2702-2713. [PMID: 30908028 DOI: 10.1021/acs.jcim.9b00069] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
The ability to search for a query molecule on massive molecular repositories is a fundamental task in chemoinformatics and drug discovery. Chemical fingerprints are commonly used to characterize the structure and properties of molecules. Some fingerprints, particularly unfolded fingerprints, are often of extremely high dimension and sparse, with only a few features having a positive value. In this work, we propose a new searching algorithm, RISC, which exploits sparsity in high-dimensional fingerprints to derive effective pruning mechanisms and dramatically speed up searching. RISC is robust enough to work on both binary and nonbinary chemical fingerprints. Extensive experiments on Range Queries and Top-k Queries across several molecular repositories demonstrate that at fingerprints of dimension 2048 and above, which is often the case with unfolded fingerprints, RISC is consistently faster than the state-of-the-art techniques. The source code of our implementation is available at http://www.cse.iitd.ac.in/~sayan/software.html.
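The general idea of using an inverted index over sparse binary fingerprints can be sketched as follows. This is not the RISC algorithm and includes none of its pruning mechanisms; it is a plain inverted-index range query with Tanimoto similarity, written under the assumption that each fingerprint is stored as a set of its on-bit positions. All names and the toy data are hypothetical.

```python
from collections import defaultdict


def build_index(fingerprints):
    """Map each on-bit position to the list of molecule ids that set it."""
    index = defaultdict(list)
    for mol_id, bits in fingerprints.items():
        for bit in bits:
            index[bit].append(mol_id)
    return index


def range_query(query_bits, fingerprints, index, threshold):
    """Return (mol_id, tanimoto) pairs with similarity >= threshold (threshold > 0)."""
    # Count shared on-bits by walking only the posting lists of the query's bits.
    shared = defaultdict(int)
    for bit in query_bits:
        for mol_id in index.get(bit, ()):
            shared[mol_id] += 1

    hits = []
    for mol_id, common in shared.items():
        union = len(query_bits) + len(fingerprints[mol_id]) - common
        similarity = common / union
        if similarity >= threshold:
            hits.append((mol_id, similarity))
    # Most similar molecules first.
    return sorted(hits, key=lambda hit: -hit[1])


# Toy repository: molecule id -> set of on-bit positions.
fingerprints = {
    "a": {1, 5, 9},
    "b": {1, 5, 7, 9},
    "c": {2, 3},
    "d": {9, 100},
}
index = build_index(fingerprints)
hits = range_query({1, 5, 9}, fingerprints, index, threshold=0.7)
print(hits)
```

Because the fingerprints are sparse, molecules that share no on-bit with the query (such as "c" here) are never visited, which is the basic saving an inverted index offers over a linear scan.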
Affiliation(s)
- Jithin Vachery
- Department of Computer Science, IIT-Madras, Chennai, 600036, India
- Sayan Ranu
- Department of Computer Science, IIT-Delhi, New Delhi, 110016, India
19
|
Van Vleet TR, Liguori MJ, Lynch JJ, Rao M, Warder S. Screening Strategies and Methods for Better Off-Target Liability Prediction and Identification of Small-Molecule Pharmaceuticals. SLAS DISCOVERY 2018; 24:1-24. [PMID: 30196745 DOI: 10.1177/2472555218799713] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Indexed: 01/09/2023]
Abstract
Pharmaceutical discovery and development is a long and expensive process that, unfortunately, still results in a low success rate, with drug safety continuing to be a major impediment. Improved safety screening strategies and methods are needed to more effectively fill this critical gap. Recent advances in informatics are now making it possible to manage bigger data sets and integrate multiple sources of screening data in a manner that can potentially improve the selection of higher-quality drug candidates. Integrated screening paradigms have become the norm in Pharma, both in discovery screening and in the identification of off-target toxicity mechanisms during later-stage development. Furthermore, advances in computational methods are making in silico screens more relevant and suggest that they may represent a feasible option for augmenting the current screening paradigm. This paper outlines several fundamental methods of the current drug screening processes across Pharma and emerging techniques/technologies that promise to improve molecule selection. In addition, the authors discuss integrated screening strategies and provide examples of advanced screening paradigms.
Affiliation(s)
- Terry R Van Vleet
- Department of Investigative Toxicology and Pathology, AbbVie, N Chicago, IL, USA
- Michael J Liguori
- Department of Investigative Toxicology and Pathology, AbbVie, N Chicago, IL, USA
- James J Lynch
- Department of Integrated Science and Technology, AbbVie, N Chicago, IL, USA
- Mohan Rao
- Department of Investigative Toxicology and Pathology, AbbVie, N Chicago, IL, USA
- Scott Warder
- Department of Target Enabling Science and Technology, AbbVie, N Chicago, IL, USA