1
Le T, Winter R, Noé F, Clevert DA. Neuraldecipher - reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures. Chem Sci 2020; 11:10378-10389. [PMID: 34094299 PMCID: PMC8162443 DOI: 10.1039/d0sc03115a] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/03/2020] [Accepted: 09/10/2020] [Indexed: 12/22/2022]
Abstract
Protecting molecular structures from disclosure against external parties is of great relevance for industrial and private associations, such as pharmaceutical companies. Within the framework of external collaborations, it is common to exchange datasets by encoding the molecular structures into descriptors. Molecular fingerprints such as the extended-connectivity fingerprints (ECFPs) are frequently used for such an exchange, because they typically perform well on quantitative structure-activity relationship tasks. ECFPs are often considered to be non-invertible due to the way they are computed. In this paper, we present a fast reverse-engineering method to deduce the molecular structure given revealed ECFPs. Our method includes the Neuraldecipher, a neural network model that predicts a compact vector representation of compounds, given ECFPs. We then utilize another pre-trained model to retrieve the molecular structure as a SMILES representation. We demonstrate that our method is able to reconstruct molecular structures to some extent, and that reconstruction improves when ECFPs with larger fingerprint sizes are revealed. For example, given ECFP count vectors of length 4096, we are able to correctly deduce up to 69% of molecular structures on a validation set (112K unique samples) with our method.
Affiliation(s)
- Tuan Le
- Department of Digital Technologies, Bayer AG, Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Robin Winter
- Department of Digital Technologies, Bayer AG, Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
- Frank Noé
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin, Germany
2
Keyvanpour MR, Shirzad MB. An Analysis of QSAR Research Based on Machine Learning Concepts. Curr Drug Discov Technol 2020; 18:17-30. [PMID: 32178612 DOI: 10.2174/1570163817666200316104404] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/08/2019] [Revised: 08/22/2019] [Accepted: 10/28/2019] [Indexed: 11/22/2022]
Abstract
Quantitative Structure-Activity Relationship (QSAR) is a popular approach developed to correlate chemical molecules with their biological activities based on their chemical structures. Machine learning techniques have proved to be promising solutions to QSAR modeling. Due to the significant role of machine learning strategies in QSAR modeling, this area of research has attracted much attention from researchers. A considerable amount of literature has been published on machine learning based QSAR modeling methodologies, yet this domain still lacks a recent and comprehensive analysis of these algorithms. This study systematically reviews the application of machine learning algorithms in QSAR, aiming to provide an analytical framework. For this purpose, we present a framework called 'ML-QSAR'. This framework has been designed for future research to: a) facilitate the selection of proper strategies among existing algorithms according to the application area requirements, b) help to develop and improve current methods, and c) provide a platform to study existing methodologies comparatively. In ML-QSAR, a structured categorization of QSAR modeling research based on machine learning models is presented first. Then several criteria are introduced in order to assess the models. Finally, guided by these criteria, a qualitative analysis is carried out.
Affiliation(s)
- Mehrnoush Barani Shirzad
- Data Mining Research Laboratory, Department of Computer Engineering, Alzahra University, Tehran, Iran
3
Pastor M, Quintana J, Sanz F. Development of an Infrastructure for the Prediction of Biological Endpoints in Industrial Environments. Lessons Learned at the eTOX Project. Front Pharmacol 2018; 9:1147. [PMID: 30364191 PMCID: PMC6193068 DOI: 10.3389/fphar.2018.01147] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 07/13/2018] [Accepted: 09/21/2018] [Indexed: 11/13/2022]
Abstract
In silico methods are increasingly being used for assessing the chemical safety of substances, as a part of integrated approaches involving in vitro and in vivo experiments. A paradigmatic example of these strategies is the eTOX project (http://www.etoxproject.eu), funded by the European Innovative Medicines Initiative (IMI), which aimed at producing high quality predictions of in vivo toxicity of drug candidates and resulted in generating about 200 models for diverse endpoints of toxicological interest. In an industry-oriented project like eTOX, apart from the predictive quality, the models need to meet other quality parameters related to the procedures for their generation and their intended use. For example, when the models are used for predicting the properties of drug candidates, the prediction system must guarantee the complete confidentiality of the compound structures. The interface of the system must be designed to provide non-expert users with all the information required to choose the models and appropriately interpret the results. Moreover, procedures like installation, maintenance, documentation, validation and versioning, which are common in software development, must also be implemented for the models and for the prediction platform in which they are deployed. In this article we describe our experience in the eTOX project and the lessons learned after 7 years of close collaboration between industrial and academic partners. We believe that some of the solutions found and the tools developed could be useful for supporting similar initiatives in the future.
Affiliation(s)
- Ferran Sanz
- *Correspondence: Manuel Pastor, Ferran Sanz
4
Gedeck P, Skolnik S, Rodde S. Developing Collaborative QSAR Models Without Sharing Structures. J Chem Inf Model 2017; 57:1847-1858. [DOI: 10.1021/acs.jcim.7b00315] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 11/30/2022]
Affiliation(s)
- Peter Gedeck
- Peter Gedeck LLC, 2309 Grove Avenue, Falls Church, Virginia 22046, United States
- Suzanne Skolnik
- Novartis Institute for Biomedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Stephane Rodde
- Novartis Institute for Biomedical Research, Postfach, CH-4002 Basel, Switzerland
5
Verras A, Waller CL, Gedeck P, Green DVS, Kogej T, Raichurkar A, Panda M, Shelat AA, Clark J, Guy RK, Papadatos G, Burrows J. Shared Consensus Machine Learning Models for Predicting Blood Stage Malaria Inhibition. J Chem Inf Model 2017; 57:445-453. [DOI: 10.1021/acs.jcim.6b00572] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]
Affiliation(s)
- Andreas Verras
- Merck & Co., Inc., Kenilworth, New Jersey 07033, United States
- Chris L. Waller
- Merck & Co., Inc., Boston, Massachusetts 02210, United States
- Peter Gedeck
- Novartis Institute for Tropical Diseases Pte. Ltd., Singapore 138670, Singapore
- Anang A. Shelat
- Chemical Biology and Therapeutics Department, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
- Julie Clark
- Chemical Biology and Therapeutics Department, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
- R. Kiplin Guy
- Chemical Biology and Therapeutics Department, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
- George Papadatos
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
- Jeremy Burrows
- Medicines for Malaria Ventures Discovery, Geneva 1215, Switzerland
6
Shoombuatong W, Prathipati P, Owasirikul W, Worachartcheewan A, Simeon S, Anuwongcharoen N, Wikberg JES, Nantasenamat C. Towards the Revival of Interpretable QSAR Models. Challenges and Advances in Computational Chemistry and Physics 2017. [DOI: 10.1007/978-3-319-56850-8_1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 12/20/2022]
7
Gabb HA, Blake C. An Informatics Approach to Evaluating Combined Chemical Exposures from Consumer Products: A Case Study of Asthma-Associated Chemicals and Potential Endocrine Disruptors. Environ Health Perspect 2016; 124:1155-1165. [PMID: 26955064 PMCID: PMC4977060 DOI: 10.1289/ehp.1510529] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 07/24/2015] [Revised: 09/25/2015] [Accepted: 02/18/2016] [Indexed: 05/20/2023]
Abstract
BACKGROUND: Simultaneous or sequential exposure to multiple environmental stressors can affect chemical toxicity. Cumulative risk assessments consider multiple stressors but it is impractical to test every chemical combination to which people are exposed. New methods are needed to prioritize chemical combinations based on their prevalence and possible health impacts.
OBJECTIVES: We introduce an informatics approach that uses publicly available data to identify chemicals that co-occur in consumer products, which account for a significant proportion of overall chemical load.
METHODS: Fifty-five asthma-associated and endocrine disrupting chemicals (target chemicals) were selected. A database of 38,975 distinct consumer products and 32,231 distinct ingredient names was created from online sources, and PubChem and the Unified Medical Language System were used to resolve synonymous ingredient names. Synonymous ingredient names are different names for the same chemical (e.g., vitamin E and tocopherol).
RESULTS: Nearly one-third of the products (11,688 products, 30%) contained ≥ 1 target chemical and 5,229 products (13%) contained > 1. Of the 55 target chemicals, 31 (56%) appear in ≥ 1 product and 19 (35%) appear under more than one name. The most frequent three-way chemical combination (2-phenoxyethanol, methyl paraben, and ethyl paraben) appears in 1,059 products. Further work is needed to assess combined chemical exposures related to the use of multiple products.
CONCLUSIONS: The informatics approach increased the number of products considered in a traditional analysis by two orders of magnitude, but missing/incomplete product labels can limit the effectiveness of this approach. Such an approach must resolve synonymy to ensure that chemicals of interest are not missed. Commonly occurring chemical combinations can be used to prioritize cumulative toxicology risk assessments.
CITATION: Gabb HA, Blake C. 2016. An informatics approach to evaluating combined chemical exposures from consumer products: a case study of asthma-associated chemicals and potential endocrine disruptors. Environ Health Perspect 124:1155-1165; http://dx.doi.org/10.1289/ehp.1510529.
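The co-occurrence counting described in this abstract (resolve synonymous ingredient names, then count which target chemicals appear together in the same product) can be illustrated with a small sketch. This is not the authors' pipeline: the synonym map, the product data, and the helper names `canonical` and `count_combinations` are all hypothetical and chosen only to show the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical synonym map: each alias resolves to one canonical name.
SYNONYMS = {"vitamin e": "tocopherol", "methylparaben": "methyl paraben"}


def canonical(name):
    """Normalize case and whitespace, then resolve known synonyms."""
    key = name.strip().lower()
    return SYNONYMS.get(key, key)


def count_combinations(products, targets, size):
    """Count how many products contain each `size`-way combination of targets."""
    counts = Counter()
    for ingredients in products.values():
        # Keep only target chemicals, deduplicated and in a stable order.
        present = sorted({canonical(i) for i in ingredients} & targets)
        for combo in combinations(present, size):
            counts[combo] += 1
    return counts


# Toy data, invented for illustration only.
PRODUCTS = {
    "lotion A": ["Water", "2-Phenoxyethanol", "Methylparaben", "Ethyl Paraben"],
    "shampoo B": ["Water", "2-phenoxyethanol", "methyl paraben", "ethyl paraben"],
    "soap C": ["Water", "Vitamin E", "Methyl Paraben"],
}
TARGETS = {"2-phenoxyethanol", "methyl paraben", "ethyl paraben"}

triples = count_combinations(PRODUCTS, TARGETS, 3)
pairs = count_combinations(PRODUCTS, TARGETS, 2)
print(triples.most_common(1))
```

Without the synonym step, "Methylparaben" in the first toy product would not match the target list, and the three-way combination would be undercounted; this is the synonymy problem the abstract warns about.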
Affiliation(s)
- Henry A. Gabb
- Address correspondence to H.A. Gabb, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E. Daniel St., Champaign, IL 61820 USA. Telephone: (217) 419-2625. E-mail:
8
Ekins S, Clark AM, Swamidass SJ, Litterman N, Williams AJ. Bigger data, collaborative tools and the future of predictive drug discovery. J Comput Aided Mol Des 2014; 28:997-1008. [PMID: 24943138 PMCID: PMC4198464 DOI: 10.1007/s10822-014-9762-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 03/29/2014] [Accepted: 06/09/2014] [Indexed: 12/31/2022]
Abstract
Over the past decade we have seen a growth in the provision of chemistry data and cheminformatics tools as either free websites or software-as-a-service commercial offerings. These have transformed how we find molecule-related data and use such tools in our research. There have also been efforts to improve collaboration between researchers either openly or through secure transactions using commercial tools. A major challenge in the future will be how such databases and software approaches handle larger amounts of data as it accumulates from high-throughput screening and enable the user to draw insights, make predictions and move projects forward. We now discuss how information from some drug discovery datasets can be made more accessible and how privacy of data should not overwhelm the desire to share it at an appropriate time with collaborators. We also discuss additional software tools that could be made available and provide our thoughts on the future of predictive drug discovery in this age of big data. We use some examples from our own research on neglected diseases, collaborations, mobile apps and algorithm development to illustrate these ideas.
Affiliation(s)
- Sean Ekins
- Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA
9
Matlock M, Swamidass SJ. Sharing chemical relationships does not reveal structures. J Chem Inf Model 2013; 54:37-48. [PMID: 24289228 DOI: 10.1021/ci400399a] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 01/01/2023]
Abstract
In this study, we propose a new, secure method of sharing useful chemical information from small-molecule libraries, without revealing the structures of the libraries' molecules. Our method shares the relationship between molecules rather than structural descriptors. This is an important advance because, over the past few years, several groups have developed and published new methods of analyzing small-molecule screening data. These methods include advanced hit-picking protocols, promiscuous active filters, economic optimization algorithms, and screening visualizations, which can identify patterns in the data that might otherwise be overlooked. Application of these methods to private data requires finding strategies for sharing useful chemical data without revealing chemical structures. This problem has been examined in the context of ADME prediction models, with results from information theory suggesting it is impossible to share useful chemical information without revealing structures. In contrast, we present a new strategy for encoding the relationships between molecules instead of their structures, based on anonymized scaffold networks and trees, that safely shares enough chemical information to be useful in analyzing chemical data, while also sufficiently blinding structures from discovery. We present the details of this encoding, an analysis of the usefulness of the information it conveys, and the security of the structures it encodes. This approach makes it possible to share data across institutions, and may securely enable collaborative analysis that can yield insight into both specific projects and screening technology as a whole.
Affiliation(s)
- Matthew Matlock
- Washington University School of Medicine, Department of Pathology and Immunology, St. Louis, Missouri 63110, United States
10
Simon L, Abdelmalek B. Design of skin penetration enhancers using replacement methods for the selection of the molecular descriptors. Pharmaceutics 2012; 4:343-53. [PMID: 24300295 PMCID: PMC3834920 DOI: 10.3390/pharmaceutics4030343] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/03/2012] [Revised: 06/25/2012] [Accepted: 06/28/2012] [Indexed: 11/23/2022]
Abstract
Transdermal delivery of certain drugs is challenging because of skin barrier resistance. This study focuses on the implementation of feature-selection algorithms to design chemical penetration enhancers. A database, consisting of 145 polar and nonpolar chemicals, was chosen for the investigation. Replacement, enhanced replacement and stepwise algorithms were applied to identify relevant structural properties of these compounds. The descriptors were calculated using Molecular Modeling Pro™ Plus. Based on the coefficient of determination, the replacement methods outperformed the stepwise approach in selecting the features that best correlated with the flux enhancement ratio. An artificial neural network model was built to map a subset of descriptors from sixty-one nonpolar enhancers onto the output vector. The R² value improved from 0.68, for a linear model, to 0.74, which shows that the improved framework might be effective in the design of compounds with user-defined properties.
Affiliation(s)
- Laurent Simon
- Otto H. York Department of Chemical, Biological and Pharmaceutical Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
11
12
Varnek A, Baskin II. Chemoinformatics as a Theoretical Chemistry Discipline. Mol Inform 2011; 30:20-32. [PMID: 27467875 DOI: 10.1002/minf.201000100] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Received: 08/31/2010] [Accepted: 01/14/2011] [Indexed: 01/29/2023]
Abstract
Here, chemoinformatics is considered as a theoretical chemistry discipline complementary to quantum chemistry and force-field molecular modeling. These three fields are compared with respect to molecular representation, inference mechanisms, basic concepts and application areas. A chemical space, a fundamental concept of chemoinformatics, is considered with respect to complex relations between chemical objects (graphs or descriptor vectors). Statistical Learning Theory, one of the main mathematical approaches in structure-property modeling, is briefly reviewed. Links between chemoinformatics and its "sister" fields - machine learning, chemometrics and bioinformatics are discussed.
Affiliation(s)
- Alexandre Varnek
- Laboratoire d'Infochimie, UMR 7177 CNRS, Université de Strasbourg, 4, rue B. Pascal, Strasbourg 67000, France
- Igor I Baskin
- Department of Chemistry, Moscow State University, Moscow 119991, Russia
13
Weis DC, Visco DP. Computer-aided molecular design using the Signature molecular descriptor: Application to solvent selection. Comput Chem Eng 2010. [DOI: 10.1016/j.compchemeng.2009.10.017] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.8] [Indexed: 11/26/2022]
14
Wong WW, Burkowski FJ. A constructive approach for discovering new drug leads: Using a kernel methodology for the inverse-QSAR problem. J Cheminform 2009; 1:4. [PMID: 20142987 PMCID: PMC2816860 DOI: 10.1186/1758-2946-1-4] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 02/15/2009] [Accepted: 04/28/2009] [Indexed: 12/04/2022]
Abstract
Background
The inverse-QSAR problem seeks to find a new molecular descriptor from which one can recover the structure of a molecule that possesses a desired activity or property. Surprisingly, there are very few papers providing solutions to this problem. It is a difficult problem because the molecular descriptors involved with the inverse-QSAR algorithm must adequately address the forward QSAR problem for a given biological activity if the subsequent recovery phase is to be meaningful. In addition, one should be able to construct a feasible molecule from such a descriptor. The difficulty of recovering the molecule from its descriptor is the major limitation of most inverse-QSAR methods.
Results
In this paper, we describe the reversibility of our previously reported descriptor, the vector space model molecular descriptor (VSMMD) based on a vector space model that is suitable for kernel studies in QSAR modeling. Our inverse-QSAR approach can be described using five steps: (1) generate the VSMMD for the compounds in the training set; (2) map the VSMMD in the input space to the kernel feature space using an appropriate kernel function; (3) design or generate a new point in the kernel feature space using a kernel feature space algorithm; (4) map the feature space point back to the input space of descriptors using a pre-image approximation algorithm; (5) build the molecular structure template using our VSMMD molecule recovery algorithm.
Conclusion
The empirical results reported in this paper show that our strategy of using kernel methodology for an inverse-Quantitative Structure-Activity Relationship is sufficiently powerful to find a meaningful solution for practical problems.
Electronic supplementary material
The online version of this article (doi:10.1186/1758-2946-1-4) contains supplementary material, which is available to authorized users.
Affiliation(s)
- William Wl Wong
- The David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
15
Bologa C, Allu TK, Olah M, Kappler MA, Oprea TI. Descriptor collision and confusion: Toward the design of descriptors to mask chemical structures. J Comput Aided Mol Des 2005; 19:625-35. [PMID: 16322910 DOI: 10.1007/s10822-005-9020-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 08/10/2005] [Accepted: 10/06/2005] [Indexed: 10/25/2022]
Abstract
We examined "descriptor collision" for several chemical fingerprint systems (MDL 320, Daylight, SMDL), and for a 2D-based descriptor set. For large databases (ChemNavigator and WOMBAT), the smallest collision rate remains around 5%. We systematically increase the "descriptor collision" rate (here termed "descriptor confusion"), in order to design a set of "descriptors to mask chemical structures", DMCS. If effective, a DMCS system would not allow third parties to determine the original chemical structures used to derive the DMCS set (i.e., reverse engineering). Using SMDL keys, the "confusion" rate is increased to 45.6% by eliminating those keys that have a low frequency of occurrence in WOMBAT structures. We applied an automated PLS engine, WB-PLS [Olah et al., J. Comput. Aided Mol. Des., 18 (2004) 437], to 1277 series of structures from 948 targets in WOMBAT, in order to validate the biological relevance of the SMDL descriptors as a potential DMCS set. The "reduced set" of SMDL descriptors has a small loss of modeling power (around 20%) compared to the initial descriptor set, while the collision rate is significantly increased. These results indicate that the development of an effective DMCS is possible. If well documented, DMCS systems would encourage private sector data release (e.g., related to water solubility) and directly benefit public sector science.
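The idea of raising the collision rate by discarding rarely set keys can be illustrated with a toy calculation. The sketch below is not the procedure from the paper: it assumes a "collision" means that a structure's fingerprint is identical to that of at least one other structure, represents fingerprints as sets of "on" key indices, and uses invented data and hypothetical helper names (`collision_rate`, `drop_rare_keys`).

```python
from collections import Counter


def collision_rate(fingerprints):
    """Fraction of structures sharing an identical fingerprint with another one."""
    counts = Counter(fingerprints.values())
    collided = sum(1 for fp in fingerprints.values() if counts[fp] > 1)
    return collided / len(fingerprints)


def drop_rare_keys(fingerprints, min_count):
    """Remove keys that are set in fewer than `min_count` structures."""
    freq = Counter(bit for fp in fingerprints.values() for bit in fp)
    keep = {bit for bit, n in freq.items() if n >= min_count}
    return {name: frozenset(fp & keep) for name, fp in fingerprints.items()}


# Toy fingerprints as sets of "on" key indices (invented for illustration).
FPS = {
    "mol1": frozenset({1, 2, 3}),
    "mol2": frozenset({1, 2, 3}),
    "mol3": frozenset({1, 2, 4}),
    "mol4": frozenset({1, 5}),
}

before = collision_rate(FPS)
masked = drop_rare_keys(FPS, min_count=3)
after = collision_rate(masked)
print(before, after)
```

In this toy set, removing the keys that occur in fewer than three structures makes a third molecule indistinguishable from the first two, so the measured rate rises. The trade-off the abstract reports, some loss of modeling power in exchange for more masking, is not modeled here.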
Affiliation(s)
- Cristian Bologa
- Division of Biocomputing, University of New Mexico School of Medicine, MSC11 6145, Albuquerque, NM 87131, USA