1. Ancajas CMF, Oyedele AS, Butt CM, Walker AS. Advances, opportunities, and challenges in methods for interrogating the structure activity relationships of natural products. Nat Prod Rep 2024. [PMID: 38912779 DOI: 10.1039/d4np00009a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/25/2024]
Abstract
Time span in literature: 1985-early 2024.
Natural products play a key role in drug discovery, both as a direct source of drugs and as a starting point for the development of synthetic compounds. Most natural products are not suitable to be used as drugs without further modification due to insufficient activity or poor pharmacokinetic properties. Choosing what modifications to make requires an understanding of the compound's structure-activity relationships. Use of structure-activity relationships is commonplace and essential in medicinal chemistry campaigns applied to human-designed synthetic compounds. Structure-activity relationships have also been used to improve the properties of natural products, but several challenges still limit these efforts. Here, we review methods for studying the structure-activity relationships of natural products and their limitations. Specifically, we will discuss how synthesis, including total synthesis, late-stage derivatization, chemoenzymatic synthetic pathways, and engineering and genome mining of biosynthetic pathways can be used to produce natural product analogs and discuss the challenges of each of these approaches. Finally, we will discuss computational methods including machine learning methods for analyzing the relationship between biosynthetic genes and product activity, computer aided drug design techniques, and interpretable artificial intelligence approaches towards elucidating structure-activity relationships from models trained to predict bioactivity from chemical structure. Our focus will be on these latter topics as their applications for natural products have not been extensively reviewed. We suggest that these methods are all complementary to each other, and that only collaborative efforts using a combination of these techniques will result in a full understanding of the structure-activity relationships of natural products.
Affiliation(s)
- Caitlin M Butt
  - Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Allison S Walker
  - Department of Chemistry, Vanderbilt University, Nashville, TN, USA
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, USA
  - Department of Pathology, Microbiology, and Immunology, Vanderbilt University Medical Center, Nashville, TN, USA
2. Jia X, Wang T, Zhu H. Advancing Computational Toxicology by Interpretable Machine Learning. Environmental Science & Technology 2023; 57:17690-17706. [PMID: 37224004 PMCID: PMC10666545 DOI: 10.1021/acs.est.3c00653] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 01/25/2023] [Revised: 05/05/2023] [Accepted: 05/05/2023] [Indexed: 05/26/2023]
Abstract
Chemical toxicity evaluations for drugs, consumer products, and environmental chemicals have a critical impact on human health. Traditional animal models to evaluate chemical toxicity are expensive, time-consuming, and often fail to detect toxicants in humans. Computational toxicology is a promising alternative approach that utilizes machine learning (ML) and deep learning (DL) techniques to predict the toxicity potentials of chemicals. Although the applications of ML- and DL-based computational models in chemical toxicity predictions are attractive, many toxicity models are "black boxes" in nature and difficult to interpret by toxicologists, which hampers the chemical risk assessments using these models. The recent progress of interpretable ML (IML) in the computer science field meets this urgent need to unveil the underlying toxicity mechanisms and elucidate the domain knowledge of toxicity models. In this review, we focused on the applications of IML in computational toxicology, including toxicity feature data, model interpretation methods, use of knowledge base frameworks in IML development, and recent applications. The challenges and future directions of IML modeling in toxicology are also discussed. We hope this review can encourage efforts in developing interpretable models with new IML algorithms that can assist new chemical assessments by illustrating toxicity mechanisms in humans.
Affiliation(s)
- Xuelian Jia
  - Department of Chemistry and Biochemistry, Rowan University, Glassboro, New Jersey 08028, United States
- Tong Wang
  - Department of Chemistry and Biochemistry, Rowan University, Glassboro, New Jersey 08028, United States
- Hao Zhu
  - Department of Chemistry and Biochemistry, Rowan University, Glassboro, New Jersey 08028, United States
3. Evaluating eXplainable artificial intelligence tools for hard disk drive predictive maintenance. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10354-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
4. Jiménez-Luna J, Skalic M, Weskamp N. Benchmarking Molecular Feature Attribution Methods with Activity Cliffs. J Chem Inf Model 2022; 62:274-283. [PMID: 35019265 DOI: 10.1021/acs.jcim.1c01163] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 01/04/2023]
Abstract
Feature attribution techniques are popular choices within the explainable artificial intelligence toolbox, as they can help elucidate which parts of the provided inputs used by an underlying supervised-learning method are considered relevant for a specific prediction. In the context of molecular design, these approaches typically involve the coloring of molecular graphs, whose presentation to medicinal chemists can be useful for making a decision of which compounds to synthesize or prioritize. The consistency of the highlighted moieties alongside expert background knowledge is expected to contribute to the understanding of machine-learning models in drug design. Quantitative evaluation of such coloring approaches, however, has so far been limited to substructure identification tasks. We here present an approach that is based on maximum common substructure algorithms applied to experimentally-determined activity cliffs. Using the proposed benchmark, we found that molecule coloring approaches in conjunction with classical machine-learning models tend to outperform more modern, graph-neural-network alternatives. The provided benchmark data are fully open sourced, which we hope will facilitate the testing of newly developed molecular feature attribution techniques.
Affiliation(s)
- José Jiménez-Luna
  - Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, 8093 Zurich, Switzerland
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riss, Germany
- Miha Skalic
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riss, Germany
- Nils Weskamp
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riss, Germany
5. Rodríguez-Pérez R, Bajorath J. Explainable Machine Learning for Property Predictions in Compound Optimization. J Med Chem 2021; 64:17744-17752. [PMID: 34902252 DOI: 10.1021/acs.jmedchem.1c01789] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 11/28/2022]
Abstract
The prediction of compound properties from chemical structure is a main task for machine learning (ML) in medicinal chemistry. ML is often applied to large data sets in applications such as compound screening, virtual library enumeration, or generative chemistry. Albeit desirable, a detailed understanding of ML model decisions is typically not required in these cases. By contrast, compound optimization efforts rely on small data sets to identify structural modifications leading to desired property profiles. In this situation, if ML is applied, one usually is reluctant to make decisions based on predictions that cannot be rationalized. Only few ML methods are interpretable. However, to yield insights into complex ML model decisions, explanatory approaches can be applied. Herein, methodologies for better understanding of ML models or explaining individual predictions are reviewed and current challenges in integrating ML into medicinal chemistry programs as well as future opportunities are discussed.
Affiliation(s)
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
6. Heat Maps: Perfect Maps for Quick Reading? Comparing Usability of Heat Maps with Different Levels of Generalization. ISPRS International Journal of Geo-Information 2021. [DOI: 10.3390/ijgi10080562] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/29/2023]
Abstract
Recently, due to Web 2.0 and neocartography, heat maps have become a popular map type for quick reading. Heat maps are graphical representations of geographic data density in the form of raster maps, elaborated by applying kernel density estimation with a given radius on point- or linear-input data. The aim of this study was to compare the usability of heat maps with different levels of generalization (defined by radii of 10, 20, 30, and 40 pixels) for basic map user tasks. A user study with 412 participants (16–20 years old, high school students) was carried out in order to compare heat maps that showed the same input data. The study was conducted in schools during geography or IT lessons. Objective (the correctness of the answer, response times) and subjective (response time self-assessment, task difficulty, preferences) metrics were measured. The results show that the smaller radius resulted in the higher correctness of the answers. A larger radius did not result in faster response times. The participants perceived the more generalized maps as easier to use, although this result did not match the performance metrics. Overall, we believe that heat maps, in given circumstances and appropriate design settings, can be considered an efficient method for spatial data presentation.
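The abstract defines a heat map as a raster produced by kernel density estimation with a given radius. The following is a minimal sketch of that idea in plain Python. The quartic kernel, the `heat_map` function name, and the sample points are assumptions made for illustration; the abstract does not state which kernel or software the study used.

```python
import math


def heat_map(points, width, height, radius):
    """Return a height x width raster of kernel density values.

    Assumes a quartic (biweight) kernel; this is an illustrative choice,
    not the kernel reported in the study.
    """
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Visit only the cells inside the bounding box of the kernel footprint.
        x_lo = max(0, math.floor(px - radius))
        x_hi = min(width - 1, math.ceil(px + radius))
        y_lo = max(0, math.floor(py - radius))
        y_hi = min(height - 1, math.ceil(py + radius))
        for y in range(y_lo, y_hi + 1):
            for x in range(x_lo, x_hi + 1):
                # Squared distance to the point, scaled by the radius.
                d2 = ((x - px) ** 2 + (y - py) ** 2) / radius ** 2
                if d2 < 1.0:
                    grid[y][x] += (1.0 - d2) ** 2
    return grid


def nonzero_cells(grid):
    """Count raster cells that received any density."""
    return sum(1 for row in grid for value in row if value > 0.0)


points = [(10, 10), (12, 11), (30, 25)]  # hypothetical point data (x, y)
small = heat_map(points, 40, 40, radius=10)  # less generalized
large = heat_map(points, 40, 40, radius=40)  # more generalized
```

With the same input points, the larger radius spreads density over more cells, which is the "level of generalization" the study varies (radii of 10 to 40 pixels).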
7. Ye Z, Yang W, Yang Y, Ouyang D. Interpretable machine learning methods for in vitro pharmaceutical formulation development. Food Frontiers 2021. [DOI: 10.1002/fft2.78] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022] Open
Affiliation(s)
- Zhuyifan Ye
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
- Wenmian Yang
  - State Key Laboratory of Internet of Things for Smart City, University of Macau, Macau, China
- Yilong Yang
  - School of Software, Beihang University, Beijing, China
- Defang Ouyang
  - State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China
8. Wang Z, Dreyer F, Pulvermüller F, Ntemou E, Vajkoczy P, Fekonja LS, Picht T. Support vector machine based aphasia classification of transcranial magnetic stimulation language mapping in brain tumor patients. Neuroimage Clin 2020; 29:102536. [PMID: 33360768 PMCID: PMC7772815 DOI: 10.1016/j.nicl.2020.102536] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/13/2020] [Revised: 11/30/2020] [Accepted: 12/12/2020] [Indexed: 12/03/2022]
Abstract
Repetitive TMS (rTMS) allows for non-invasive and transient disruption of local neuronal functioning. We used machine learning approaches to assess whether brain tumor patients can be accurately classified into aphasic and non-aphasic groups using their rTMS language mapping results as input features. Given that each tumor affects the subject-specific language networks differently, resulting in heterogenous rTMS functional mappings, we propose the use of machine learning strategies to classify potential patterns of rTMS language mapping results. We retrospectively included 90 patients with left perisylvian world health organization (WHO) grade II-IV gliomas that underwent presurgical navigated rTMS language mapping. Within our cohort, 29 of 90 (32.2%) patients suffered from at least mild aphasia as shown in the Aachen Aphasia Test based Berlin Aphasia Score (BAS). After spatial normalization to MNI 152 of all rTMS spots, we calculated the error rate (ER) in each stimulated cortical area (28 regions of interest, ROI) by automated anatomical labeling parcellation (AAL3) and IIT. We used a support vector machine (SVM) to classify significant areas in relation to aphasia. After feeding the ROIs into the SVM model, it revealed that in addition to age (w = 2.98), the ERs of the left supramarginal gyrus (w = 3.64), left inferior parietal gyrus (w = 2.28) and right pars triangularis (w = 1.34) contributed more than other features to the model. The model's sensitivity was 86.2%, the specificity was 82.0%, the overall accuracy was 85.5% and the AUC was 89.3%. Our results demonstrate an increased vulnerability of right inferior pars triangularis to rTMS in aphasic patients due to left perisylvian gliomas. This finding points towards a functional relevant involvement of the right pars triangularis in response to aphasia. The tumor location feature, specified by calculating overlaps with white and grey matter atlases, did not affect the SVM model. 
The left supramarginal gyrus as a feature improved our SVM model the most. Additionally, our results could point towards a decreasing potential for neuroplasticity with age.
Affiliation(s)
- Ziqian Wang
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Felix Dreyer
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt Universität zu Berlin, Berlin, Germany
  - Freie Universität Berlin, Brain Language Laboratory, Department of Philosophy and Humanities, Berlin, Germany
- Friedemann Pulvermüller
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt Universität zu Berlin, Berlin, Germany
  - Freie Universität Berlin, Brain Language Laboratory, Department of Philosophy and Humanities, Berlin, Germany
- Effrosyni Ntemou
  - University of Groningen, Department of Neurolinguistics, Groningen, The Netherlands
- Peter Vajkoczy
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Lucius S Fekonja
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt Universität zu Berlin, Berlin, Germany
- Thomas Picht
  - Department of Neurosurgery, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Cluster of Excellence: "Matters of Activity. Image Space Material", Humboldt Universität zu Berlin, Berlin, Germany
9. Tinkov O, Polishchuk P, Matveieva M, Grigorev V, Grigoreva L, Porozov Y. The Influence of Structural Patterns on Acute Aquatic Toxicity of Organic Compounds. Mol Inform 2020; 40:e2000209. [PMID: 33029954 DOI: 10.1002/minf.202000209] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 08/14/2020] [Accepted: 10/01/2020] [Indexed: 12/28/2022]
Abstract
Investigation of the influence of molecular structure of different organic compounds on acute toxicity towards Fathead minnow, Daphnia magna, and Tetrahymena pyriformis has been carried out using 2D simplex representation of molecular structure and two modelling methods: Random Forest (RF) and Gradient Boosting Machine (GBM). Suitable QSAR (Quantitative Structure - Activity Relationships) models were obtained. The study was focused on QSAR models interpretation. The aim of the study was to develop a set of structural fragments that simultaneously consistently increase toxicity toward Fathead minnow, Daphnia magna, Tetrahymena pyriformis. The interpretation allowed to gain more details about known toxicophores and to propose new fragments. The results obtained made it possible to rank the contributions of molecular fragments to various types of toxicity to aquatic organisms. This information can be used for molecular optimization of chemicals. According to the results of structural interpretation, the most significant common mechanisms of the toxic effect of organic compounds on Fathead minnow, Daphnia magna and Tetrahymena pyriformis are reactions of nucleophilic substitution and inhibition of oxidative phosphorylation in mitochondria. In addition acetylcholinesterase and voltage-gated ion channel of Fathead minnow and Daphnia magna are important targets for toxicants. The on-line version of the OCHEM expert system (https://ochem.eu) were used for a comparative QSAR investigation. The proposed QSAR models comply with the OECD principles and can be used to reliably predict acute toxicity of organic compounds towards Fathead minnow, Daphnia magna and Tetrahymena pyriformis with allowance for applicability domain estimation.
Affiliation(s)
- Oleg Tinkov
  - Department of Computer Science, Military Institute of the Ministry of Defense, 3300, Gogol str. 2"B", Tiraspol, Transdniestria, Moldova
  - Department of Pharmacology and Pharmaceutical Chemistry, Medical Faculty, Transnistrian State University, 3300, October 25 str. 128, Tiraspol, Transdniestria, Moldova
- Pavel Polishchuk
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Mariia Matveieva
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hnevotinska 5, 77900, Olomouc, Czech Republic
- Veniamin Grigorev
  - Institute of Physiologically Active Compounds, Russian Academy of Sciences, 142432, Severniy proezd 1, Chernogolovka, Moscow region, Russia
- Ludmila Grigoreva
  - Department of Fundamental Physical and Chemical Engineering, Moscow State University, 119991, Leninskiye Gory 1/51, Moscow, Russia
- Yuri Porozov
  - World-Class Research Center "Digital biodesign and personalized healthcare", I.M. Sechenov First Moscow State Medical University, Moscow, Russia
  - Department of Computational Biology, Sirius University of Science and Technology, 354340, Olympic Ave 1, Sochi, Russia
10. Sheridan RP. Interpretation of QSAR Models by Coloring Atoms According to Changes in Predicted Activity: How Robust Is It? J Chem Inf Model 2019; 59:1324-1337. [DOI: 10.1021/acs.jcim.8b00825] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Indexed: 11/28/2022]
Affiliation(s)
- Robert P. Sheridan
  - Modeling and Informatics, Merck & Co. Inc., Kenilworth, New Jersey 07065, United States
11. Mellor C, Marchese Robinson R, Benigni R, Ebbrell D, Enoch S, Firman J, Madden J, Pawar G, Yang C, Cronin M. Molecular fingerprint-derived similarity measures for toxicological read-across: Recommendations for optimal use. Regul Toxicol Pharmacol 2019; 101:121-134. [DOI: 10.1016/j.yrtph.2018.11.002] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 08/23/2018] [Revised: 10/09/2018] [Accepted: 11/12/2018] [Indexed: 12/20/2022]
12. Pu L, Naderi M, Liu T, Wu HC, Mukhopadhyay S, Brylinski M. eToxPred: a machine learning-based approach to estimate the toxicity of drug candidates. BMC Pharmacol Toxicol 2019; 20:2. [PMID: 30621790 PMCID: PMC6325674 DOI: 10.1186/s40360-018-0282-6] [Citation(s) in RCA: 65] [Impact Index Per Article: 13.0] [Received: 08/14/2018] [Accepted: 12/26/2018] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: The efficiency of drug development defined as a number of successfully launched new pharmaceuticals normalized by financial investments has significantly declined. Nonetheless, recent advances in high-throughput experimental techniques and computational modeling promise reductions in the costs and development times required to bring new drugs to market. The prediction of toxicity of drug candidates is one of the important components of modern drug discovery. RESULTS: In this work, we describe eToxPred, a new approach to reliably estimate the toxicity and synthetic accessibility of small organic compounds. eToxPred employs machine learning algorithms trained on molecular fingerprints to evaluate drug candidates. The performance is assessed against multiple datasets containing known drugs, potentially hazardous chemicals, natural products, and synthetic bioactive compounds. Encouragingly, eToxPred predicts the synthetic accessibility with the mean square error of only 4% and the toxicity with the accuracy of as high as 72%. CONCLUSIONS: eToxPred can be incorporated into protocols to construct custom libraries for virtual screening in order to filter out those drug candidates that are potentially toxic or would be difficult to synthesize. It is freely available as a stand-alone software at https://github.com/pulimeng/etoxpred.
Affiliation(s)
- Limeng Pu
  - Division of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
- Misagh Naderi
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
- Tairan Liu
  - Department of Mechanical Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
- Hsiao-Chun Wu
  - Division of Electrical & Computer Engineering, Louisiana State University, Baton Rouge, LA, 70803, USA
- Supratik Mukhopadhyay
  - Department of Computer Science, Louisiana State University, Baton Rouge, LA, 70803, USA
- Michal Brylinski
  - Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, 70803, USA
  - Center for Computation & Technology, Louisiana State University, Baton Rouge, LA, 70803, USA
13. Helal S, Li J, Liu L, Ebrahimie E, Dawson S, Murray DJ, Long Q. Predicting academic performance by considering student heterogeneity. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.07.042] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Indexed: 10/28/2022]
14. Mayr A, Klambauer G, Unterthiner T, Steijaert M, Wegner JK, Ceulemans H, Clevert DA, Hochreiter S. Large-scale comparison of machine learning methods for drug target prediction on ChEMBL. Chem Sci 2018; 9:5441-5451. [PMID: 30155234 PMCID: PMC6011237 DOI: 10.1039/c8sc00148k] [Citation(s) in RCA: 252] [Impact Index Per Article: 42.0] [Received: 01/10/2018] [Accepted: 05/16/2018] [Indexed: 12/24/2022] Open
Abstract
Deep learning is currently the most successful machine learning technique in a wide range of application areas and has recently been applied successfully in drug discovery research to predict potential drug targets and to screen for active molecules. However, due to (1) the lack of large-scale studies, (2) the compound series bias that is characteristic of drug discovery datasets and (3) the hyperparameter selection bias that comes with the high number of potential deep learning architectures, it remains unclear whether deep learning can indeed outperform existing computational methods in drug discovery tasks. We therefore assessed the performance of several deep learning methods on a large-scale drug discovery dataset and compared the results with those of other machine learning and target prediction methods. To avoid potential biases from hyperparameter selection or compound series, we used a nested cluster-cross-validation strategy. We found (1) that deep learning methods significantly outperform all competing methods and (2) that the predictive performance of deep learning is in many cases comparable to that of tests performed in wet labs (i.e., in vitro assays).
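The abstract mentions a nested cluster-cross-validation strategy to avoid compound-series and hyperparameter-selection bias. The sketch below shows only the splitting logic, under stated assumptions: cluster labels are already available, whole clusters are assigned to folds, and inner folds are built from the outer training clusters alone. The function names, fold counts, and cluster labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds):
    """Assign whole clusters to folds, largest clusters first."""
    members = defaultdict(list)
    for idx, cluster in enumerate(cluster_ids):
        members[cluster].append(idx)
    folds = [[] for _ in range(n_folds)]
    for cluster in sorted(members.values(), key=len, reverse=True):
        # Put the next cluster into the currently smallest fold.
        min(folds, key=len).extend(cluster)
    return folds


def nested_cluster_cv(cluster_ids, n_outer=3, n_inner=2):
    """Yield (inner_splits, outer_train, outer_test) as index lists."""
    for test in cluster_folds(cluster_ids, n_outer):
        test_set = set(test)
        train = [i for i in range(len(cluster_ids)) if i not in test_set]
        # Inner folds use the training clusters only, so held-out clusters
        # never influence hyperparameter selection.
        inner_folds = cluster_folds([cluster_ids[i] for i in train], n_inner)
        inner_splits = []
        for val_positions in inner_folds:
            val = [train[p] for p in val_positions]
            val_set = set(val)
            fit = [i for i in train if i not in val_set]
            inner_splits.append((fit, val))
        yield inner_splits, train, test


# Hypothetical cluster labels for nine compounds (e.g. one label per series).
clusters = ["a", "a", "b", "b", "b", "c", "d", "d", "e"]
splits = list(nested_cluster_cv(clusters))
```

In use, hyperparameters would be chosen on the inner `(fit, val)` splits and the selected model scored once on the outer `test` indices, so no cluster contributes to both model selection and final evaluation.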
Affiliation(s)
- Andreas Mayr
  - LIT AI Lab and Institute of Bioinformatics, Johannes Kepler University Linz, Austria. Tel: +43-732-2468-4521
- Günter Klambauer
  - LIT AI Lab and Institute of Bioinformatics, Johannes Kepler University Linz, Austria. Tel: +43-732-2468-4521
- Thomas Unterthiner
  - LIT AI Lab and Institute of Bioinformatics, Johannes Kepler University Linz, Austria. Tel: +43-732-2468-4521
- Sepp Hochreiter
  - LIT AI Lab and Institute of Bioinformatics, Johannes Kepler University Linz, Austria. Tel: +43-732-2468-4521
15. Marchese Robinson RL, Palczewska A, Palczewski J, Kidley N. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets. J Chem Inf Model 2017; 57:1773-1792. [PMID: 28715209 DOI: 10.1021/acs.jcim.6b00753] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Indexed: 11/29/2022]
Abstract
The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package ( https://r-forge.r-project.org/R/?group_id=1725 ) for the R statistical programming language and the Python program HeatMapWrapper [ https://doi.org/10.5281/zenodo.495163 ] for heat map generation.
Affiliation(s)
- Richard L Marchese Robinson
  - Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, United Kingdom
- Anna Palczewska
  - Department of Computing, University of Bradford, Bradford BD7 1DP, United Kingdom
- Jan Palczewski
  - School of Mathematics, University of Leeds, Leeds LS2 9JT, United Kingdom
- Nathan Kidley
  - Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom
16. Improving the expressiveness of black-box models for predicting student performance. Computers in Human Behavior 2017. [DOI: 10.1016/j.chb.2016.09.001] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 11/18/2022]
17. Koutsoukas A, Monaghan KJ, Li X, Huan J. Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J Cheminform 2017; 9:42. [PMID: 29086090 PMCID: PMC5489441 DOI: 10.1186/s13321-017-0226-y] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 09/30/2016] [Accepted: 05/27/2017] [Indexed: 01/03/2023] Open
Abstract
Background: In recent years, research in artificial neural networks has resurged, now under the deep-learning umbrella, and grown extremely popular. Recently reported success of DL techniques in crowd-sourced QSAR and predictive toxicology competitions has showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold: first, a large number of hyper-parameter configurations were explored to investigate how they affect the performance of DNNs and could act as starting points when tuning DNNs; second, their performance was compared to popular methods widely employed in the field of cheminformatics, namely Naïve Bayes, k-nearest neighbor, random forest and support vector machines. Moreover, robustness of machine learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVidia GPU units were utilized to carry out this study, allowing a large number of DNN configurations to be explored. Results: We show that feed-forward deep neural networks are capable of achieving strong classification performance and outperform shallow methods across diverse activity classes when optimized. Hyper-parameters that were found to play a critical role are the activation function, dropout regularization, number of hidden layers and number of neurons. When compared to the other methods, tuned DNNs were found to perform statistically significantly better, with p value <0.01 based on the Wilcoxon statistical test. On average, DNNs achieved an MCC 0.149 higher than NB, 0.092 higher than kNN, 0.052 higher than SVM with linear kernel, 0.021 higher than RF and 0.009 higher than SVM with radial basis function kernel. When exploring robustness to noise, non-linear methods were found to perform well at low levels of noise (20% or lower); at higher levels of noise (above 30%), however, the Naïve Bayes method was found to perform well and, at the highest noise level of 50%, even to outperform more sophisticated methods across several datasets.
Affiliation(s)
- Alexios Koutsoukas
  - Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
- Keith J Monaghan
  - Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
- Xiaoli Li
  - Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
- Jun Huan
  - Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS, 66047-7621, USA
|
18
|
Shoombuatong W, Prathipati P, Owasirikul W, Worachartcheewan A, Simeon S, Anuwongcharoen N, Wikberg JES, Nantasenamat C. Towards the Revival of Interpretable QSAR Models. Challenges and Advances in Computational Chemistry and Physics 2017. [DOI: 10.1007/978-3-319-56850-8_1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 12/20/2022]
|
19
|
Gütlein M, Kramer S. Filtered circular fingerprints improve either prediction or runtime performance while retaining interpretability. J Cheminform 2016; 8:60. [PMID: 27853484 PMCID: PMC5088672 DOI: 10.1186/s13321-016-0173-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/19/2016] [Accepted: 10/18/2016] [Indexed: 11/10/2022] Open
Abstract
Background: Even though circular fingerprints were first introduced more than 50 years ago, they are still widely used for building highly predictive, state-of-the-art (Q)SAR models. Historically, these structural fragments were designed to search large molecular databases. Hence, to derive a compact representation, circular fingerprint fragments are often folded to comparatively short bit-strings. However, folding fingerprints introduces bit collisions, and therefore adds noise to the encoded structural information and removes its interpretability. Both representations, folded as well as unprocessed fingerprints, are often used for (Q)SAR modeling. Results: We show that it can be preferable to build (Q)SAR models with circular fingerprint fragments that have been filtered by supervised feature selection, instead of applying folded or all fragments. Compared to folded fingerprints, filtered fingerprints significantly increase predictive performance and remain unambiguous and interpretable. Compared to unprocessed fingerprints, filtered fingerprints reduce the computational effort and are a more compact and less redundant feature representation. Depending on the selected learning algorithm, filtering yields about equally predictive (Q)SAR models. We demonstrate the suitability of filtered fingerprints for (Q)SAR modeling by presenting our freely available web service Collision-free Filtered Circular Fingerprints that provides rationales for predictions by highlighting important structural features in the query compound (see http://coffer.informatik.uni-mainz.de). Conclusions: Circular fingerprints are potent structural features that yield highly predictive models and encode interpretable structural information. However, to avoid losing interpretability, circular fingerprints should not be folded when building prediction models. Our experiments show that filtering is a suitable option to reduce the high computational effort when working with all fingerprint fragments. Additionally, our experiments suggest that the area under the precision-recall curve is a more sensible statistic for validating (Q)SAR models for virtual screening than the area under ROC or other measures for early recognition.
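The bit collisions that this abstract attributes to folding can be illustrated with a minimal sketch. It assumes one common folding scheme, taking each fragment identifier modulo the bit-string length; the function names and the integer identifiers are hypothetical and are not taken from the cited work.

```python
from collections import defaultdict


def fold_fragments(fragment_ids, n_bits):
    """Map each fragment identifier to a bit position by modulo folding (assumed scheme)."""
    bit_to_fragments = defaultdict(list)
    for fragment_id in fragment_ids:
        bit_to_fragments[fragment_id % n_bits].append(fragment_id)
    return dict(bit_to_fragments)


def find_collisions(bit_to_fragments):
    """Keep only the bit positions that are shared by more than one fragment."""
    return {bit: frags for bit, frags in bit_to_fragments.items() if len(frags) > 1}


# Hypothetical integer identifiers standing in for hashed circular fragments.
fragment_ids = [3, 11, 19, 42, 1034]
folded = fold_fragments(fragment_ids, n_bits=8)
collisions = find_collisions(folded)
print(collisions)
```

Once several fragments share a bit, a model weight on that bit can no longer be traced back to a single substructure, which is the loss of interpretability the abstract describes. With a much longer bit-string (for example `n_bits=2048`) these particular identifiers no longer collide.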
Affiliation(s)
- Martin Gütlein
  - Chair of Data Mining, Institute of Computer Science, Johannes Gutenberg - Universität Mainz, Staudingerweg 9, 55128 Mainz, Germany
- Stefan Kramer
  - Chair of Data Mining, Institute of Computer Science, Johannes Gutenberg - Universität Mainz, Staudingerweg 9, 55128 Mainz, Germany
|
20
|
Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Structural and Physico-Chemical Interpretation (SPCI) of QSAR Models and Its Comparison with Matched Molecular Pair Analysis. J Chem Inf Model 2016; 56:1455-69. [DOI: 10.1021/acs.jcim.6b00371] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 02/02/2023]
Affiliation(s)
- Pavel Polishchuk
  - Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hněvotínská 1333/5, 779 00 Olomouc, Czech Republic
  - A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
- Oleg Tinkov
  - T. G. Shevchenko Transdniestria State University, ul. 25 Oktyabrya 107, 3300 Tiraspol, Transdniestria, Republic of Moldova
- Tatiana Khristova
  - A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
  - Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1 rue Blaise Pascal, 67000 Strasbourg, France
- Ludmila Ognichenko
  - A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
- Anna Kosinskaya
  - A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
- Alexandre Varnek
  - Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1 rue Blaise Pascal, 67000 Strasbourg, France
  - Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institut of Chemistry, Kazan Federal University, Kremlevskaya 18, Kazan, Russia
- Victor Kuz’min
  - A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
|
21
|
Rivera-Borroto OM, García-de la Vega JM, Marrero-Ponce Y, Grau R. Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:158-67. [PMID: 26886740 DOI: 10.1109/tcbb.2015.2424435] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 05/27/2023]
Abstract
Research on similarity searching of cheminformatic data sets has been focused on similarity measures using fingerprints. However, nominal scales are the least informative of all metric scales, increasing the number of tied similarity scores and decreasing the effectiveness of the retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, this field is far from being exhausted since the computer science no free lunch theorem predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared to a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), by using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms four other nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants.
|
22
|
Gagliano SA, Ravji R, Barnes MR, Weale ME, Knight J. Smoking Gun or Circumstantial Evidence? Comparison of Statistical Learning Methods using Functional Annotations for Prioritizing Risk Variants. Sci Rep 2015; 5:13373. [PMID: 26300220 PMCID: PMC4642511 DOI: 10.1038/srep13373] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 01/15/2015] [Accepted: 07/24/2015] [Indexed: 11/09/2022] Open
Abstract
Although technology has triumphed in facilitating routine genome sequencing, new challenges have been created for the data-analyst. Genome-scale surveys of human variation generate volumes of data that far exceed capabilities for laboratory characterization. By incorporating functional annotations as predictors, statistical learning has been widely investigated for prioritizing genetic variants likely to be associated with complex disease. We compared three published prioritization procedures, which use different statistical learning algorithms and different predictors with regard to the quantity, type and coding. We also explored different combinations of algorithm and annotation set. As an application, we tested which methodology performed best for prioritizing variants using data from a large schizophrenia meta-analysis by the Psychiatric Genomics Consortium. Results suggest that all methods have considerable (and similar) predictive accuracies (AUCs 0.64-0.71) in test set data, but there is more variability in the application to the schizophrenia GWAS. In conclusion, a variety of algorithms and annotations seem to have a similar potential to effectively enrich true risk variants in genome-scale datasets, however none offer more than incremental improvement in prediction. We discuss how methods might be evolved for risk variant prediction to address the impending bottleneck of the new generation of genome re-sequencing studies.
Affiliation(s)
- Sarah A Gagliano
  - Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
- Reena Ravji
  - Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
- Michael R Barnes
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- Michael E Weale
  - Department of Medical & Molecular Genetics, King's College London, Guy's Hospital, London, UK
- Jo Knight
  - Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
  - Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Psychiatry, University of Toronto, Toronto, Ontario, Canada
  - Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
|
23
|
Balfer J, Bajorath J. Visualization and Interpretation of Support Vector Machine Activity Predictions. J Chem Inf Model 2015; 55:1136-47. [DOI: 10.1021/acs.jcim.5b00175] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 01/28/2023]
Affiliation(s)
- Jenny Balfer
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
|
24
|
Cortes-Ciriano I, Murrell DS, van Westen GJ, Bender A, Malliavin TE. Prediction of the potency of mammalian cyclooxygenase inhibitors with ensemble proteochemometric modeling. J Cheminform 2015; 7:1. [PMID: 25705261 PMCID: PMC4335128 DOI: 10.1186/s13321-014-0049-z] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Received: 08/01/2014] [Accepted: 11/21/2014] [Indexed: 12/16/2022] Open
Abstract
Cyclooxygenases (COX) are present in the body in two isoforms, namely COX-1, constitutively expressed, and COX-2, induced in physiopathological conditions such as cancer or chronic inflammation. The inhibition of COX with non-steroidal anti-inflammatory drugs (NSAIDs) is the most widely used treatment for chronic inflammation despite the adverse effects associated with prolonged NSAID intake. Although selective COX-2 inhibition has been shown not to palliate all adverse effects (e.g. cardiotoxicity), there are still niche populations which can benefit from selective COX-2 inhibition. Thus, capitalizing on bioactivity data from both isoforms simultaneously would contribute to the development of COX inhibitors with better safety profiles. We applied ensemble proteochemometric modeling (PCM) for the prediction of the potency of 3,228 distinct COX inhibitors on 11 mammalian cyclooxygenases. Ensemble PCM models ([Formula: see text], and RMSEtest = 0.71) outperformed models exclusively trained on compound ([Formula: see text], and RMSEtest = 1.09) or protein descriptors ([Formula: see text] and RMSEtest = 1.10) on the test set. Moreover, PCM predicted COX potency for 1,086 selective and non-selective COX inhibitors with [Formula: see text] and RMSEtest = 0.76. These values are in agreement with the maximum and minimum achievable [Formula: see text] and RMSEtest values of approximately 0.68 for both metrics. Confidence intervals for individual predictions were calculated from the standard deviation of the predictions from the individual models composing the ensembles. Finally, two substructure analysis pipelines singled out chemical substructures implicated in both potency and selectivity in agreement with the literature. Graphical Abstract: Prediction of uncorrelated bioactivity profiles for mammalian COX inhibitors with Ensemble Proteochemometric Modeling.
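The abstract's idea of deriving a per-compound interval from the spread of the individual ensemble members can be sketched as follows. The abstract does not say how the standard deviation was turned into an interval, so the `width` multiplier, the function name, and the example predictions are all assumptions made for illustration.

```python
from statistics import mean, stdev


def ensemble_interval(model_predictions, width=1.96):
    """Return (center, lower, upper) from the predictions of the ensemble members.

    The center is the mean prediction; the interval half-width is `width`
    times the sample standard deviation (an assumed convention).
    """
    center = mean(model_predictions)
    spread = stdev(model_predictions)
    return center, center - width * spread, center + width * spread


# Hypothetical potency predictions for one compound from three ensemble members.
predictions = [6.0, 6.5, 7.0]
center, lower, upper = ensemble_interval(predictions, width=2.0)
print(center, lower, upper)
```

A compound on which the member models disagree strongly gets a wide interval, which flags its prediction as less reliable.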
Affiliation(s)
- Isidro Cortes-Ciriano
  - Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825, 25, rue du Dr Roux, Paris, 75015 France
- Daniel S Murrell
  - Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Gerard Jp van Westen
  - European Molecular Biology Laboratory European Bioinformatics Institute Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD UK
- Andreas Bender
  - Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Cambridge, UK
- Thérèse E Malliavin
  - Département de Biologie Structurale et Chimie, Institut Pasteur, Unité de Bioinformatique Structurale; CNRS UMR 3825, 25, rue du Dr Roux, Paris, 75015 France
|
25
|
Dörr A, Rosenbaum L, Zell A. A ranking method for the concurrent learning of compounds with various activity profiles. J Cheminform 2015; 7:2. [PMID: 25643067 PMCID: PMC4306736 DOI: 10.1186/s13321-014-0050-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 04/23/2014] [Accepted: 12/11/2014] [Indexed: 11/30/2022] Open
Abstract
Background: In this study, we present an SVM-based ranking algorithm for the concurrent learning of compounds with different activity profiles and their varying prioritization. To this end, a specific labeling of each compound was elaborated in order to infer virtual screening models against multiple targets. We compared the method with several state-of-the-art SVM classification techniques that are capable of inferring multi-target screening models on three chemical data sets (cytochrome P450s, dehydrogenases, and a trypsin-like protease data set) containing three different biological targets each. Results: The experiments show that ranking-based algorithms achieve increased performance for single- and multi-target virtual screening. Moreover, compounds that do not completely fulfill the desired activity profile are still ranked higher than decoys or compounds with an entirely undesired profile, compared to other multi-target SVM methods. Conclusions: SVM-based ranking methods constitute a valuable approach for virtual screening in multi-target drug design. The utilization of such methods is most helpful when dealing with compounds with various activity profiles and when many ligands with an already perfectly matching activity profile are not to be expected.
Affiliation(s)
- Alexander Dörr
  - Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
- Lars Rosenbaum
  - Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
- Andreas Zell
  - Center for Bioinformatics Tübingen (ZBIT), University of Tuebingen, Sand 1, Tübingen, 72076 Germany
|
26
|
Carroll G, Slip D, Jonsen I, Harcourt R. Supervised accelerometry analysis can identify prey capture by penguins at sea. J Exp Biol 2014; 217:4295-302. [PMID: 25394635 DOI: 10.1242/jeb.113076] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Indexed: 11/20/2022]
Abstract
Determining where, when and how much animals eat is fundamental to understanding their ecology. We developed a technique to identify a prey capture signature for little penguins from accelerometry, in order to quantify food intake remotely. We categorised behaviour of captive penguins from HD video and matched this to time-series data from back-mounted accelerometers. We then trained a support vector machine (SVM) to classify the penguins' behaviour at 0.3 s intervals as either 'prey handling' or 'swimming'. We applied this model to accelerometer data collected from foraging wild penguins to identify prey capture events. We compared prey capture and non-prey capture dives to test the model predictions against foraging theory. The SVM had an accuracy of 84.95±0.26% (mean ± s.e.) and a false positive rate of 9.82±0.24% when tested on unseen captive data. For wild data, we defined three independent, consecutive prey handling observations as representing true prey capture, with a false positive rate of 0.09%. Dives with prey captures had longer duration and bottom times, were deeper, had faster ascent rates, and had more 'wiggles' and 'dashes' (proxies for prey encounter used in other studies). The mean (±s.e.) number of prey captures per foraging trip was 446.6±66.28. By recording the behaviour of captive animals on HD video and using a supervised machine learning approach, we show that accelerometry signatures can classify the behaviour of wild animals at unprecedentedly fine scales.
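The rule this abstract applies to wild data, treating three consecutive 'prey handling' classifications as one prey capture, can be sketched roughly as below. The abstract does not specify how longer runs or "independent" observations were handled, so counting each qualifying run exactly once is an assumption, as are the function name and the example label sequence.

```python
def count_prey_captures(labels, min_run=3, target="prey handling"):
    """Count runs of at least `min_run` consecutive `target` labels.

    Each qualifying run is counted once, however long it is (an assumption).
    """
    captures = 0
    run_length = 0
    for label in labels:
        if label == target:
            run_length += 1
            # Count the run at the moment it first reaches the threshold.
            if run_length == min_run:
                captures += 1
        else:
            run_length = 0
    return captures


# Hypothetical per-interval classifier output for one dive.
labels = [
    "swimming", "prey handling", "prey handling", "swimming",
    "prey handling", "prey handling", "prey handling", "prey handling",
    "swimming",
]
print(count_prey_captures(labels))
```

Requiring several consecutive positives trades some sensitivity for a much lower false positive rate, which matches the drop from 9.82% to 0.09% reported in the abstract.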
Affiliation(s)
- Gemma Carroll
  - Department of Biological Sciences, Macquarie University, North Ryde, Sydney, NSW 2109, Australia
- David Slip
  - Taronga Conservation Society Australia, Bradley's Head Road, Mosman, Sydney, NSW 2088, Australia
- Ian Jonsen
  - Taronga Conservation Society Australia, Bradley's Head Road, Mosman, Sydney, NSW 2088, Australia
- Rob Harcourt
  - Taronga Conservation Society Australia, Bradley's Head Road, Mosman, Sydney, NSW 2088, Australia
|
27
|
Balfer J, Bajorath J. Introduction of a methodology for visualization and graphical interpretation of Bayesian classification models. J Chem Inf Model 2014; 54:2451-68. [PMID: 25137527 DOI: 10.1021/ci500410g] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 01/14/2023]
Abstract
Supervised machine learning models are widely used in chemoinformatics, especially for the prediction of new active compounds or targets of known actives. Bayesian classification methods are among the most popular machine learning approaches for the prediction of activity from chemical structure. Much work has focused on predicting structure-activity relationships (SARs) on the basis of experimental training data. By contrast, only a few efforts have thus far been made to rationalize the performance of Bayesian or other supervised machine learning models and better understand why they might succeed or fail. In this study, we introduce an intuitive approach for the visualization and graphical interpretation of naïve Bayesian classification models. Parameters derived during supervised learning are visualized and interactively analyzed to gain insights into model performance and identify features that determine predictions. The methodology is introduced in detail and applied to assess Bayesian modeling efforts and predictions on compound data sets of varying structural complexity. Different classification models and features determining their performance are characterized in detail. A prototypic implementation of the approach is provided.
Affiliation(s)
- Jenny Balfer
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstrasse 2, D-53113 Bonn, Germany
|
28
|
Cherkasov A, Muratov EN, Fourches D, Varnek A, Baskin II, Cronin M, Dearden J, Gramatica P, Martin YC, Todeschini R, Consonni V, Kuz'min VE, Cramer R, Benigni R, Yang C, Rathman J, Terfloth L, Gasteiger J, Richard A, Tropsha A. QSAR modeling: where have you been? Where are you going to? J Med Chem 2014; 57:4977-5010. [PMID: 24351051 PMCID: PMC4074254 DOI: 10.1021/jm4004285] [Citation(s) in RCA: 1040] [Impact Index Per Article: 104.0] [Indexed: 02/08/2023]
Abstract
Quantitative structure-activity relationship modeling is one of the major computational tools employed in medicinal chemistry. However, throughout its entire history it has drawn both praise and criticism concerning its reliability, limitations, successes, and failures. In this paper, we discuss (i) the development and evolution of QSAR; (ii) the current trends, unsolved problems, and pressing challenges; and (iii) several novel and emerging applications of QSAR modeling. Throughout this discussion, we provide guidelines for QSAR development, validation, and application, which are summarized in best practices for building rigorously validated and externally predictive QSAR models. We hope that this Perspective will help communications between computational and experimental chemists toward collaborative development and use of QSAR models. We also believe that the guidelines presented here will help journal editors and reviewers apply more stringent scientific standards to manuscripts reporting new QSAR studies, as well as encourage the use of high quality, validated QSARs for regulatory decision making.
Affiliation(s)
- Artem Cherkasov
  - Vancouver Prostate Centre, University of British Columbia, Vancouver, BC, V6H3Z6, Canada
- Eugene N. Muratov
  - Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
  - Department of Molecular Structure and Cheminformatics, A.V. Bogatsky Physical-Chemical Institute National Academy of Sciences of Ukraine, Odessa, 65080, Ukraine
- Denis Fourches
  - Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
- Alexandre Varnek
  - Department of Chemistry, L. Pasteur University of Strasbourg, Strasbourg, 67000, France
- Igor I. Baskin
  - Department of Physics, Lomonosov Moscow State University, Moscow, 119991, Russia
- Mark Cronin
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool L33AF, UK
- John Dearden
  - School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, Liverpool L33AF, UK
- Paola Gramatica
  - Department of Structural and Functional Biology, University of Insubria, Varese, 21100, Italy
- Roberto Todeschini
  - Milano Chemometrics and QSAR Research Group, University of Milano-Bicocca, Milan, 20126, Italy
- Viviana Consonni
  - Milano Chemometrics and QSAR Research Group, University of Milano-Bicocca, Milan, 20126, Italy
- Victor E. Kuz'min
  - Department of Molecular Structure and Cheminformatics, A.V. Bogatsky Physical-Chemical Institute National Academy of Sciences of Ukraine, Odessa, 65080, Ukraine
- Romualdo Benigni
  - Environment and Health Department, Istituto Superiore di Sanita’, Rome, 00161, Italy
- James Rathman
  - Altamira LLC, Columbus OH 43235, USA
  - Department of Chemical and Biomolecular Engineering, the Ohio State University, Columbus, OH 43215, USA
- Ann Richard
  - National Center for Computational Toxicology, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27519, USA
- Alexander Tropsha
  - Laboratory for Molecular Modeling, UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, 27599, USA
|
29
|
Hanser T, Barber C, Rosser E, Vessey JD, Webb SJ, Werner S. Self organising hypothesis networks: a new approach for representing and structuring SAR knowledge. J Cheminform 2014; 6:21. [PMID: 24959206 PMCID: PMC4048587 DOI: 10.1186/1758-2946-6-21] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 11/29/2013] [Accepted: 03/28/2014] [Indexed: 12/01/2022] Open
Abstract
Background: Combining different sources of knowledge to build improved structure activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework to interoperate between learning techniques. Most of the current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility of directly combining these sources at the knowledge level, with the aim of harvesting potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models. Results: To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into a hierarchical network. This unification makes it possible to combine different sources of knowledge into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge, and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity. Conclusion: It is possible to represent knowledge in the unified form of a hypothesis network allowing interpretable predictions with performances comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.
|
30
|
Palczewska A, Palczewski J, Marchese Robinson R, Neagu D. Interpreting Random Forest Classification Models Using a Feature Contribution Method. Integration of Reusable Systems 2014. [DOI: 10.1007/978-3-319-04717-1_9] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Indexed: 12/13/2022]
|
31
|
Riniker S, Landrum GA. Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. J Cheminform 2013; 5:43. [PMID: 24063533 PMCID: PMC3852750 DOI: 10.1186/1758-2946-5-43] [Citation(s) in RCA: 75] [Impact Index Per Article: 6.8] [Received: 05/23/2013] [Accepted: 07/23/2013] [Indexed: 02/03/2023] Open
Abstract
Fingerprint similarity is a common method for comparing chemical structures. Similarity is an appealing approach because, with many fingerprint types, it provides intuitive results: a chemist looking at two molecules can understand why they have been determined to be similar. This transparency is partially lost with the fuzzier similarity methods that are often used for scaffold hopping and tends to vanish completely when molecular fingerprints are used as inputs to machine-learning (ML) models. Here we present similarity maps, a straightforward and general strategy to visualize the atomic contributions to the similarity between two molecules or the predicted probability of a ML model. We show the application of similarity maps to a set of dopamine D3 receptor ligands using atom-pair and circular fingerprints as well as two popular ML methods: random forests and naïve Bayes. An open-source implementation of the method is provided.
Affiliation(s)
- Sereina Riniker
  - Novartis Institutes for BioMedical Research, Basel, Switzerland
|
32
|
Polishchuk PG, Kuz'min VE, Artemenko AG, Muratov EN. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Mol Inform 2013; 32:843-53. [DOI: 10.1002/minf.201300029] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.1] [Received: 02/20/2013] [Accepted: 07/29/2013] [Indexed: 11/07/2022]
|
33
|
Rosenbaum L, Dörr A, Bauer MR, Boeckler FM, Zell A. Inferring multi-target QSAR models with taxonomy-based multi-task learning. J Cheminform 2013; 5:33. [PMID: 23842210 PMCID: PMC4104930 DOI: 10.1186/1758-2946-5-33] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 05/06/2013] [Accepted: 07/03/2013] [Indexed: 11/13/2022] Open
Abstract
Background: A plethora of studies indicate that the development of multi-target drugs is beneficial for complex diseases like cancer. Accurate QSAR models for each of the desired targets assist the optimization of a lead candidate by the prediction of affinity profiles. Often, the targets of a multi-target drug are sufficiently similar such that, in principle, knowledge can be transferred between the QSAR models to improve the model accuracy. In this study, we present two different multi-task algorithms from the field of transfer learning that can exploit the similarity between several targets to transfer knowledge between the target specific QSAR models. Results: We evaluated the two methods on simulated data and a data set of 112 human kinases assembled from the public database ChEMBL. The relatedness between the kinase targets was derived from the taxonomy of the human kinome. The experiments show that multi-task learning increases the performance compared to training separate models on both types of data given a sufficient similarity between the tasks. On the kinase data, the best multi-task approach improved the mean squared error of the QSAR models of 58 kinase targets. Conclusions: Multi-task learning is a valuable approach for inferring multi-target QSAR models for lead optimization. The application of multi-task learning is most beneficial if knowledge can be transferred from a similar task with a lot of in-domain knowledge to a task with little in-domain knowledge. Furthermore, the benefit increases with a decreasing overlap between the chemical space spanned by the tasks.
Affiliation(s)
- Lars Rosenbaum, Center for Bioinformatics (ZBIT), University of Tübingen, Sand 1, Tübingen 72076, Germany
- Alexander Dörr, Center for Bioinformatics (ZBIT), University of Tübingen, Sand 1, Tübingen 72076, Germany
- Matthias R Bauer, Institute of Pharmaceutical Sciences, University of Tübingen, Auf der Morgenstelle 8, Tübingen 72076, Germany
- Frank M Boeckler, Institute of Pharmaceutical Sciences, University of Tübingen, Auf der Morgenstelle 8, Tübingen 72076, Germany
- Andreas Zell, Center for Bioinformatics (ZBIT), University of Tübingen, Sand 1, Tübingen 72076, Germany
34
Chen H, Carlsson L, Eriksson M, Varkonyi P, Norinder U, Nilsson I. Beyond the Scope of Free-Wilson Analysis: Building Interpretable QSAR Models with Machine Learning Algorithms. J Chem Inf Model 2013; 53:1324-36. [DOI: 10.1021/ci4001376] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 11/29/2022]
Affiliation(s)
- Ulf Norinder, CNSP Innovative Medicines, AstraZeneca R&D Södertälje, Sweden
35
Vlachakis D, Tsiliki G, Pavlopoulou A, Roubelakis MG, Tsaniras SC, Kossida S. Antiviral Stratagems Against HIV-1 Using RNA Interference (RNAi) Technology. Evol Bioinform Online 2013; 9:203-13. [PMID: 23761954 PMCID: PMC3662398 DOI: 10.4137/ebo.s11412] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 12/31/2022]
Abstract
The versatility of human immunodeficiency virus (HIV)-1 and its evolutionary potential to elude antiretroviral agents by mutating may be its most invincible weapon. Viruses, including HIV, in order to adapt and survive in their environment evolve at extremely fast rates. Given that conventional approaches which have been applied against HIV have failed, novel and more promising approaches must be employed. Recent studies advocate RNA interference (RNAi) as a promising therapeutic tool against HIV. In this regard, targeting multiple HIV sites in the context of a combinatorial RNAi-based approach may efficiently stop viral propagation at an early stage. Moreover, large high-throughput RNAi screens are widely used in the fields of drug development and reverse genetics. Computer-based algorithms, bioinformatics, and biostatistical approaches have been employed in traditional medicinal chemistry discovery protocols for low molecular weight compounds. However, the diversity and complexity of RNAi screens cannot be efficiently addressed by these outdated approaches. Herein, a series of novel workflows for both wet- and dry-lab strategies are presented in an effort to provide an updated review of state-of-the-art RNAi technologies, which may enable adequate progress in the fight against the HIV-1 virus.
Affiliation(s)
- Dimitrios Vlachakis, Bioinformatics and Medical Informatics Team, Biomedical Research Foundation, Academy of Athens, Athens, Greece
36
Reutlinger M, Schneider G. Nonlinear dimensionality reduction and mapping of compound libraries for drug discovery. J Mol Graph Model 2012; 34:108-17. [PMID: 22326864 DOI: 10.1016/j.jmgm.2011.12.006] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 09/14/2011] [Revised: 12/13/2011] [Accepted: 12/14/2011] [Indexed: 01/29/2023]
Abstract
Visualization of 'chemical space' and compound distributions has received much attention from medicinal chemists, as it may help to intuitively comprehend pharmaceutically relevant molecular features. It has been realized that for meaningful feature extraction from complex multivariate chemical data, such as compound libraries represented by many molecular descriptors, nonlinear projection techniques are required. Recent advances in machine learning and artificial intelligence have resulted in a transfer of such methods to chemistry. We provide an overview of prominent visualization methods based on nonlinear dimensionality reduction, and highlight applications in drug discovery. Emphasis is on neural network techniques, kernel methods and stochastic embedding approaches, which have been successfully used for ligand-based virtual screening, SAR landscape analysis, combinatorial library design, and screening compound selection.
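As one concrete instance of the kernel methods mentioned in this abstract, the sketch below maps binary fingerprints to two dimensions with kernel PCA on a Tanimoto similarity matrix. It is an illustration under stated assumptions, not a method taken from the review: the fingerprints are random synthetic bit vectors rather than real compounds, and the helper names `tanimoto_kernel` and `kernel_pca` are hypothetical.

```python
import numpy as np

def tanimoto_kernel(F):
    """Pairwise Tanimoto similarity for a binary fingerprint matrix F (n x bits)."""
    F = F.astype(float)
    common = F @ F.T                                  # shared on-bits
    counts = F.sum(axis=1)
    union = counts[:, None] + counts[None, :] - common
    safe_union = np.where(union > 0, union, 1.0)      # avoid division by zero
    return np.where(union > 0, common / safe_union, 1.0)

def kernel_pca(K, n_components=2):
    """Project the training points onto the top kernel principal components."""
    n = K.shape[0]
    H = np.eye(n) - np.full((n, n), 1.0 / n)          # centering matrix
    Kc = H @ K @ H
    eigvals, eigvecs = np.linalg.eigh(Kc)             # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[idx], 0.0, None)            # guard against tiny negatives
    return eigvecs[:, idx] * np.sqrt(lam)

# Synthetic library (assumption): two "scaffolds", each with 20 noisy variants.
rng = np.random.default_rng(0)
n_bits = 64
prototypes = rng.integers(0, 2, size=(2, n_bits))
labels = np.repeat([0, 1], 20)
flips = rng.random((40, n_bits)) < 0.05               # flip about 5% of the bits
fingerprints = np.where(flips, 1 - prototypes[labels], prototypes[labels])

coords = kernel_pca(tanimoto_kernel(fingerprints), n_components=2)
print(coords[:3])
```

The resulting `coords` array can be passed to any scatter-plot routine; with real data, the fingerprint matrix would come from a cheminformatics toolkit instead of the random generator used here.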
Affiliation(s)
- Michael Reutlinger, Swiss Federal Institute of Technology (ETH), Department of Chemistry and Applied Biosciences, Zurich, Switzerland