1
Cerruela García G, Pérez-Parras Toledano J, de Haro García A, García-Pedrajas N. Filter feature selectors in the development of binary QSAR models. SAR and QSAR in Environmental Research 2019; 30:313-345. [PMID: 31112077 DOI: 10.1080/1062936x.2019.1588160] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 01/29/2019] [Accepted: 02/25/2019] [Indexed: 06/09/2023]
Abstract
The application of machine learning methods to the construction of quantitative structure-activity relationship models is a complex computational problem in which dimensionality reduction of the representation of the molecular structure plays a fundamental role in predicting a target activity. The feature selection pre-processing approach has been indicated to be effective in dimensionality reduction for building simpler and more understandable models. In this paper, a performance comparative study of 13 state-of-the-art feature selection filter methods is conducted. Structure-activity relationship models are constructed using three widely used classifiers and a diverse collection of datasets. The comparative study utilizes robust statistical tests to compare the algorithms. According to the experimental results, there are substantial differences in performance among the evaluated feature selection methods. The methods that exhibit the best performance are correlation-based feature selection, fast clustering-based feature selection and the set cover method.
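To make the idea of a filter concrete, the following is a minimal sketch of a univariate filter selector: each descriptor is scored independently of any classifier and the top-scoring ones are kept. It is an illustration only and is not one of the 13 methods evaluated in the paper (correlation-based feature selection, for instance, scores feature subsets rather than single features). The function names, the toy descriptor matrix, and the choice of absolute Pearson correlation as the score are all assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def filter_select(X, y, k):
    """Return the indices of the k features most correlated (in absolute value) with y."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Hypothetical toy data: 6 compounds x 4 descriptors, binary activity labels.
X = [
    [1, 0, 5.0, 1],
    [1, 1, 4.0, 1],
    [1, 0, 6.0, 1],
    [0, 1, 5.5, 1],
    [0, 0, 4.5, 1],
    [0, 1, 5.0, 1],
]
y = [1, 1, 1, 0, 0, 0]
selected = filter_select(X, y, k=2)
```

Because the score never consults a classifier, the selected subset can be reused unchanged with any of the downstream learners.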
Affiliation(s)
- G Cerruela García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- J Pérez-Parras Toledano
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- A de Haro García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
- N García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, E-14071 Córdoba, Spain
2
Cerruela García G, García-Pedrajas N. Boosted feature selectors: a case study on prediction P-gp inhibitors and substrates. J Comput Aided Mol Des 2018; 32:1273-1294. [PMID: 30367310 DOI: 10.1007/s10822-018-0171-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 06/24/2018] [Accepted: 10/18/2018] [Indexed: 01/11/2023]
Abstract
Feature selection is commonly used as a preprocessing step to machine learning for improving learning performance, lowering computational complexity and facilitating model interpretation. This paper proposes the application of boosting feature selection to improve the classification performance of standard feature selection algorithms evaluated for the prediction of P-gp inhibitors and substrates. Two well-known classification algorithms, decision trees and support vector machines, were used to classify the chemical compounds. The experimental results showed better performance for boosting feature selection with respect to the standard feature selection algorithms while maintaining the capability for feature reduction.
Affiliation(s)
- Gonzalo Cerruela García
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, 14071, Córdoba, Spain
- Nicolás García-Pedrajas
- Department of Computing and Numerical Analysis, University of Córdoba, Campus de Rabanales, Albert Einstein Building, 14071, Córdoba, Spain
3
Hsu CW, Hewes KP, Stavitskaya L, Kruhlak NL. Construction and application of (Q)SAR models to predict chemical-induced in vitro chromosome aberrations. Regul Toxicol Pharmacol 2018; 99:274-288. [PMID: 30278198 DOI: 10.1016/j.yrtph.2018.09.026] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 06/25/2018] [Revised: 09/24/2018] [Accepted: 09/26/2018] [Indexed: 12/23/2022]
Abstract
In drug development, genetic toxicology studies are conducted using in vitro and in vivo assays to identify potential mutagenic and clastogenic effects, as outlined in the International Council for Harmonisation (ICH) S2 regulatory guideline. (Quantitative) structure-activity relationship ((Q)SAR) models that predict assay outcomes can be used as an early screen to prioritize pharmaceutical candidates, or later during product development to evaluate safety when experimental data are unavailable or inconclusive. In the current study, two commercial QSAR platforms were used to build models for in vitro chromosomal aberrations in Chinese hamster lung (CHL) and Chinese hamster ovary (CHO) cells. Cross-validated CHL model predictive performance showed sensitivity of 80 and 82%, and negative predictivity of 75 and 76% based on 875 training set compounds. For CHO, sensitivity of 61 and 67% and negative predictivity of 68 and 74% was achieved based on 817 training set compounds. The predictive performance of structural alerts in a commercial expert rule-based SAR software was also investigated and showed positive predictivity of 48-100% for selected alerts. Case studies examining incorrectly-predicted compounds, non-DNA-reactive clastogens, and recently-approved pharmaceuticals are presented, exploring how an investigational approach using similarity searching and expert knowledge can improve upon individual (Q)SAR predictions of the clastogenicity of drugs.
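The performance figures quoted above are ratios of confusion-matrix counts. The sketch below shows how sensitivity, negative predictivity, and positive predictivity are conventionally computed; the counts in the example are hypothetical and are not taken from the study.

```python
def sensitivity(tp, fn):
    """Fraction of experimentally positive compounds that the model predicts positive."""
    return tp / (tp + fn)


def negative_predictivity(tn, fn):
    """Fraction of negative predictions that are experimentally negative."""
    return tn / (tn + fn)


def positive_predictivity(tp, fp):
    """Fraction of positive predictions (or alert hits) that are experimentally positive."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only.
tp, fn, tn, fp = 80, 20, 60, 40
sens = sensitivity(tp, fn)            # 80 / 100
npv = negative_predictivity(tn, fn)   # 60 / 80
ppv = positive_predictivity(tp, fp)   # 80 / 120
```

Sensitivity and negative predictivity share the false-negative count, which is why both matter for a screen intended to avoid missing clastogens.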
Affiliation(s)
- Chia-Wen Hsu
- US Food and Drug Administration, Center for Drug Evaluation and Research, Silver Spring, MD, USA
- Kurt P Hewes
- US Food and Drug Administration, Center for Drug Evaluation and Research, Silver Spring, MD, USA
- Lidiya Stavitskaya
- US Food and Drug Administration, Center for Drug Evaluation and Research, Silver Spring, MD, USA
- Naomi L Kruhlak
- US Food and Drug Administration, Center for Drug Evaluation and Research, Silver Spring, MD, USA
4
Toropov AA, Toropova AP, Raitano G, Benfenati E. CORAL: Building up QSAR models for the chromosome aberration test. Saudi J Biol Sci 2018; 26:1101-1106. [PMID: 31516335 PMCID: PMC6734133 DOI: 10.1016/j.sjbs.2018.05.013] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 11/03/2017] [Revised: 04/23/2018] [Accepted: 05/06/2018] [Indexed: 12/13/2022] Open
Abstract
A high level of chromosomal aberrations in peripheral blood lymphocytes may be an early marker of cancer risk, but data on the risk of specific cancers and on types of chromosomal aberrations are limited. Consequently, the development of predictive models for the chromosomal aberration test is an important task. The majority of models for the chromosomal aberration test are so-called knowledge-based rule systems. The CORAL software (http://www.insilico.eu/coral, an abbreviation of “CORrelation And Logic”) is an alternative to knowledge-based rule systems. In contrast to such systems, CORAL makes it possible to estimate how different molecular alerts, as well as different splits into training and validation sets, influence the predictive potential of a model. This possibility is not available in approaches based on knowledge-based rule systems. Quantitative Structure–Activity Relationships (QSAR) for the chromosome aberration test are established for five random splits into training, calibration, and validation sets. The QSAR approach is based on representing the molecular structure by the simplified molecular input-line entry system (SMILES), without data on physicochemical and/or biochemical parameters. In spite of this limitation, the statistical quality of these models is quite good.
Affiliation(s)
- Alla P. Toropova
- Corresponding author at: Laboratory of Environmental Chemistry and Toxicology, IRCCS – Istituto di Ricerche Farmacologiche Mario Negri, Via La Masa 19, 20156 Milano, Italy.
5
Fan D, Yang H, Li F, Sun L, Di P, Li W, Tang Y, Liu G. In silico prediction of chemical genotoxicity using machine learning methods and structural alerts. Toxicol Res (Camb) 2018; 7:211-220. [PMID: 30090576 PMCID: PMC6062245 DOI: 10.1039/c7tx00259a] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 09/29/2017] [Accepted: 12/14/2017] [Indexed: 01/19/2023] Open
Abstract
Genotoxicity tests can detect compounds that have an adverse effect on the process of heredity. The in vivo micronucleus assay, a genotoxicity test method, has been widely used to evaluate the presence and extent of chromosomal damage in human beings. Due to the high cost and laboriousness of experimental tests, computational approaches for predicting genotoxicity based on chemical structures and properties are recognized as an alternative. In this study, a dataset containing 641 diverse chemicals was collected and the molecules were represented by both fingerprints and molecular descriptors. Then classification models were constructed by six machine learning methods, including the support vector machine (SVM), naïve Bayes (NB), k-nearest neighbor (kNN), C4.5 decision tree (DT), random forest (RF) and artificial neural network (ANN). The performance of the models was estimated by five-fold cross-validation and an external validation set. The top ten models showed excellent performance for the external validation with accuracies ranging from 0.846 to 0.938, among which models Pubchem_SVM and MACCS_RF showed a more reliable predictive ability. The applicability domain was also defined to distinguish favorable predictions from unfavorable ones. Finally, ten structural fragments which can be used to assess the genotoxicity potential of a chemical were identified by using information gain and structural fragment frequency analysis. Our models might be helpful for the initial screening of potential genotoxic compounds.
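As an illustration of the information gain criterion mentioned above, the sketch below scores a single structural fragment by how much knowing its presence or absence reduces the entropy of the genotoxicity labels. This is the textbook definition of information gain for a binary feature; the function names and toy data are assumptions, and the paper's exact fragment-frequency procedure is not reproduced here.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


def information_gain(has_fragment, labels):
    """Entropy of labels minus the entropy remaining after splitting on fragment presence."""
    with_frag = [l for h, l in zip(has_fragment, labels) if h]
    without_frag = [l for h, l in zip(has_fragment, labels) if not h]
    n = len(labels)
    conditional = (len(with_frag) / n) * entropy(with_frag) \
        + (len(without_frag) / n) * entropy(without_frag)
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = genotoxic, 0 = non-genotoxic.
labels = [1, 1, 0, 0]
perfect_fragment = [1, 1, 0, 0]   # present exactly in the genotoxic compounds
useless_fragment = [1, 0, 1, 0]   # present equally often in both classes
gain_perfect = information_gain(perfect_fragment, labels)
gain_useless = information_gain(useless_fragment, labels)
```

A fragment with high gain and high frequency among positives is the kind of candidate that would then be inspected as a possible structural alert.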
Affiliation(s)
- Defang Fan
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Hongbin Yang
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Fuxing Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Lixia Sun
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Peiwen Di
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Weihua Li
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Yun Tang
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
- Guixia Liu
- Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China. Tel: +86-21-64250811
6
Marchese Robinson RL, Palczewska A, Palczewski J, Kidley N. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets. J Chem Inf Model 2017; 57:1773-1792. [PMID: 28715209 DOI: 10.1021/acs.jcim.6b00753] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Indexed: 11/29/2022]
Abstract
The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package ( https://r-forge.r-project.org/R/?group_id=1725 ) for the R statistical programming language and the Python program HeatMapWrapper [ https://doi.org/10.5281/zenodo.495163 ] for heat map generation.
Affiliation(s)
- Richard L Marchese Robinson
- Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom; School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, United Kingdom
- Anna Palczewska
- Department of Computing, University of Bradford, Bradford BD7 1DP, United Kingdom
- Jan Palczewski
- School of Mathematics, University of Leeds, Leeds LS2 9JT, United Kingdom
- Nathan Kidley
- Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom
7
Klambauer G, Wischenbart M, Mahr M, Unterthiner T, Mayr A, Hochreiter S. Rchemcpp: a web service for structural analoging in ChEMBL, Drugbank and the Connectivity Map. Bioinformatics 2015; 31:3392-4. [PMID: 26088801 DOI: 10.1093/bioinformatics/btv373] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/11/2015] [Accepted: 06/11/2015] [Indexed: 01/27/2023] Open
Abstract
We have developed Rchemcpp, a web service that identifies structurally similar compounds (structural analogs) in large-scale molecule databases. The service allows compounds to be queried in the widely used ChEMBL, DrugBank and Connectivity Map databases. Rchemcpp utilizes the best performing similarity functions, i.e. molecule kernels, as measures of structural similarity. Molecule kernels have shown superior performance to other similarity measures and are currently excelling at machine learning challenges. To considerably reduce computational time, and thereby make the method feasible as a web service, a novel efficient prefiltering strategy has been developed that maintains the sensitivity of the method. By exploiting information contained in public databases, the web service facilitates many applications crucial for the drug development process, such as prioritizing compounds after screening or reducing adverse side effects during late phases. Rchemcpp was used in the DeepTox pipeline that won the Tox21 Data Challenge and is frequently used by researchers in pharmaceutical companies. Availability and implementation: The web service and the R package are freely available via http://shiny.bioinf.jku.at/Analoging/ and via Bioconductor. Contact: hochreit@bioinf.jku.at. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Günter Klambauer
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
- Martin Wischenbart
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
- Michael Mahr
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
- Thomas Unterthiner
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
- Andreas Mayr
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
- Sepp Hochreiter
- Institute of Bioinformatics, Johannes Kepler University Linz, Altenbergerstr. 69, 4040 Linz, Austria
8
Balfer J, Bajorath J. Visualization and Interpretation of Support Vector Machine Activity Predictions. J Chem Inf Model 2015; 55:1136-47. [DOI: 10.1021/acs.jcim.5b00175] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Indexed: 01/28/2023]
Affiliation(s)
- Jenny Balfer
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany
9
Vlachakis D, Tsiliki G, Pavlopoulou A, Roubelakis MG, Tsaniras SC, Kossida S. Antiviral Stratagems Against HIV-1 Using RNA Interference (RNAi) Technology. Evol Bioinform Online 2013; 9:203-13. [PMID: 23761954 PMCID: PMC3662398 DOI: 10.4137/ebo.s11412] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 12/31/2022] Open
Abstract
The versatility of human immunodeficiency virus (HIV)-1 and its evolutionary potential to elude antiretroviral agents by mutating may be its most invincible weapon. Viruses, including HIV, in order to adapt and survive in their environment evolve at extremely fast rates. Given that conventional approaches which have been applied against HIV have failed, novel and more promising approaches must be employed. Recent studies advocate RNA interference (RNAi) as a promising therapeutic tool against HIV. In this regard, targeting multiple HIV sites in the context of a combinatorial RNAi-based approach may efficiently stop viral propagation at an early stage. Moreover, large high-throughput RNAi screens are widely used in the fields of drug development and reverse genetics. Computer-based algorithms, bioinformatics, and biostatistical approaches have been employed in traditional medicinal chemistry discovery protocols for low molecular weight compounds. However, the diversity and complexity of RNAi screens cannot be efficiently addressed by these outdated approaches. Herein, a series of novel workflows for both wet- and dry-lab strategies are presented in an effort to provide an updated review of state-of-the-art RNAi technologies, which may enable adequate progress in the fight against the HIV-1 virus.
Affiliation(s)
- Dimitrios Vlachakis
- Bioinformatics and Medical Informatics Team, Biomedical Research Foundation, Academy of Athens, Athens, Greece
10
Rosenbaum L, Jahn A, Dörr A, Zell A. Optimization and visualization of the edge weights in optimal assignment methods for virtual screening. BioData Min 2013; 6:7. [PMID: 23531368 PMCID: PMC3639874 DOI: 10.1186/1756-0381-6-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/15/2012] [Accepted: 03/10/2013] [Indexed: 11/17/2022] Open
Abstract
Background: Ligand-based virtual screening plays a fundamental part in the early drug discovery stage. In a virtual screening, a chemical library is searched for molecules with similar properties to a query molecule by means of a similarity function. The optimal assignment of chemical graphs has proven to be a valuable similarity function for many cheminformatic tasks, such as virtual screening. The optimal assignment assumes all atoms of a query molecule to be equally important, which is not realistic depending on the binding mode of a ligand. The importance of a query molecule's atoms can be integrated in the optimal assignment by weighting the assignment edges. We optimized the edge weights with respect to the virtual screening performance by means of evolutionary algorithms. Furthermore, we propose a visualization approach for the interpretation of the edge weights. Results: We evaluated two different evolutionary algorithms, differential evolution and particle swarm optimization, for their suitability for optimizing the assignment edge weights. The results showed that both optimization methods are suited to optimize the edge weights. Furthermore, we compared our approach to the optimal assignment with equal edge weights and two literature similarity functions on a subset of the Directory of Useful Decoys using sophisticated virtual screening performance metrics. Our approach achieved a considerably better overall and early enrichment performance. The visualization of the edge weights enables the identification of substructures that are important for a good retrieval of ligands and for the binding to the protein target. Conclusions: The optimization of the edge weights in optimal assignment methods is a valuable approach for ligand-based virtual screening experiments. The approach can be applied to any similarity function that employs the optimal assignment method, which includes a variety of similarity measures that have proven to be valuable in various cheminformatic tasks. The proposed visualization helps to get a better understanding of the binding mode of the analyzed query molecule.
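The effect of weighting the assignment edges can be sketched on a toy example. The code below finds, by brute force, the assignment of query atoms to candidate atoms that maximizes the weighted sum of pairwise atom similarities. It is a hedged illustration only: the similarity matrix, the weights, and the function name are hypothetical, the exhaustive search is practical only for very small molecules, and the evolutionary optimization of the weights described in the abstract is not shown.

```python
from itertools import permutations


def weighted_assignment(sim, weights):
    """Best one-to-one mapping of query atoms to candidate atoms.

    sim[i][j] is the similarity of query atom i to candidate atom j;
    weights[i] is the importance of query atom i.
    Assumes the candidate has at least as many atoms as the query.
    Returns (best_score, mapping), where mapping[i] is the candidate atom for query atom i.
    """
    n_query = len(sim)
    n_candidate = len(sim[0])
    best_score, best_mapping = float("-inf"), None
    for mapping in permutations(range(n_candidate), n_query):
        score = sum(weights[i] * sim[i][mapping[i]] for i in range(n_query))
        if score > best_score:
            best_score, best_mapping = score, mapping
    return best_score, best_mapping


# Hypothetical 2 x 2 atom similarity matrix.
sim = [
    [0.9, 0.8],
    [0.8, 0.1],
]
equal_score, equal_map = weighted_assignment(sim, [1.0, 1.0])
weighted_score, weighted_map = weighted_assignment(sim, [1.0, 0.0])
```

With equal weights the best mapping crosses the atoms, whereas giving the second query atom zero weight lets the first atom take its most similar partner, so the weights change which assignment is optimal and not only its score.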
Affiliation(s)
- Lars Rosenbaum
- University of Tübingen, Center for Bioinformatics (ZBIT), Sand 1, 72076 Tübingen, Germany
11
Akutsu T, Nagamochi H. Comparison and enumeration of chemical graphs. Comput Struct Biotechnol J 2013; 5:e201302004. [PMID: 24688697 PMCID: PMC3962186 DOI: 10.5936/csbj.201302004] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 10/27/2012] [Revised: 12/23/2012] [Accepted: 12/24/2012] [Indexed: 11/22/2022] Open
Abstract
Chemical compounds are usually represented as graph structured data in computers. In this review article, we overview several graph classes relevant to chemical compounds and the computational complexities of several fundamental problems for these graph classes. In particular, we consider the following problems: determining whether two chemical graphs are identical, determining whether one input chemical graph is a part of the other input chemical graph, finding a maximum common part of two input graphs, finding a reaction atom mapping, enumerating possible chemical graphs, and enumerating stereoisomers. We also discuss the relationship between the fifth problem and kernel functions for chemical compounds.
Affiliation(s)
- Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
- Hiroshi Nagamochi
- Graduate School of Informatics, Kyoto University, Yoshida, Kyoto 606-8501, Japan
12
Vogt M, Bajorath J. Chemoinformatics: A view of the field and current trends in method development. Bioorg Med Chem 2012; 20:5317-23. [DOI: 10.1016/j.bmc.2012.03.030] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 01/23/2012] [Revised: 03/09/2012] [Accepted: 03/12/2012] [Indexed: 12/18/2022]
13
Rosenbaum L, Hinselmann G, Jahn A, Zell A. Interpreting linear support vector machine models with heat map molecule coloring. J Cheminform 2011; 3:11. [PMID: 21439031 PMCID: PMC3076244 DOI: 10.1186/1758-2946-3-11] [Citation(s) in RCA: 43] [Impact Index Per Article: 3.1] [Received: 12/31/2010] [Accepted: 03/25/2011] [Indexed: 11/17/2022] Open
Abstract
Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source for machine learning algorithms to infer such models. Besides strong performance, the interpretability of a machine learning model is a desired property to guide the optimization of a compound in later drug discovery stages. Linear support vector machines have been shown to perform convincingly on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique to interpret linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity. Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor. Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to get a better understanding of the binding mode of an inhibitor.
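The mapping from linear model weights to per-atom importance can be sketched as follows. Each substructure feature carries a model weight, and that weight is distributed over the atoms in which the feature occurs; the per-atom totals would then be mapped to colors. Splitting a weight evenly among a feature's atoms is an assumption made for this illustration and is not necessarily the scheme used in the paper; the feature names, weights, and function name are hypothetical.

```python
def atom_scores(n_atoms, feature_occurrences, weights):
    """Accumulate linear-model feature weights onto the atoms each feature covers.

    feature_occurrences maps a feature id to a list of atom-index tuples,
    one tuple per occurrence of that feature in the molecule.
    weights maps a feature id to its weight in the linear model
    (features without a weight contribute nothing).
    """
    scores = [0.0] * n_atoms
    for feature_id, occurrences in feature_occurrences.items():
        weight = weights.get(feature_id, 0.0)
        for atoms in occurrences:
            share = weight / len(atoms)  # assumption: even split among covered atoms
            for atom in atoms:
                scores[atom] += share
    return scores


# Hypothetical three-atom molecule with two substructure features.
feature_occurrences = {
    "C-O": [(0, 1)],   # a bond feature covering atoms 0 and 1
    "N": [(2,)],       # a single-atom feature on atom 2
}
weights = {"C-O": 1.0, "N": -0.5}
scores = atom_scores(3, feature_occurrences, weights)
```

Positive totals would be drawn in one color and negative totals in another, so atoms 0 and 1 in this toy case would be highlighted as favorable for activity and atom 2 as unfavorable.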
Affiliation(s)
- Lars Rosenbaum
- University of Tübingen, Center for Bioinformatics (ZBIT), Sand 1, 72076 Tübingen, Germany