1. Hutter MC. Differential Multimolecule Fingerprint for Similarity Search: Making Use of Active and Inactive Compound Sets in Virtual Screening. J Chem Inf Model 2022; 62:2726-2736. [PMID: 35613341 DOI: 10.1021/acs.jcim.2c00242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022]
Abstract
In conventional fingerprint methods, the similarity between two molecules is calculated using the Tanimoto index as a numerical criterion. Thus, the query molecules in virtual screening should be as representative as possible of the wanted compound class. In the concept introduced here, all available active molecules form a multimolecule fingerprint in which the appearing features are weighted according to their respective frequency. The features of inactive molecules are treated likewise, and the resulting values are subtracted from those of the active ones. The obtained differential multimolecule fingerprint (DMMFP) is thus specific for the respective class of compounds. To account for the noninteger representation within this fingerprint, a modified Sørensen-Dice coefficient is used to compute the similarity. Potentially active molecules yield positive scores, whereas presumably inactive ones are denoted by negative values. The concept was applied to angiotensin-converting enzyme (ACE) inhibitors, β2-adrenoceptor ligands, leukotriene A4 hydrolase inhibitors, dopamine D3 antagonists, and cytochrome CYP2C9 substrates, for which experimental binding affinities are known, and was tested against decoys from DUD-E and a further background database consisting of molecules from the dark chemical matter, which comprises compounds that appear as frequent hitters across multiple assays. Using the 166 publicly available keys of the MACCS fingerprint and the larger PubChem fingerprint, actives were recovered with very high sensitivity. Furthermore, three marketed ACE inhibitors as well as the carbonic anhydrase II inhibitor dorzolamide were detected in the dark chemical matter data set. For comparison, the DMMFP was also used with a Bayesian classifier, for which the specificity (correctly classified inactives) and likewise the accuracy were superior. Conversely, the similarity score produced by the Sørensen-Dice coefficient showed its potential for the early recognition of (potentially) active molecules.
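The construction described in this abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the exact form of the modified Sørensen-Dice coefficient, so the scoring function below is an assumed variant, and the function names (`feature_frequencies`, `differential_fingerprint`, `dice_like_score`) and the toy 4-bit fingerprints are hypothetical.

```python
def feature_frequencies(fingerprints):
    """Fraction of molecules in which each fingerprint bit is set."""
    n_molecules = len(fingerprints)
    n_bits = len(fingerprints[0])
    return [sum(fp[i] for fp in fingerprints) / n_molecules for i in range(n_bits)]


def differential_fingerprint(actives, inactives):
    """Frequency-weighted features of the actives minus those of the inactives.

    Each entry lies in [-1, 1]: positive for features typical of actives,
    negative for features typical of inactives.
    """
    freq_active = feature_frequencies(actives)
    freq_inactive = feature_frequencies(inactives)
    return [a - b for a, b in zip(freq_active, freq_inactive)]


def dice_like_score(dmmfp, query):
    """ASSUMED Dice-style score for a binary query against a real-valued DMMFP.

    The numerator sums the signed weights on the bits set in the query; the
    denominator adds the total absolute weight of the DMMFP to the number of
    bits set in the query. The result lies in [-1, 1]. The published
    coefficient may differ from this form.
    """
    overlap = sum(w for w, bit in zip(dmmfp, query) if bit)
    denominator = sum(abs(w) for w in dmmfp) + sum(query)
    return 2.0 * overlap / denominator if denominator else 0.0


# Toy 4-bit fingerprints (hypothetical data, not MACCS or PubChem keys).
actives = [[1, 1, 0, 0], [1, 0, 1, 0]]
inactives = [[0, 0, 1, 1], [0, 1, 1, 1]]
dmmfp = differential_fingerprint(actives, inactives)

print(dice_like_score(dmmfp, [1, 1, 0, 0]))  # positive: resembles the actives
print(dice_like_score(dmmfp, [0, 0, 1, 1]))  # negative: resembles the inactives
```

With real data, the bit vectors would come from a fingerprint such as the MACCS keys, and the sign of the score would serve as the first indication of whether a query resembles the active or the inactive set.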
Affiliation(s)
- Michael C Hutter
- Center for Bioinformatics, Saarland University, Campus E2.1, 66123 Saarbruecken, Germany
2. Carbon-Mangels M, Hutter MC. Selecting Relevant Descriptors for Classification by Bayesian Estimates: A Comparison with Decision Trees and Support Vector Machines Approaches for Disparate Data Sets. Mol Inform 2011; 30:885-95. [PMID: 27468108 DOI: 10.1002/minf.201100069] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Received: 04/28/2011] [Accepted: 08/19/2011] [Indexed: 11/12/2022]
Abstract
Classification algorithms suffer from the curse of dimensionality, which leads to overfitting, particularly if the problem is over-determined. Therefore, it is of particular interest to identify the most relevant descriptors to reduce the complexity. We applied Bayesian estimates to model the probability distribution of descriptor values used for binary classification using n-fold cross-validation. As a measure for the discriminative power of the classifiers, the symmetric form of the Kullback-Leibler divergence of their probability distributions was computed. We found that the most relevant descriptors possess a Gaussian-like distribution of their values, show the largest divergences, and therefore appear most often in the cross-validation scenario. The results were compared to those of the LASSO feature selection method applied to multiple decision trees and support vector machine approaches for data sets of substrates and nonsubstrates of three cytochrome P450 isoenzymes, which comprise strongly unbalanced compound distributions. In contrast to decision trees and support vector machines, the performance of Bayesian estimates is less affected by unbalanced data sets. This strategy reveals those descriptors that allow a simple linear separation of the classes, whereas the superior accuracy of decision trees and support vector machines can be attributed to nonlinear separation, which is in turn more prone to overfitting.
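The divergence-based ranking mentioned in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: each class's descriptor values are modeled here as a single Gaussian fitted by sample mean and variance, the "symmetric form" is taken to be the sum of the two directed Kullback-Leibler divergences, and the names `gaussian_kl`, `symmetric_kl`, and `rank_descriptors` as well as the example descriptor values are hypothetical.

```python
import math
from statistics import mean, variance


def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Directed KL divergence KL(p || q) between two univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)


def symmetric_kl(values_a, values_b):
    """ASSUMED symmetric form: KL(a || b) + KL(b || a) for fitted Gaussians.

    Both samples need at least two values and a nonzero variance.
    """
    mu_a, var_a = mean(values_a), variance(values_a)
    mu_b, var_b = mean(values_b), variance(values_b)
    return gaussian_kl(mu_a, var_a, mu_b, var_b) + gaussian_kl(mu_b, var_b, mu_a, var_a)


def rank_descriptors(class_a, class_b):
    """Sort descriptor names by decreasing divergence between the two classes.

    class_a and class_b map each descriptor name to its list of values.
    """
    scores = {name: symmetric_kl(class_a[name], class_b[name]) for name in class_a}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical descriptor values for two classes (e.g. substrates vs nonsubstrates).
substrates = {"d1": [1.0, 1.2, 0.8], "d2": [1.0, 2.0, 3.0]}
nonsubstrates = {"d1": [3.0, 3.1, 2.9], "d2": [1.5, 2.5, 3.5]}
print(rank_descriptors(substrates, nonsubstrates))  # d1 separates the classes better than d2
```

In a cross-validation setting, such a ranking would be recomputed on each training fold, and the descriptors that repeatedly appear at the top would be retained.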
Affiliation(s)
- Miriam Carbon-Mangels
- Section of Biostatistics, Paul-Ehrlich-Institut, Federal Institute for Vaccines and Biomedicines, Paul-Ehrlich-Straße 51-59, 63225 Langen, Germany
- Michael C Hutter
- Center for Bioinformatics, Saarland University, Campus Building E2.1, 66123 Saarbrücken, Germany; phone/fax: +49 681 302 70703/70702
3. Ramesh M, Bharatam PV. CYP isoform specificity toward drug metabolism: analysis using common feature hypothesis. J Mol Model 2011; 18:709-20. [DOI: 10.1007/s00894-011-1105-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 01/31/2011] [Accepted: 04/20/2011] [Indexed: 02/02/2023]
4. Michielan L, Terfloth L, Gasteiger J, Moro S. Comparison of Multilabel and Single-Label Classification Applied to the Prediction of the Isoform Specificity of Cytochrome P450 Substrates. J Chem Inf Model 2009; 49:2588-605. [DOI: 10.1021/ci900299a] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.9] [Indexed: 01/09/2023]
Affiliation(s)
- Lisa Michielan
- Molecular Modeling Section (MMS), Dipartimento di Scienze Farmaceutiche, Università di Padova, via Marzolo 5, I-35131, Padova, Italy, Molecular Networks GmbH, Henkestrasse 91, D-91052, Erlangen, Germany, and Computer-Chemie-Centrum and Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany
- Lothar Terfloth
- Molecular Modeling Section (MMS), Dipartimento di Scienze Farmaceutiche, Università di Padova, via Marzolo 5, I-35131, Padova, Italy, Molecular Networks GmbH, Henkestrasse 91, D-91052, Erlangen, Germany, and Computer-Chemie-Centrum and Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany
- Johann Gasteiger
- Molecular Modeling Section (MMS), Dipartimento di Scienze Farmaceutiche, Università di Padova, via Marzolo 5, I-35131, Padova, Italy, Molecular Networks GmbH, Henkestrasse 91, D-91052, Erlangen, Germany, and Computer-Chemie-Centrum and Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany
- Stefano Moro
- Molecular Modeling Section (MMS), Dipartimento di Scienze Farmaceutiche, Università di Padova, via Marzolo 5, I-35131, Padova, Italy, Molecular Networks GmbH, Henkestrasse 91, D-91052, Erlangen, Germany, and Computer-Chemie-Centrum and Institut für Organische Chemie, Universität Erlangen-Nürnberg, Nägelsbachstrasse 25, D-91052, Erlangen, Germany