1. Mastropietro A, Bajorath J. Protocol to explain support vector machine predictions via exact Shapley value computation. STAR Protoc 2024; 5:103010. [PMID: 38607924; PMCID: PMC11017346; DOI: 10.1016/j.xpro.2024.103010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Revised: 03/11/2024] [Accepted: 03/25/2024] [Indexed: 04/14/2024] Open
Abstract
Shapley values from cooperative game theory are adapted for explaining machine learning predictions. For large feature sets used in machine learning, Shapley values are approximated. We present a protocol for two techniques for explaining support vector machine predictions with exact Shapley value computation. We detail the application of these algorithms and provide ready-to-use Python scripts and custom code. The final output of the protocol includes quantitative feature analysis and mapping of important features for visualization. For complete details on the use and execution of this protocol, please refer to Feldmann and Bajorath (ref. 1) and Mastropietro et al. (ref. 2).
Affiliation(s)
- Andrea Mastropietro
  - Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
2. Sidorov P, Tsuji N. A Primer on 2D Descriptors in Selectivity Modeling for Asymmetric Catalysis. Chemistry 2024; 30:e202302837. [PMID: 38010242; DOI: 10.1002/chem.202302837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 11/21/2023] [Accepted: 11/23/2023] [Indexed: 11/29/2023]
Abstract
Machine learning has permeated all fields of research, including chemistry, and is now an integral part of the design of novel compounds with desired properties. In the field of asymmetric catalysis, the preference still lies with models based on a physical understanding of the catalysis phenomenon and the electronic and steric properties of catalysts. However, such models require quantum chemical calculations and are thus limited by their computational cost. Here, we highlight the recent advances in modeling catalyst selectivity by using the 2D structures of catalysts and substrates. While these have a less explicit mechanistic connection to the modeled property, 2D descriptors, such as topological indices, molecular fingerprints, and fragments, offer the tremendous advantages of low cost and high speed of calculations. This makes them optimal for the in silico screening of large amounts of data. We provide an overview of the common quantitative structure-property relationship workflow, model building and validation techniques, applications of these methodologies in asymmetric catalysis design, and an outlook on improving the understanding of 2D-based models.
Affiliation(s)
- Pavel Sidorov
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Sapporo, 001-0021, Japan
- Nobuya Tsuji
  - Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Sapporo, 001-0021, Japan
3. Janela T, Bajorath J. Anatomy of Potency Predictions Focusing on Structural Analogues with Increasing Potency Differences Including Activity Cliffs. J Chem Inf Model 2023; 63:7032-7044. [PMID: 37943257; DOI: 10.1021/acs.jcim.3c01530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2023]
Abstract
Potency predictions are popular in compound design and optimization but are complicated by intrinsic limitations. Moreover, even for nonlinear methods, activity cliffs (ACs, formed by structural analogues with large potency differences) represent challenging test cases for compound potency predictions. We have devised a new test system for potency predictions, including AC compounds, that is based on partitioned matched molecular pairs (MMP) and makes it possible to monitor prediction accuracy at the level of analogue pairs with increasing potency differences. The results of systematic predictions using different machine learning and control methods on MMP-based data sets revealed increasing prediction errors when potency differences between corresponding training and test compounds increased, including large prediction errors for AC compounds. At the global level, these prediction errors were not apparent due to the statistical dominance of analogue pairs with small potency differences. Test compounds from such pairs were accurately predicted and determined the observed global prediction accuracy. Shapley value analysis, an explainable artificial intelligence approach, was applied to identify structural features determining potency predictions using different methods. The analysis revealed that numerical predictions of different regression models were determined by features that were shared by MMP partner compounds or absent in these compounds, with opposing effects. These findings provided another rationale for accurate predictions of similar potency values for structural analogues and failures in predicting the potency of AC compounds.
Affiliation(s)
- Tiago Janela
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
4. Mastropietro A, Feldmann C, Bajorath J. Calculation of exact Shapley values for explaining support vector machine models using the radial basis function kernel. Sci Rep 2023; 13:19561. [PMID: 37949930; PMCID: PMC10638308; DOI: 10.1038/s41598-023-46930-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/01/2023] [Accepted: 11/07/2023] [Indexed: 11/12/2023] Open
Abstract
Machine learning (ML) algorithms are extensively used in pharmaceutical research. Most ML models have black-box character, thus preventing the interpretation of predictions. However, rationalizing model decisions is of critical importance if predictions should aid in experimental design. Accordingly, in interdisciplinary research, there is growing interest in explaining ML models. Methods devised for this purpose are a part of the explainable artificial intelligence (XAI) spectrum of approaches. In XAI, the Shapley value concept originating from cooperative game theory has become popular for identifying features determining predictions. The Shapley value concept has been adapted as a model-agnostic approach for explaining predictions. Since the computational time required for Shapley value calculations scales exponentially with the number of features used, local approximations such as Shapley additive explanations (SHAP) are usually required in ML. The support vector machine (SVM) algorithm is one of the most popular ML methods in pharmaceutical research and beyond. SVM models are often explained using SHAP. However, there is only limited correlation between SHAP and exact Shapley values, as previously demonstrated for SVM calculations using the Tanimoto kernel, which limits SVM model explanation. Since the Tanimoto kernel is a special kernel function mostly applied for assessing chemical similarity, we have developed the Shapley value-expressed radial basis function (SVERAD), a computationally efficient approach for the calculation of exact Shapley values for SVM models based upon radial basis function kernels that are widely applied in different areas. SVERAD is shown to produce meaningful explanations of SVM predictions.
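The exponential scaling mentioned in this abstract can be illustrated with a brute-force computation. The following is a minimal sketch of the textbook Shapley value definition, not the SVERAD algorithm described by the authors: it enumerates every coalition of features, so the number of value-function evaluations grows as 2^n. The function names `exact_shapley` and `toy_value` and the toy weights are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions.

    value_fn maps a frozenset of feature indices to a number.
    Cost grows exponentially with n_features, which is why
    approximations are usually needed for large feature sets.
    """
    players = list(range(n_features))
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley formula.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(coalition):
    """Hypothetical value function: additive terms plus one interaction."""
    weights = [1.0, 2.0, 3.0]  # illustrative numbers, not from any study
    total = sum(weights[j] for j in coalition)
    if 0 in coalition and 1 in coalition:
        total += 4.0  # interaction shared by features 0 and 1
    return total


phi = exact_shapley(toy_value, 3)
print(phi)
```

In this toy game the interaction term is split equally between features 0 and 1, and the values sum to the difference between the value of the full feature set and that of the empty set (the efficiency property).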
Affiliation(s)
- Andrea Mastropietro
  - Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University of Rome, 00185, Rome, Italy
- Christian Feldmann
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
5. Gambacorta N, Ciriaco F, Amoroso N, Altomare CD, Bajorath J, Nicolotti O. CIRCE: Web-Based Platform for the Prediction of Cannabinoid Receptor Ligands Using Explainable Machine Learning. J Chem Inf Model 2023; 63:5916-5926. [PMID: 37675493; DOI: 10.1021/acs.jcim.3c00914] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 09/08/2023]
Abstract
The endocannabinoid system, which includes cannabinoid receptor 1 and 2 subtypes (CB1R and CB2R, respectively), is responsible for the onset of various pathologies including neurodegeneration, cancer, neuropathic and inflammatory pain, obesity, and inflammatory bowel disease. Given the high similarity of CB1R and CB2R, generating subtype-selective ligands is still an open challenge. In this work, the Cannabinoid Iterative Revaluation for Classification and Explanation (CIRCE) compound prediction platform has been generated based on explainable machine learning to support the design of selective CB1R and CB2R ligands. Multilayer classifiers were combined with Shapley value analysis to facilitate explainable predictions. In test calculations, CIRCE predictions reached ∼80% accuracy and structural features determining ligand predictions were rationalized. CIRCE was designed as a web-based prediction platform that is made freely available as a part of our study.
Affiliation(s)
- Nicola Gambacorta
  - Dipartimento di Farmacia Scienze del Farmaco, Università degli Studi di Bari "Aldo Moro", Via E. Orabona, 4, I-70125 Bari, Italy
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
- Fulvio Ciriaco
  - Dipartimento di Chimica, Università degli Studi di Bari "Aldo Moro", Via E. Orabona, 4, I-70125 Bari, Italy
- Nicola Amoroso
  - Dipartimento di Farmacia Scienze del Farmaco, Università degli Studi di Bari "Aldo Moro", Via E. Orabona, 4, I-70125 Bari, Italy
- Cosimo Damiano Altomare
  - Dipartimento di Farmacia Scienze del Farmaco, Università degli Studi di Bari "Aldo Moro", Via E. Orabona, 4, I-70125 Bari, Italy
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
- Orazio Nicolotti
  - Dipartimento di Farmacia Scienze del Farmaco, Università degli Studi di Bari "Aldo Moro", Via E. Orabona, 4, I-70125 Bari, Italy
6. Amara K, Rodríguez-Pérez R, Jiménez-Luna J. Explaining compound activity predictions with a substructure-aware loss for graph neural networks. J Cheminform 2023; 15:67. [PMID: 37491407; PMCID: PMC10369817; DOI: 10.1186/s13321-023-00733-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 07/08/2023] [Indexed: 07/27/2023] Open
Abstract
Explainable machine learning is increasingly used in drug discovery to help rationalize compound property predictions. Feature attribution techniques are popular choices to identify which molecular substructures are responsible for a predicted property change. However, established molecular feature attribution methods have so far displayed low performance for popular deep learning algorithms such as graph neural networks (GNNs), especially when compared with simpler modeling alternatives such as random forests coupled with atom masking. To mitigate this problem, a modification of the regression objective for GNNs is proposed to specifically account for common core structures between pairs of molecules. The presented approach shows higher accuracy on a recently proposed explainability benchmark. This methodology has the potential to assist with model explainability in drug discovery pipelines, particularly in lead optimization efforts where specific chemical series are investigated.
Affiliation(s)
- Kenza Amara
  - Microsoft Research AI4Science, 21 Station Rd., Cambridge, CB1 2FB UK
  - Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8050 Zurich, Switzerland
- José Jiménez-Luna
  - Microsoft Research AI4Science, 21 Station Rd., Cambridge, CB1 2FB UK
7. Smith JP, Milligan K, McCarthy KD, Mchembere W, Okeyo E, Musau SK, Okumu A, Song R, Click ES, Cain KP. Machine learning to predict bacteriologic confirmation of Mycobacterium tuberculosis in infants and very young children. PLOS Digital Health 2023; 2:e0000249. [PMID: 37195976; PMCID: PMC10191346; DOI: 10.1371/journal.pdig.0000249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2022] [Accepted: 04/04/2023] [Indexed: 05/19/2023]
Abstract
Diagnosis of tuberculosis (TB) among young children (<5 years) is challenging due to the paucibacillary nature of clinical disease and clinical similarities to other childhood diseases. We used machine learning to develop accurate prediction models of microbial confirmation with simply defined and easily obtainable clinical, demographic, and radiologic factors. We evaluated eleven supervised machine learning models (using stepwise regression, regularized regression, decision tree, and support vector machine approaches) to predict microbial confirmation in young children (<5 years) using samples from invasive (reference-standard) or noninvasive procedures. Models were trained and tested using data from a large prospective cohort of young children with symptoms suggestive of TB in Kenya. Model performance was evaluated using areas under the receiver operating characteristic curve (AUROC) and precision-recall curve (AUPRC), accuracy metrics (i.e., sensitivity, specificity), F-beta scores, Cohen's kappa, and the Matthews correlation coefficient. Among 262 included children, 29 (11%) were microbially confirmed using any sampling technique. Models were accurate at predicting microbial confirmation in samples obtained from invasive procedures (AUROC range: 0.84-0.90) and from noninvasive procedures (AUROC range: 0.83-0.89). History of household contact with a confirmed case of TB, immunological evidence of TB infection, and a chest x-ray consistent with TB disease were consistently influential across models. Our results suggest machine learning can accurately predict microbial confirmation of M. tuberculosis in young children using simply defined features and increase the bacteriologic yield in diagnostic cohorts. These findings may facilitate clinical decision making and guide clinical research into novel biomarkers of TB disease in young children.
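Several of the threshold-based metrics named in this abstract follow directly from a binary confusion matrix. The sketch below uses their standard definitions in plain Python. The function name `binary_metrics` and the label vectors are hypothetical and are not taken from the study, and the study's own evaluation code may differ.

```python
from math import sqrt


def binary_metrics(y_true, y_pred, beta=1.0):
    """Confusion-matrix metrics for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0

    # F-beta weights recall (sensitivity) beta times as much as precision.
    b2 = beta * beta
    denom_f = b2 * precision + sensitivity
    fbeta = (1 + b2) * precision * sensitivity / denom_f if denom_f else 0.0

    # Matthews correlation coefficient.
    denom_mcc = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom_mcc if denom_mcc else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_obs = (tp + tn) / n
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    kappa = (p_obs - p_exp) / (1 - p_exp) if p_exp != 1 else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fbeta": fbeta,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical labels for ten cases (1 = confirmed, 0 = not confirmed).
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

AUROC and AUPRC are omitted from this sketch because they require continuous prediction scores rather than hard class labels.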
Affiliation(s)
- Jonathan P. Smith
  - Department of Health Policy and Management, Yale School of Public Health, New Haven, Connecticut, United States of America
  - Division of Global HIV and Tuberculosis, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Kyle Milligan
  - Division of Global HIV and Tuberculosis, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
  - Peraton, Atlanta, Georgia, United States of America
- Kimberly D. McCarthy
  - Division of Global HIV and Tuberculosis, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Walter Mchembere
  - Center for Global Health Research, Kenya Medical Research Institute, Kisumu, Kenya
- Elisha Okeyo
  - Center for Global Health Research, Kenya Medical Research Institute, Kisumu, Kenya
- Susan K. Musau
  - Center for Global Health Research, Kenya Medical Research Institute, Kisumu, Kenya
- Albert Okumu
  - Center for Global Health Research, Kenya Medical Research Institute, Kisumu, Kenya
- Rinn Song
  - Oxford Vaccine Group, Department of Paediatrics, University of Oxford, Oxford, United Kingdom
  - Department of Pediatrics, Harvard Medical School, Boston, Massachusetts, United States of America
- Eleanor S. Click
  - Division of Global HIV and Tuberculosis, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
- Kevin P. Cain
  - Division of Global HIV and Tuberculosis, Centers for Disease Control and Prevention, Atlanta, Georgia, United States of America
8. Siemers FM, Bajorath J. Differences in learning characteristics between support vector machine and random forest models for compound classification revealed by Shapley value analysis. Sci Rep 2023; 13:5983. [PMID: 37045972; PMCID: PMC10097675; DOI: 10.1038/s41598-023-33215-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 04/09/2023] [Indexed: 04/14/2023] Open
Abstract
The random forest (RF) and support vector machine (SVM) methods are mainstays in molecular machine learning (ML) and compound property prediction. We have explored in detail how binary classification models derived using these algorithms arrive at their predictions. To this end, approaches from explainable artificial intelligence (XAI) are applicable, such as the Shapley value concept originating from game theory, which we adapted and further extended for our analysis. In large-scale activity-based compound classification using models derived from training sets of increasing size, RF and SVM with the Tanimoto kernel produced very similar predictions that could hardly be distinguished. However, Shapley value analysis revealed that their learning characteristics systematically differed and that chemically intuitive explanations of accurate RF and SVM predictions had different origins.
Affiliation(s)
- Friederike Maite Siemers
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany