1
Zhang R, Nolte D, Sanchez-Villalobos C, Ghosh S, Pal R. Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling. Nat Commun 2024; 15:5072. [PMID: 38871711 DOI: 10.1038/s41467-024-49372-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2023] [Accepted: 06/04/2024] [Indexed: 06/15/2024]
Abstract
Quantitative structure-activity relationship (QSAR) modeling is a powerful tool for drug discovery, yet the lack of interpretability of commonly used QSAR models hinders their application in molecular design. We propose a similarity-based regression framework, topological regression (TR), that offers a statistically grounded, computationally fast, and interpretable technique to predict drug responses. We compare the predictive performance of TR on 530 ChEMBL human target activity datasets against the predictive performance of deep-learning-based QSAR models. Our results suggest that our sparse TR model can achieve equal, if not better, performance than the deep learning-based QSAR models and provide better intuitive interpretation by extracting an approximate isometry between the chemical space of the drugs and their activity space.
Affiliation(s)
- Ruibo Zhang
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA
| | - Daniel Nolte
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA
| | - Cesar Sanchez-Villalobos
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA
| | - Souparno Ghosh
- Department of Statistics, University of Nebraska - Lincoln, Lincoln, NB, 68588, USA.
| | - Ranadip Pal
- Department of Electrical and Computer Engineering, Texas Tech University, Lubbock, TX, 79409, USA.
2
Zhou Y, Wang Z, Huang Z, Li W, Chen Y, Yu X, Tang Y, Liu G. In silico prediction of ocular toxicity of compounds using explainable machine learning and deep learning approaches. J Appl Toxicol 2024; 44:892-907. [PMID: 38329145 DOI: 10.1002/jat.4586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 01/16/2024] [Accepted: 01/16/2024] [Indexed: 02/09/2024]
Abstract
The accurate identification of chemicals with ocular toxicity is of paramount importance in health hazard assessment. In contemporary chemical toxicology, there is a growing emphasis on refining, reducing, and replacing animal testing in safety evaluations. Therefore, the development of robust computational tools is crucial for regulatory applications. The performance of predictive models is heavily reliant on the quality and quantity of data. In this investigation, we amalgamated the most extensive dataset (4901 compounds) sourced from governmental GHS-compliant databases and literature to develop binary classification models of chemical ocular toxicity. We employed 12 molecular representations in conjunction with six machine learning algorithms and two deep learning algorithms to create a series of binary classification models. The findings indicated that the deep learning method GCN outperformed the machine learning models in cross-validation, achieving an impressive AUC of 0.915. However, the top-performing machine learning model (RF-Descriptor) demonstrated excellent performance with an AUC of 0.869 on the test set and was therefore selected as the best model. To enhance model interpretability, we conducted the SHAP method and attention weights analysis. The two approaches offered visual depictions of the relevance of key descriptors and substructures in predicting ocular toxicity of chemicals. Thus, we successfully struck a delicate balance between data quality and model interpretability, rendering our model valuable for predicting and comprehending potential ocular-toxic compounds in the early stages of drug discovery.
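The AUC values quoted in this abstract can be read as the probability that a randomly chosen toxic compound receives a higher score than a randomly chosen non-toxic one. The following is a minimal sketch of that pairwise definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions and are not taken from the cited study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of model scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three positives, three negatives.
toy_labels = [1, 0, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
print(roc_auc(toy_labels, toy_scores))
```

The quadratic loop is chosen for readability; for data sets of several thousand compounds a rank-based implementation from an established library would normally be preferred.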
Affiliation(s)
- Yiqing Zhou
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Ze Wang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Zejun Huang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Weihua Li
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Yuanting Chen
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Xinxin Yu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Yun Tang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
| | - Guixia Liu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
3
Lai J, Chen Z, Liu J, Zhu C, Huang H, Yi Y, Cai G, Liao N. A radiogenomic multimodal and whole-transcriptome sequencing for preoperative prediction of axillary lymph node metastasis and drug therapeutic response in breast cancer: a retrospective, machine learning and international multicohort study. Int J Surg 2024; 110:2162-2177. [PMID: 38215256 PMCID: PMC11019980 DOI: 10.1097/js9.0000000000001082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Accepted: 12/27/2023] [Indexed: 01/14/2024]
Abstract
BACKGROUND Axillary lymph node (ALN) status serves as a crucial prognostic indicator in breast cancer (BC). The aim of this study was to construct a radiogenomic multimodal model, based on machine learning and whole-transcriptome sequencing (WTS), to accurately evaluate the risk of ALN metastasis (ALNM) and drug therapeutic response, and to avoid unnecessary axillary surgery in BC patients. METHODS This study conducted a retrospective analysis of 1078 BC patients from The Cancer Genome Atlas (TCGA), The Cancer Imaging Archive (TCIA), and the Foshan cohort. These patients were divided into the TCIA cohort (N = 103), TCIA validation cohort (N = 51), Duke cohort (N = 138), Foshan cohort (N = 106), and TCGA cohort (N = 680). Radiological features were extracted from BC radiological images and differentially expressed gene expression was calibrated using technology. A support vector machine model was employed to screen radiological and genetic features, and a multimodal model was established based on radiogenomic and clinical pathological features to predict ALNM. The accuracy of the model predictions was assessed using the area under the curve (AUC) and the clinical benefit was measured using decision curve analysis. Risk stratification analysis of BC patients was performed by gene set enrichment analysis, differential comparison of immune checkpoint gene expression, and drug sensitivity testing. RESULTS For the prediction of ALNM, the rad-score was able to significantly differentiate between ALN- and ALN+ patients in both the Duke and Foshan cohorts (P < 0.05). Similarly, the gene-score was able to significantly differentiate between ALN- and ALN+ patients in the TCGA cohort (P < 0.05). The radiogenomic multimodal nomogram demonstrated satisfactory performance in the TCIA cohort (AUC 0.82, 95% CI: 0.74-0.91) and the TCIA validation cohort (AUC 0.77, 95% CI: 0.63-0.91). In the risk sub-stratification analysis, there were significant differences in gene pathway enrichment between high- and low-risk groups (P < 0.05). Additionally, different risk groups may exhibit varying treatment responses (P < 0.05). CONCLUSION Overall, the radiogenomic multimodal model employs multimodal data, including radiological images, genetic, and clinicopathological typing. The radiogenomic multimodal nomogram can precisely predict ALNM and drug therapeutic response in BC patients.
Affiliation(s)
- Jianguo Lai
- Department of Breast Cancer, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Yuexiu District, Guangzhou, Guangdong
| | - Zijun Chen
- The Second Clinical School of Southern Medical University, Guangzhou
| | - Jie Liu
- Department of Breast Cancer, Affiliated Foshan Maternity and Child Healthcare Hospital, Southern Medical University
| | - Chao Zhu
- Department of Blood Transfusion, The First Affiliated Hospital of Nanchang University
| | - Haoxuan Huang
- Department of Urology, Third Affiliated Hospital of Nanchang University, Nanchang, Jiangxi, People’s Republic of China
| | - Ying Yi
- Department of Radiology, The First People's Hospital of Foshan, Foshan, Guangdong
| | - Gengxi Cai
- Department of Breast Surgery, The First People’s Hospital of Foshan, Foshan, Guangdong
| | - Ning Liao
- Department of Breast Cancer, Guangdong Provincial People's Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Yuexiu District, Guangzhou, Guangdong
4
He D, Liu Q, Mi Y, Meng Q, Xu L, Hou C, Wang J, Li N, Liu Y, Chai H, Yang Y, Liu J, Wang L, Hou Y. De Novo Generation and Identification of Novel Compounds with Drug Efficacy Based on Machine Learning. Adv Sci (Weinh) 2024; 11:e2307245. [PMID: 38204214 PMCID: PMC10962488 DOI: 10.1002/advs.202307245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2023] [Revised: 12/05/2023] [Indexed: 01/12/2024]
Abstract
One of the main challenges in small molecule drug discovery is finding novel chemical compounds with desirable activity. Traditional drug development typically begins with target selection, but the correlation between targets and disease remains to be further investigated, and drugs designed based on targets may not always have the desired drug efficacy. The emergence of machine learning provides a powerful tool to overcome the challenge. Herein, a machine learning-based strategy is developed for de novo generation of novel compounds with drug efficacy, termed DTLS (Deep Transfer Learning-based Strategy), by using a dataset of disease-direct-related activity as input. DTLS is applied in two kinds of disease: colorectal cancer (CRC) and Alzheimer's disease (AD). In each case, a novel compound is discovered and identified in in vitro and in vivo disease models. Their mechanism of action is further explored. The experimental results reveal that DTLS can not only realize the generation and identification of novel compounds with drug efficacy but also has the advantage of identifying compounds by focusing on protein targets to facilitate the mechanism study. This work highlights the significant impact of machine learning on the design of novel compounds with drug efficacy, which provides a powerful new approach to drug discovery.
Affiliation(s)
- Dakuo He
- College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
| | - Qing Liu
- College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
| | - Yan Mi
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
| | - Qingqi Meng
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
| | - Libin Xu
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
| | - Chunyu Hou
- College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
| | - Jinpeng Wang
- College of Information Science and Engineering, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
| | - Ning Li
- School of Traditional Chinese Materia Medica, Key Laboratory for TCM Material Basis Study and Innovative Drug Development of Shenyang City, Shenyang Pharmaceutical University, Shenyang 110016, China
| | - Yang Liu
- Key Laboratory of Structure-Based Drug Design & Discovery of Ministry of Education, Shenyang Pharmaceutical University, Shenyang 110016, China
| | - Huifang Chai
- School of Pharmacy, Guizhou University of Traditional Chinese Medicine, Guiyang 550025, China
| | - Yanqiu Yang
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
| | - Jingyu Liu
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
| | - Lihui Wang
- Department of Pharmacology, Shenyang Pharmaceutical University, Shenyang 110016, China
| | - Yue Hou
- Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University, Shenyang 110169, China
- Key Laboratory of Data Analytics and Optimization for Smart Industry, Ministry of Education, Northeastern University, Shenyang 110169, China
5
Shirasawa R, Takaki K, Miyao T. Generalizability Improvement of Interpretable Symbolic Regression Models for Quantitative Structure-Activity Relationships. ACS Omega 2024; 9:9463-9474. [PMID: 38434845 PMCID: PMC10905595 DOI: 10.1021/acsomega.3c09047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Revised: 01/20/2024] [Accepted: 01/26/2024] [Indexed: 03/05/2024]
Abstract
In the pursuit of optimal quantitative structure-activity relationship (QSAR) models, two key factors are paramount: the robustness of predictive ability and the interpretability of the model. Symbolic regression (SR) searches for the mathematical expressions that explain a training data set. Thus, the models provided by SR are globally interpretable. We previously proposed an SR method that can generate expressions interpretable by humans. This study introduces an enhanced symbolic regression method, termed filter-induced genetic programming 2 (FIGP2), as an extension of our previously proposed SR method. FIGP2 is designed to improve the generalizability of SR models and to be applicable to data sets in which cost-intensive descriptors are employed. The FIGP2 method incorporates two major improvements: a modified domain filter to eradicate diverging expressions based on optimal calculation and the introduction of a stability metric to penalize expressions that would lead to overfitting. Our retrospective comparative analysis using 12 structure-activity relationship data sets revealed that FIGP2 surpassed the previously proposed SR method and conventional modeling methods, such as support vector regression and multivariate linear regression, in terms of predictive performance. Mathematical expressions generated by FIGP2 were relatively simple and not divergent in the domain of function. Taken together, FIGP2 can be used for making interpretable regression models with predictive ability.
Affiliation(s)
- Raku Shirasawa
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Advanced Research Laboratory, Technology Infrastructure Center, Technology Platform, Sony Group Corporation, Atsugi Tec., 4-14-1 Asahi-cho, Atsugi-shi, Kanagawa 243-0014, Japan
| | - Katsushi Takaki
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
6
Jasial S, Hu J, Miyao T, Hirama Y, Onishi S, Matsui R, Osaki K, Funatsu K. Screening and Validation of Odorants against Influenza A Virus Using Interpretable Regression Models. ACS Pharmacol Transl Sci 2023; 6:139-150. [PMID: 36654744 PMCID: PMC9841774 DOI: 10.1021/acsptsci.2c00193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2022] [Indexed: 12/23/2022]
Abstract
Influenza is a respiratory infection caused by the influenza virus that is prevalent worldwide. One of the most contagious variants of influenza is influenza A virus (IAV), which usually spreads in closed spaces through aerosols. Preventive measures such as novel compounds are needed that can act on viral membranes and provide a safe environment against IAV infection. In this study, we screened compounds with common fragrances that are generally used to mask unpleasant odors but can also exhibit antiviral activity against a strain of IAV. Initially, a set of 188 structurally diverse odorants were collected, and their antiviral activity was measured in vapor phase against the IAV solution. Regression models were built for the prediction of antiviral activity using this set of odorants by taking into account their structural features along with vapor pressure and partition coefficient (n-octanol/water). The models were interpreted using a feature weighting approach and Shapley Additive exPlanations to rationalize the predictions as an additional validation for virtual screening. This model was used to screen odorants from an in-house odorant data set consisting of 2020 odorants, which were later evaluated using in vitro experiments. Out of 11 odorants proposed using the final model, 8 odorants were found to exhibit antiviral activity. The feature interpretation of screened odorants suggested that they contained hydrophilic substructures, such as hydroxyl group, which might contribute to denaturation of proteins on the surface of the virus. These odorants should be explored as a preventive measure in closed spaces to decrease the risk of infections of IAV.
Affiliation(s)
- Swarit Jasial
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Jieying Hu
- Material Science Research, Kao Corporation, 1334 Minato, Wakayama-shi, Wakayama 640-8580, Japan
| | - Tomoyuki Miyao
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Yui Hirama
- Biological Science Research, Kao Corporation, 2606 Akabane, Ichikai-machi, Haga-gun, Tochigi 321-3426, Japan
| | - Shintaro Onishi
- Biological Science Research, Kao Corporation, 2606 Akabane, Ichikai-machi, Haga-gun, Tochigi 321-3426, Japan
| | - Ryoichi Matsui
- Material Science Research, Kao Corporation, 1334 Minato, Wakayama-shi, Wakayama 640-8580, Japan
| | - Koji Osaki
- Material Science Research, Kao Corporation, 1334 Minato, Wakayama-shi, Wakayama 640-8580, Japan
| | - Kimito Funatsu
- Data Science Center and Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
7
Asahara R, Miyao T. Extended Connectivity Fingerprints as a Chemical Reaction Representation for Enantioselective Organophosphorus-Catalyzed Asymmetric Reaction Prediction. ACS Omega 2022; 7:26952-26964. [PMID: 35936487 PMCID: PMC9352214 DOI: 10.1021/acsomega.2c03812] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/18/2022] [Accepted: 07/07/2022] [Indexed: 06/15/2023]
Abstract
Predicting the outcomes of organic reactions using data-driven approaches aids in the acceleration of research. In laboratory-scale experiments, only a small number of reaction data can be accessed for machine learning model construction, where reaction representations play a pivotal role in the success of model construction. Nevertheless, representation comparison for a small data set is not adequate. Herein, focusing on the enantioselectivity of phosphoric-acid-catalyzed reactions, various two-dimensional and three-dimensional reaction representations (descriptors) were compared. Overall, the concatenated form of the extended connectivity fingerprints showed the best predictive capability for the two types of data sets: high-throughput experimental data and manually collected literature data sets. Furthermore, highlighting the substructure contribution to the prediction outcome was shown to be informative for guiding catalyst development.
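The "concatenated form" of fingerprints mentioned in this abstract can be pictured as one fixed-length bit vector per reaction component, joined end to end in a fixed order. The sketch below illustrates only that concatenation step, under stated assumptions: the component names (`substrate`, `catalyst`), the tiny vector length, and the hand-written on-bit indices are hypothetical stand-ins, and no real fingerprint algorithm is computed here.

```python
def to_bit_vector(on_bits, n_bits):
    """Turn a collection of on-bit indices into a 0/1 list of length n_bits."""
    vec = [0] * n_bits
    for idx in on_bits:
        vec[idx % n_bits] = 1  # fold out-of-range indices into the vector
    return vec


def concat_reaction_fp(components, order, n_bits=16):
    """Concatenate per-component bit vectors in a fixed component order.

    components: dict mapping a component name to its on-bit indices.
    order: sequence of component names; fixes the layout of the result.
    """
    reaction_fp = []
    for name in order:
        reaction_fp.extend(to_bit_vector(components[name], n_bits))
    return reaction_fp


# Hypothetical reaction with two components and 4-bit toy fingerprints.
toy_reaction = {"substrate": {1, 3}, "catalyst": {0}}
toy_fp = concat_reaction_fp(toy_reaction, order=("substrate", "catalyst"), n_bits=4)
print(toy_fp)
```

Keeping the component order fixed matters because each block of the concatenated vector is then tied to one role in the reaction, which is what allows a bit to be traced back to a substructure of a specific component.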
Affiliation(s)
- Ryosuke Asahara
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
8
Feldmann C, Bajorath J. Calculation of Exact Shapley Values for Support Vector Machines with Tanimoto Kernel Enables Model Interpretation. iScience 2022; 25:105023. [PMID: 36105596 PMCID: PMC9464958 DOI: 10.1016/j.isci.2022.105023] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 07/03/2022] [Revised: 08/09/2022] [Accepted: 08/20/2022] [Indexed: 11/24/2022]
Abstract
The support vector machine (SVM) algorithm is popular in chemistry and drug discovery. SVM models have black box character. Their predictions can be interpreted through feature weighting or the model-agnostic Shapley additive explanations (SHAP) formalism that locally approximates Shapley values (SVs) originating from game theory. We introduce an algorithm termed SV-expressed Tanimoto similarity (SVETA) for the exact calculation of SVs to explain SVM models employing the Tanimoto kernel, the gold standard for the assessment of molecular similarity. For a model system, the exact calculation of SVs is demonstrated. In an SVM-based compound classification task from drug discovery, only a limited correlation between exact SV and SHAP values is observed, prohibiting the use of approximate values for rationalizing predictions. For exemplary test compounds, atom-based mapping of prioritized features delineates coherent substructures that closely resemble those obtained by analyzing independently derived random forest models, thus providing consistent explanations.
- SVETA: new methodology for explaining support vector machine (SVM) predictions
- Tanimoto similarity-based SVM models are popular in chemistry
- SVETA enables the calculation of exact Shapley values for rationalizing SVM models
- SVETA-based feature mapping provides intuitive explanations of SVM decisions
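To make the two ingredients named in this abstract concrete, the sketch below computes the Tanimoto similarity of two binary feature sets and then obtains exact Shapley values for the features by enumerating every coalition. This is a brute-force illustration under stated assumptions, not the SVETA algorithm: the value function (Tanimoto similarity restricted to the features in a coalition), the helper names, and the toy feature sets are all hypothetical, and full enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two feature sets; 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def exact_shapley(players, value):
    """Exact Shapley value of each player for a coalition value function.

    players: list of hashable player identifiers (here, feature indices).
    value: function mapping a frozenset of players to a float.
    Enumerates all coalitions, so the cost grows as 2**len(players).
    """
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain player i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


# Toy example (hypothetical): on-bits of a test compound and a reference compound.
x_bits = {0, 1, 2}
ref_bits = {1, 2, 3}
feature_ids = sorted(x_bits | ref_bits)


def coalition_similarity(coalition):
    """Similarity when only the features in the coalition are visible."""
    return tanimoto(x_bits & coalition, ref_bits & coalition)


phi = exact_shapley(feature_ids, coalition_similarity)
for feature, contribution in phi.items():
    print(feature, round(contribution, 4))
```

In this toy case the shared features (1 and 2) receive positive contributions and the unshared ones (0 and 3) negative contributions, and the contributions sum to the full Tanimoto similarity, which is the efficiency property that makes Shapley values attractive for attribution.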
9
Rodríguez-Pérez R, Miljković F, Bajorath J. Machine Learning in Chemoinformatics and Medicinal Chemistry. Annu Rev Biomed Data Sci 2022; 5:43-65. [PMID: 35440144 DOI: 10.1146/annurev-biodatasci-122120-124216] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/09/2022]
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany. Current affiliation: Novartis Institutes for Biomedical Research, Novartis Campus, Basel, Switzerland
| | - Filip Miljković
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany. Current affiliation: Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology and Safety Sciences, R&D AstraZeneca, Gothenburg, Sweden
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
10
Rodríguez-Pérez R, Bajorath J. Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. J Comput Aided Mol Des 2022; 36:355-362. [PMID: 35304657 PMCID: PMC9325859 DOI: 10.1007/s10822-022-00442-9] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 02/08/2022] [Accepted: 02/15/2022] [Indexed: 11/05/2022]
Abstract
The support vector machine (SVM) algorithm is one of the most widely used machine learning (ML) methods for predicting active compounds and molecular properties. In chemoinformatics and drug discovery, SVM has been a state-of-the-art ML approach for more than a decade. A unique attribute of SVM is that it operates in feature spaces of increasing dimensionality. Hence, SVM conceptually departs from the paradigm of low dimensionality that applies to many other methods for chemical space navigation. The SVM approach is applicable to compound classification and ranking, multi-class predictions, and (in algorithmically modified form) regression modeling. In the emerging era of deep learning (DL), SVM retains its relevance as one of the premier ML methods in chemoinformatics, for reasons discussed herein. We describe the SVM methodology including strengths and weaknesses and discuss selected applications that have contributed to the evolution of SVM as a premier approach for compound classification, property predictions, and virtual compound screening.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany; Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany; Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
11
Rodríguez-Pérez R, Bajorath J. Explainable Machine Learning for Property Predictions in Compound Optimization. J Med Chem 2021; 64:17744-17752. [PMID: 34902252 DOI: 10.1021/acs.jmedchem.1c01789] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 11/28/2022]
Abstract
The prediction of compound properties from chemical structure is a main task for machine learning (ML) in medicinal chemistry. ML is often applied to large data sets in applications such as compound screening, virtual library enumeration, or generative chemistry. Albeit desirable, a detailed understanding of ML model decisions is typically not required in these cases. By contrast, compound optimization efforts rely on small data sets to identify structural modifications leading to desired property profiles. In this situation, if ML is applied, one usually is reluctant to make decisions based on predictions that cannot be rationalized. Only few ML methods are interpretable. However, to yield insights into complex ML model decisions, explanatory approaches can be applied. Herein, methodologies for better understanding of ML models or explaining individual predictions are reviewed and current challenges in integrating ML into medicinal chemistry programs as well as future opportunities are discussed.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany; Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
|
12
|
Feldmann C, Philipps M, Bajorath J. Explainable machine learning predictions of dual-target compounds reveal characteristic structural features. Sci Rep 2021; 11:21594. [PMID: 34732806 PMCID: PMC8566526 DOI: 10.1038/s41598-021-01099-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 09/17/2021] [Accepted: 10/22/2021] [Indexed: 11/15/2022] Open
Abstract
Compounds with defined multi-target activity play an increasingly important role in drug discovery. Structural features that might be signatures of such compounds have mostly remained elusive thus far. We have explored the potential of explainable machine learning to uncover structural motifs that are characteristic of dual-target compounds. For a pharmacologically relevant target pair-based test system designed for our study, accurate prediction models were derived and the influence of molecular representation features of test compounds was quantified to explain the predictions. The analysis revealed small numbers of specific features whose presence in dual-target and absence in single-target compounds determined accurate predictions. These features formed coherent substructures in dual-target compounds. From computational analysis of specific feature contributions, structural motifs emerged that were confirmed to be signatures of different dual-target activities. Our findings demonstrate the ability of explainable machine learning to bridge between predictions and intuitive chemical analysis and reveal characteristic substructures of dual-target compounds.
Affiliation(s)
- Christian Feldmann
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
- Maren Philipps
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
|
13
|
Tamura S, Jasial S, Miyao T, Funatsu K. Interpretation of Ligand-Based Activity Cliff Prediction Models Using the Matched Molecular Pair Kernel. Molecules 2021; 26:4916. [PMID: 34443503 PMCID: PMC8401777 DOI: 10.3390/molecules26164916] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/16/2021] [Revised: 08/09/2021] [Accepted: 08/10/2021] [Indexed: 11/16/2022] Open
Abstract
Activity cliffs (ACs) are formed by two structurally similar compounds with a large difference in potency. Accurate AC prediction is expected to help researchers' decisions in the early stages of drug discovery. Previously, predictive models based on matched molecular pair (MMP) cliffs have been proposed. However, the proposed methods face a challenge of interpretability due to the black-box character of the predictive models. In this study, we developed interpretable MMP fingerprints and modified a model-specific interpretation approach for models based on a support vector machine (SVM) and MMP kernel. We compared important features highlighted by this SVM-based interpretation approach and the SHapley Additive exPlanations (SHAP) as a major model-independent approach. The model-specific approach could capture the difference between AC and non-AC, while SHAP assigned high weights to the features not present in the test instances. For specific MMPs, the feature weights mapped by the SVM-based interpretation method were in agreement with the previously confirmed binding knowledge from X-ray co-crystal structures, indicating that this method is able to interpret the AC prediction model in a chemically intuitive manner.
Affiliation(s)
- Shunsuke Tamura
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Swarit Jasial
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Kimito Funatsu
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Correspondence: Tel.: +81-354-400-396; Fax: +81-743-726-037
|
14
|
Rodríguez-Pérez R, Bajorath J. Feature importance correlation from machine learning indicates functional relationships between proteins and similar compound binding characteristics. Sci Rep 2021; 11:14245. [PMID: 34244588 PMCID: PMC8270985 DOI: 10.1038/s41598-021-93771-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/15/2021] [Accepted: 06/30/2021] [Indexed: 11/08/2022] Open
Abstract
Machine learning is widely applied in drug discovery research to predict molecular properties and aid in the identification of active compounds. Herein, we introduce a new approach that uses model-internal information from compound activity predictions to uncover relationships between target proteins. On the basis of a large-scale analysis generating and comparing machine learning models for more than 200 proteins, feature importance correlation analysis is shown to detect similar compound binding characteristics. Furthermore, rather unexpectedly, the analysis also reveals functional relationships between proteins that are independent of active compounds and binding characteristics. Feature importance correlation analysis does not depend on specific representations, algorithms, or metrics and is generally applicable as long as predictive models can be derived. Moreover, the approach does not require or involve explainable or interpretable machine learning, but only access to feature weights or importance values. On the basis of our findings, the approach represents a new facet of machine learning in drug discovery with potential for practical applications.
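The feature importance correlation idea summarized in this abstract can be illustrated with a small sketch. It assumes each target's model exposes one importance value per shared fingerprint feature and compares targets by Pearson correlation; the target names, importance values, and the choice of Pearson correlation are illustrative assumptions, not details taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlation_matrix(importance):
    """Pairwise correlation of per-target feature importance vectors."""
    names = sorted(importance)
    return {
        (a, b): pearson(importance[a], importance[b])
        for a in names
        for b in names
    }


# Hypothetical importance values over the same five fingerprint features.
importance = {
    "target_A": [0.40, 0.30, 0.15, 0.10, 0.05],
    "target_B": [0.38, 0.32, 0.14, 0.11, 0.05],
    "target_C": [0.05, 0.10, 0.15, 0.30, 0.40],
}

corr = correlation_matrix(importance)
for pair, value in sorted(corr.items()):
    print(pair, round(value, 3))
```

In this toy data, targets A and B rank features almost identically and correlate strongly, while target C weights the features in the opposite order and correlates negatively with A.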
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
- Novartis Institutes for Biomedical Research, Novartis Campus, 4002, Basel, Switzerland
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
|
15
|
|
16
|
Feldmann C, Bajorath J. Machine learning reveals that structural features distinguishing promiscuous and non-promiscuous compounds depend on target combinations. Sci Rep 2021; 11:7863. [PMID: 33846469 PMCID: PMC8042106 DOI: 10.1038/s41598-021-87042-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/17/2021] [Accepted: 03/23/2021] [Indexed: 12/20/2022] Open
Abstract
Compounds with defined multi-target activity (promiscuity) play an increasingly important role in drug discovery. However, the molecular basis of multi-target activity is currently only little understood. In particular, it remains unclear whether structural features exist that generally characterize promiscuous compounds and set them apart from compounds with single-target activity. We have devised a test system using machine learning to systematically examine structural features that might characterize compounds with multi-target activity. Using this system, more than 860,000 diagnostic predictions were carried out. The analysis provided compelling evidence for the presence of structural characteristics of promiscuous compounds that were dependent on given target combinations, but not generalizable. Feature weighting and mapping identified characteristic substructures in test compounds. Taken together, these findings are relevant for the design of compounds with desired multi-target activity.
Affiliation(s)
- Christian Feldmann
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
- Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, 53115, Bonn, Germany
|
17
|
Galati S, Yonchev D, Rodríguez-Pérez R, Vogt M, Tuccinardi T, Bajorath J. Predicting Isoform-Selective Carbonic Anhydrase Inhibitors via Machine Learning and Rationalizing Structural Features Important for Selectivity. ACS Omega 2021; 6:4080-4089. [PMID: 33585783 PMCID: PMC7876851 DOI: 10.1021/acsomega.0c06153] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/17/2020] [Accepted: 01/14/2021] [Indexed: 05/03/2023]
Abstract
Carbonic anhydrases (CAs) catalyze the physiological hydration of carbon dioxide and are among the most intensely studied pharmaceutical target enzymes. A hallmark of CA inhibition is the complexation of the catalytic zinc cation in the active site. Human (h) CA isoforms belonging to different families are implicated in a wide range of diseases and of very high interest for therapeutic intervention. Given the conserved catalytic mechanisms and high similarity of many hCA isoforms, a major challenge for CA-based therapy is achieving inhibitor selectivity for hCA isoforms that are associated with specific pathologies over other widely distributed isoforms such as hCA I or hCA II that are of critical relevance for the integrity of many physiological processes. To address this challenge, we have attempted to predict compounds that are selective for isoform hCA IX, which is a tumor-associated protein and implicated in metastasis, over hCA II on the basis of a carefully curated data set of selective and nonselective inhibitors. Machine learning achieved surprisingly high accuracy in predicting hCA IX-selective inhibitors. The results were further investigated, and compound features determining successful predictions were identified. These features were then studied on the basis of X-ray structures of hCA isoform-inhibitor complexes and found to include substructures that explain compound selectivity. Our findings lend credence to selectivity predictions and indicate that the machine learning models derived herein have considerable potential to aid in the identification of new hCA IX-selective compounds.
Affiliation(s)
- Salvatore Galati
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Dimitar Yonchev
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Martin Vogt
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Tiziano Tuccinardi
- Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Phone: 39-050-2219595
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Phone: 49-228-7369-100
|
18
|
Shibayama S, Funatsu K. Industrial Case Study: Identification of Important Substructures and Exploration of Monomers for the Rapid Design of Novel Network Polymers with Distributed Representation. Bulletin of the Chemical Society of Japan 2021. [DOI: 10.1246/bcsj.20200220] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Affiliation(s)
- Shojiro Shibayama
- Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- Kimito Funatsu
- Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
|
19
|
Interpretation of machine learning models using Shapley values: application to compound potency and multi-target activity predictions. J Comput Aided Mol Des 2020; 34:1013-1026. [PMID: 32361862 PMCID: PMC7449951 DOI: 10.1007/s10822-020-00314-0] [Citation(s) in RCA: 146] [Impact Index Per Article: 36.5] [Received: 03/06/2020] [Accepted: 04/24/2020] [Indexed: 02/07/2023]
Abstract
Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of and confidence in ML in pharmaceutical research. There is a need for agnostic approaches aiding in the interpretation of ML models regardless of their complexity that is also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for exact calculation of Shapley values for decision tree methods and systematically compare this variant in compound activity and potency value predictions with the model-independent SHAP method. Moreover, new applications of the SHAP analysis approach are presented including interpretation of DNN models for the generation of multi-target activity profiles and ensemble regression models for potency prediction.
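To make the notion of a Shapley value concrete, the sketch below computes exact values by brute-force enumeration of all feature coalitions for a three-feature toy model. This is neither the tree-specific exact variant nor the model-independent SHAP approximation discussed in the abstract; the toy model, the instance `x`, and the all-zero baseline are assumptions chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a coalition value function value(frozenset)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (
                factorial(size)
                * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


# Hypothetical instance and baseline; absent features take the baseline value.
x = (1.0, 1.0, 1.0)
baseline = (0.0, 0.0, 0.0)


def toy_model(v):
    """Toy prediction function with one interaction term (illustrative)."""
    return 2.0 * v[0] + 1.0 * v[1] + 3.0 * v[0] * v[2]


def value(coalition):
    """Model output when only the features in `coalition` are present."""
    masked = [x[i] if i in coalition else baseline[i] for i in range(len(x))]
    return toy_model(masked)


phi = shapley_values(value, n_features=3)
print(phi)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split evenly between features 0 and 2. Enumeration scales exponentially with the number of features, which is why practical SHAP variants rely on approximations or model structure.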
|
20
|
Data structures for computational compound promiscuity analysis and exemplary applications to inhibitors of the human kinome. J Comput Aided Mol Des 2019; 34:1-10. [DOI: 10.1007/s10822-019-00266-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/19/2019] [Accepted: 11/26/2019] [Indexed: 02/05/2023]
|
21
|
Rodríguez-Pérez R, Bajorath J. Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values. J Med Chem 2019; 63:8761-8777. [PMID: 31512867 DOI: 10.1021/acs.jmedchem.9b01101] [Citation(s) in RCA: 128] [Impact Index Per Article: 25.6] [Indexed: 01/27/2023]
Abstract
In qualitative or quantitative studies of structure-activity relationships (SARs), machine learning (ML) models are trained to recognize structural patterns that differentiate between active and inactive compounds. Understanding model decisions is challenging but of critical importance to guide compound design. Moreover, the interpretation of ML results provides an additional level of model validation based on expert knowledge. A number of complex ML approaches, especially deep learning (DL) architectures, have distinctive black-box character. Herein, a locally interpretable explanatory method termed Shapley additive explanations (SHAP) is introduced for rationalizing activity predictions of any ML algorithm, regardless of its complexity. Models resulting from random forest (RF), nonlinear support vector machine (SVM), and deep neural network (DNN) learning are interpreted, and structural patterns determining the predicted probability of activity are identified and mapped onto test compounds. The results indicate that SHAP has high potential for rationalizing predictions of complex ML models.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany; Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riß, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
|
22
|
Walker E, Kammeraad J, Goetz J, Robo MT, Tewari A, Zimmerman PM. Learning To Predict Reaction Conditions: Relationships between Solvent, Molecular Structure, and Catalyst. J Chem Inf Model 2019; 59:3645-3654. [PMID: 31381340 DOI: 10.1021/acs.jcim.9b00313] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Indexed: 11/29/2022]
Abstract
Reaction databases provide a great deal of useful information to assist planning of experiments but do not provide any interpretation or chemical concepts to accompany this information. In this work, reactions are labeled with experimental conditions, and network analysis shows that consistencies within clusters of data points can be leveraged to organize this information. In particular, this analysis shows how particular experimental conditions (specifically solvent) are effective in enabling specific organic reactions (Friedel-Crafts, Aldol addition, Claisen condensation, Diels-Alder, and Wittig), including variations within each reaction class. Network analysis shows data points for reactions tend to break into clusters that depend on the catalyst and chemical structure. This type of clustering, which mimics how a chemist reasons, is derived directly from the network. Therefore, the findings of this work could augment synthesis planning by providing predictions in a fashion that mimics human chemists. To numerically evaluate solvent prediction ability, three methods are compared: network analysis (through the k-nearest neighbor algorithm), a support vector machine, and a deep neural network. The most accurate method in 4 of the 5 test cases is the network analysis, with deep neural networks also showing good prediction scores. The network analysis tool was evaluated by an expert panel of chemists, who generally agreed that the algorithm produced accurate solvent choices while simultaneously being transparent in the underlying reasons for its predictions.
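The k-nearest neighbor step mentioned in this abstract can be sketched as a majority vote over the closest labeled reactions. The two-dimensional feature vectors, the solvent labels, and the use of Euclidean distance below are hypothetical stand-ins; the abstract does not specify how reactions are encoded or compared.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k nearest training points.

    `train` is a list of (feature_vector, label) pairs. The neighbors are
    returned as well, so the basis for the prediction can be inspected.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the training set size")
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors


# Hypothetical reaction descriptors paired with the solvent that was used.
train = [
    ((0.0, 0.0), "THF"),
    ((0.1, 0.2), "THF"),
    ((0.2, 0.1), "THF"),
    ((1.0, 1.0), "DCM"),
    ((0.9, 1.1), "DCM"),
    ((1.1, 0.9), "DCM"),
]

label, neighbors = knn_predict(train, query=(0.15, 0.10), k=3)
print(label)
for features, solvent in neighbors:
    print(features, solvent)
```

Returning the neighbors alongside the label mirrors the transparency the abstract highlights: a chemist can see which known reactions drove the suggestion.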
Affiliation(s)
- Eric Walker
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
- Joshua Kammeraad
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
- Jonathan Goetz
- Department of Statistics, University of Michigan, 1085 South University Avenue, Ann Arbor, Michigan 48109, United States
- Michael T Robo
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
- Ambuj Tewari
- Department of Statistics, University of Michigan, 1085 South University Avenue, Ann Arbor, Michigan 48109, United States
- Paul M Zimmerman
- Department of Chemistry, University of Michigan, 930 North University Avenue, Ann Arbor, Michigan 48109, United States
|
23
|
Yang X, Wang Y, Byrne R, Schneider G, Yang S. Concepts of Artificial Intelligence for Computer-Assisted Drug Discovery. Chem Rev 2019; 119:10520-10594. [PMID: 31294972 DOI: 10.1021/acs.chemrev.8b00728] [Citation(s) in RCA: 343] [Impact Index Per Article: 68.6] [Indexed: 02/05/2023]
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state-of-the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
Affiliation(s)
- Xin Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Yifei Wang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Ryan Byrne
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Gisbert Schneider
- ETH Zurich, Department of Chemistry and Applied Biosciences, Vladimir-Prelog-Weg 4, CH-8093 Zurich, Switzerland
- Shengyong Yang
- State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
|
24
|
Maltarollo VG, Kronenberger T, Espinoza GZ, Oliveira PR, Honorio KM. Advances with support vector machines for novel drug discovery. Expert Opin Drug Discov 2018; 14:23-33. [PMID: 30488731 DOI: 10.1080/17460441.2019.1549033] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Indexed: 12/31/2022]
Abstract
Introduction: Novel drug discovery remains an enormous challenge, with various computer-aided drug design (CADD) approaches having been widely employed for this purpose. CADD, specifically the commonly used support vector machines (SVMs), can employ machine learning techniques. SVMs and their variations offer numerous drug discovery applications, which range from the classification of substances (as active or inactive) to the construction of regression models and the ranking/virtual screening of databased compounds. Areas covered: Herein, the authors consider some of the applications of SVMs in medicinal chemistry, illustrating their main advantages and disadvantages, as well as trends in their utilization, via the available published literature. The aim of this review is to provide an up-to-date review of the recent applications of SVMs in drug discovery as described by the literature, thereby highlighting their strengths, weaknesses, and future challenges. Expert opinion: Techniques based on SVMs are considered as powerful approaches in early drug discovery. The ability of SVMs to classify active or inactive compounds has enabled the prioritization of substances for virtual screening. Indeed, one of the main advantages of SVMs is related to their potential in the analysis of nonlinear problems. However, despite successes in employing SVMs, the challenges of improving accuracy remain.
Affiliation(s)
- Vinicius Gonçalves Maltarollo
- Departamento de Produtos Farmacêuticos, Faculdade de Farmácia, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
- Thales Kronenberger
- Department of Internal Medicine VIII, University Hospital of Tübingen, Tübingen, Germany
- Gabriel Zarzana Espinoza
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo (USP), São Paulo, Brazil
- Patricia Rufino Oliveira
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo (USP), São Paulo, Brazil
- Kathia Maria Honorio
- Escola de Artes, Ciências e Humanidades, Universidade de São Paulo (USP), São Paulo, Brazil; Centro de Ciências Naturais e Humanas, Universidade Federal do ABC, Santo André, Brazil
|
25
|
Jasial S, Gilberg E, Blaschke T, Bajorath J. Machine Learning Distinguishes with High Accuracy between Pan-Assay Interference Compounds That Are Promiscuous or Represent Dark Chemical Matter. J Med Chem 2018; 61:10255-10264. [DOI: 10.1021/acs.jmedchem.8b01404] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 12/28/2022]
Affiliation(s)
- Swarit Jasial
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Endenicher Allee 19c, Rheinische Friedrich-Wilhelms-Universität, D-53115 Bonn, Germany
- Erik Gilberg
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Endenicher Allee 19c, Rheinische Friedrich-Wilhelms-Universität, D-53115 Bonn, Germany
- Thomas Blaschke
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Endenicher Allee 19c, Rheinische Friedrich-Wilhelms-Universität, D-53115 Bonn, Germany
- Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Endenicher Allee 19c, Rheinische Friedrich-Wilhelms-Universität, D-53115 Bonn, Germany
|
26
|
Rodríguez-Pérez R, Vogt M, Bajorath J. Support Vector Machine Classification and Regression Prioritize Different Structural Features for Binary Compound Activity and Potency Value Prediction. ACS Omega 2017; 2:6371-6379. [PMID: 30023518 PMCID: PMC6045367 DOI: 10.1021/acsomega.7b01079] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 07/27/2017] [Accepted: 09/22/2017] [Indexed: 05/15/2023]
Abstract
In computational chemistry and chemoinformatics, the support vector machine (SVM) algorithm is among the most widely used machine learning methods for the identification of new active compounds. In addition, support vector regression (SVR) has become a preferred approach for modeling nonlinear structure-activity relationships and predicting compound potency values. For the closely related SVM and SVR methods, fingerprints (i.e., bit string or feature set representations of chemical structure and properties) are generally preferred descriptors. Herein, we have compared SVM and SVR calculations for the same compound data sets to evaluate which features are responsible for predictions. On the basis of systematic feature weight analysis, rather surprising results were obtained. Fingerprint features were frequently identified that contributed differently to the corresponding SVM and SVR models. The overlap between feature sets determining the predictive performance of SVM and SVR was only very small. Furthermore, features were identified that had opposite effects on SVM and SVR predictions. Feature weight analysis in combination with feature mapping made it also possible to interpret individual predictions, thus balancing the black box character of SVM/SVR modeling.
|
27
|
Rensi SE, Altman RB. Shallow Representation Learning via Kernel PCA Improves QSAR Modelability. J Chem Inf Model 2017; 57:1859-1867. [PMID: 28727421 DOI: 10.1021/acs.jcim.6b00694] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 01/03/2023]
Abstract
Linear models offer a robust, flexible, and computationally efficient set of tools for modeling quantitative structure-activity relationships (QSARs) but have been eclipsed in performance by nonlinear methods. Support vector machines (SVMs) and neural networks are currently among the most popular and accurate QSAR methods because they learn new representations of the data that greatly improve modelability. In this work, we use shallow representation learning to improve the accuracy of L1 regularized logistic regression (LASSO) and meet the performance of Tanimoto SVM. We embedded chemical fingerprints in Euclidean space using Tanimoto (a.k.a. Jaccard) similarity kernel principal component analysis (KPCA) and compared the effects on LASSO and SVM model performance for predicting the binding activities of chemical compounds against 102 virtual screening targets. We observed similar performance and patterns of improvement for LASSO and SVM. We also empirically measured model training and cross-validation times to show that KPCA used in concert with LASSO classification is significantly faster than linear SVM over a wide range of training set sizes. Our work shows that powerful linear QSAR methods can match nonlinear methods and demonstrates a modular approach to nonlinear classification that greatly enhances QSAR model prototyping facility, flexibility, and transferability.
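As a rough illustration of the Tanimoto (Jaccard) similarity kernel this abstract builds on, the sketch below computes a kernel matrix for fingerprints stored as sets of on-bit indices and then double-centers it, which is the usual preprocessing before the eigendecomposition in kernel PCA. The fingerprints are made up, the eigendecomposition itself is omitted, and treating two empty fingerprints as identical is a convention assumed here.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on bits."""
    union = len(a | b)
    # Assumed convention: two empty fingerprints count as identical.
    return len(a & b) / union if union else 1.0


def kernel_matrix(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities."""
    return [[tanimoto(a, b) for b in fingerprints] for a in fingerprints]


def center_kernel(k):
    """Double-center a symmetric kernel matrix (zero mean in feature space)."""
    n = len(k)
    row_mean = [sum(row) / n for row in k]
    total_mean = sum(row_mean) / n
    return [
        [k[i][j] - row_mean[i] - row_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


# Hypothetical fingerprints: each set lists the indices of the bits that are on.
fingerprints = [
    frozenset({1, 2, 3}),
    frozenset({2, 3, 4}),
    frozenset({7, 8}),
]

K = kernel_matrix(fingerprints)
K_centered = center_kernel(K)
for row in K:
    print([round(v, 3) for v in row])
```

The leading eigenvectors of the centered matrix would give the Euclidean embedding that a linear model such as LASSO could then consume.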
Affiliation(s)
- Stefano E Rensi
- Department of Bioengineering, Stanford University, Shriram Center, Room 213, 443 Via Ortega MC 4245, Stanford, California 94305, United States
- Russ B Altman
- Department of Bioengineering, Stanford University, Shriram Center, Room 213, 443 Via Ortega MC 4245, Stanford, California 94305, United States
|
28
|
Marchese Robinson RL, Palczewska A, Palczewski J, Kidley N. Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Data Sets. J Chem Inf Model 2017; 57:1773-1792. [PMID: 28715209 DOI: 10.1021/acs.jcim.6b00753] [Citation(s) in RCA: 59] [Impact Index Per Article: 8.4] [Indexed: 11/29/2022]
Abstract
The ability to interpret the predictions made by quantitative structure-activity relationships (QSARs) offers a number of advantages. While QSARs built using nonlinear modeling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modeling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting nonlinear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to those of two widely used linear modeling approaches: linear Support Vector Machines (SVMs) (or Support Vector Regression (SVR)) and partial least-squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions using novel scoring schemes for assessing heat map images of substructural contributions. We critically assess different approaches for interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed public-domain benchmark data sets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modeling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpretation of nonlinear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using open-source programs that we have made available to the community. 
These programs are the rfFC package (https://r-forge.r-project.org/R/?group_id=1725) for the R statistical programming language and the Python program HeatMapWrapper (https://doi.org/10.5281/zenodo.495163) for heat map generation.
Affiliation(s)
- Richard L Marchese Robinson
  - Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom; School of Pharmacy and Biomolecular Sciences, Liverpool John Moores University, James Parsons Building, Byrom Street, Liverpool L3 3AF, United Kingdom
- Anna Palczewska
  - Department of Computing, University of Bradford, Bradford BD7 1DP, United Kingdom
- Jan Palczewski
  - School of Mathematics, University of Leeds, Leeds LS2 9JT, United Kingdom
- Nathan Kidley
  - Syngenta Ltd., Jealott's Hill International Research Centre, Bracknell, Berkshire RG42 6EY, United Kingdom
|
29
|
Yuan H, Chen CN, Li MY, Cao CZ. Recognition of nucleophilic substitution reaction mechanisms of carboxylic esters based on support vector machine. J Phys Org Chem 2016. [DOI: 10.1002/poc.3658] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/28/2022]
Affiliation(s)
- Hua Yuan
  - Key Laboratory of Theoretical Organic Chemistry and Functional Molecule, Ministry of Education, Key Laboratory of QSAR/QSPR of Hunan Provincial University, School of Chemistry and Chemical Engineering, Hunan University of Science and Technology, Xiangtan, China
|