1
Siddiqui H, Usmani T. Interpretable AI and Machine Learning Classification for Identifying High-Efficiency Donor-Acceptor Pairs in Organic Solar Cells. ACS Omega 2024; 9:34445-34455. [PMID: 39157121 PMCID: PMC11325493 DOI: 10.1021/acsomega.4c02157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 06/08/2024] [Accepted: 06/13/2024] [Indexed: 08/20/2024]
Abstract
To enhance the efficiency of organic solar cells, accurately predicting the efficiency of new pairs of donor and acceptor materials is crucial. Presently, most machine learning studies rely on regression models, which often struggle to establish clear rules for distinguishing between high- and low-performing donor-acceptor pairs. This study proposes a novel approach by integrating interpretable AI, specifically using Shapley values, with four supervised machine learning classification models, namely, support vector machines, decision trees, random forest, and gradient boosting. These models aim to identify high-efficiency donor-acceptor pairs based solely on chemical structures and to extract important features that establish general design principles for distinguishing between high- and low-efficiency pairs. For validation purposes, an unsupervised machine learning algorithm utilizing loading vectors obtained from the principal component analysis is employed to identify crucial features associated with high-efficiency donor-acceptor pairs. Interestingly, the features identified by the supervised machine learning approach were found to be a subset of those identified by the unsupervised method. Noteworthy features include the van der Waals surface area, partial equalization of orbital electronegativity, Moreau-Broto autocorrelation, and molecular substructures. Leveraging these features, a backward-working model can be developed, facilitating exploration across a wide array of materials used in organic solar cells. This innovative approach will help navigate the vast chemical compound space of donor and acceptor materials essential in creating high-efficiency organic solar cells.
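For readers unfamiliar with Shapley values, the following is a minimal sketch of how a per-feature attribution can be computed exactly for a tiny model. It is not the authors' pipeline: the function names (`shapley_values`, `toy_value`), the three-feature linear "model", and all numbers are made up for illustration, and absent features are assumed to be fixed to a baseline of zero.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function over n_features players."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Shapley weight of any coalition of this size that excludes i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model: a linear score over three descriptors.
WEIGHTS = [2.0, -1.0, 0.5]  # made-up coefficients, not from the paper
X = [1.0, 3.0, 4.0]         # made-up descriptor values for one pair


def toy_value(present):
    """Model output when only the features in `present` are 'switched on'."""
    return sum(WEIGHTS[j] * X[j] for j in present)


if __name__ == "__main__":
    print(shapley_values(toy_value, len(X)))
```

For an additive model like this one, each attribution reduces to the feature's own term (weight times value), and the attributions sum to the full model output minus the baseline output. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.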
Affiliation(s)
- Hamza Siddiqui
  - Organic PV Lab, Integral University, Lucknow 226026, India
- Tahsin Usmani
  - Organic PV Lab, Integral University, Lucknow 226026, India
2
Huang Z, Lou S, Wang H, Li W, Liu G, Tang Y. AttentiveSkin: To Predict Skin Corrosion/Irritation Potentials of Chemicals via Explainable Machine Learning Methods. Chem Res Toxicol 2024; 37:361-373. [PMID: 38294881 DOI: 10.1021/acs.chemrestox.3c00332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2024]
Abstract
Skin Corrosion/Irritation (Corr./Irrit.) has long been a health hazard in the Globally Harmonized System (GHS). Several in silico models have been built to predict Skin Corr./Irrit. as an alternative to the increasingly restricted animal testing. However, current studies are limited by data amount/quality and model availability. To address these issues, we compiled a traceable consensus GHS data set comprising 731 Corr., 1283 Irrit., and 1205 negative (Neg.) samples from 6 governmental databases and 2 external data sets. Then, a series of binary classifiers were developed with five machine learning (ML) algorithms and six molecular representations. For 10-fold cross-validation, the best Corr. vs Neg. classifier achieved an Area Under the Receiver Operating Characteristic Curve (AUC) of 97.1%, while the best Irrit. vs Neg. classifier achieved an AUC of 84.7%. Compared with existing in silico tools on external validation, our Attentive FP classifiers showed the highest metrics on Corr. vs Neg. and the second highest accuracy on Irrit. vs Neg. The SHapley Additive exPlanation approach was further applied to figure out important molecular features, and the attention weights were visualized to perform interpretable prediction. Structural alerts associated with Skin Corr./Irrit. were also identified. The interpretable Attentive FP classifiers were integrated into the software AttentiveSkin at https://github.com/BeeBeeWong/AttentiveSkin. The conventional ML classifiers are also provided on our platform admetSAR at http://lmmd.ecust.edu.cn/admetsar2/. Considering the data deficiency and the limited model availability of Skin Corr./Irrit., we believe that our data set and models could facilitate chemical safety assessment and relevant studies.
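The AUC figures quoted above can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes AUC directly from that definition; the function name `roc_auc` and the four-sample data are illustrative assumptions and have nothing to do with the AttentiveSkin data set.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels (1 = corrosive, 0 = negative) and classifier scores.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 3 of 4 pairs are ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; library implementations typically sort the scores once instead.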
Affiliation(s)
- Zejun Huang
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Shang Lou
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Haoqiang Wang
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Weihua Li
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Guixia Liu
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Yun Tang
  - Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
3
Arturi K, Hollender J. Machine Learning-Based Hazard-Driven Prioritization of Features in Nontarget Screening of Environmental High-Resolution Mass Spectrometry Data. Environmental Science & Technology 2023; 57:18067-18079. [PMID: 37279189 PMCID: PMC10666537 DOI: 10.1021/acs.est.3c00304] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 01/12/2023] [Revised: 05/15/2023] [Accepted: 05/15/2023] [Indexed: 06/08/2023]
Abstract
Nontarget high-resolution mass spectrometry screening (NTS HRMS/MS) can detect thousands of organic substances in environmental samples. However, new strategies are needed to focus time-intensive identification efforts on features with the highest potential to cause adverse effects instead of the most abundant ones. To address this challenge, we developed MLinvitroTox, a machine learning framework that uses molecular fingerprints derived from fragmentation spectra (MS2) for a rapid classification of thousands of unidentified HRMS/MS features as toxic/nontoxic based on nearly 400 target-specific and over 100 cytotoxic endpoints from ToxCast/Tox21. Model development results demonstrated that using customized molecular fingerprints and models, over a quarter of toxic endpoints and the majority of the associated mechanistic targets could be accurately predicted with sensitivities exceeding 0.95. Notably, SIRIUS molecular fingerprints and XGBoost (Extreme Gradient Boosting) models with SMOTE (Synthetic Minority Oversampling Technique) for handling data imbalance were a universally successful and robust modeling configuration. Validation of MLinvitroTox on MassBank spectra showed that toxicity could be predicted from molecular fingerprints derived from MS2 with an average balanced accuracy of 0.75. By applying MLinvitroTox to environmental HRMS/MS data, we confirmed the experimental results obtained with target analysis and narrowed the analytical focus from tens of thousands of detected signals to 783 features linked to potential toxicity, including 109 spectral matches and 30 compounds with confirmed toxic activity.
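Balanced accuracy, the metric reported above, is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when toxic compounds are rare. The sketch below illustrates the difference on made-up labels; the function name and the data are assumptions for illustration and are not taken from MLinvitroTox.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on 1s) and specificity (recall on 0s)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Made-up imbalanced example: 2 toxic (1) and 8 nontoxic (0) features.
    y_true = [1, 1] + [0] * 8
    always_nontoxic = [0] * 10
    plain = sum(t == p for t, p in zip(y_true, always_nontoxic)) / len(y_true)
    print(plain)                                       # 0.8, looks good
    print(balanced_accuracy(y_true, always_nontoxic))  # 0.5, no better than chance
```

A classifier that never predicts the minority class scores well on plain accuracy but only 0.5 on balanced accuracy, which is the kind of failure that oversampling methods such as SMOTE are meant to mitigate.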
Affiliation(s)
- Katarzyna Arturi
  - Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Ueberlandstrasse 133, 8600 Dübendorf, Switzerland
- Juliane Hollender
  - Department of Environmental Chemistry, Swiss Federal Institute of Aquatic Science and Technology (Eawag), Ueberlandstrasse 133, 8600 Dübendorf, Switzerland
  - Institute of Biogeochemistry and Pollution Dynamics, Eidgenössische Technische Hochschule Zürich (ETH Zurich), Rämistrasse 101, 8092 Zürich, Switzerland
4
Joeres R, Bojar D, Kalinina OV. GlyLES: Grammar-based Parsing of Glycans from IUPAC-condensed to SMILES. J Cheminform 2023; 15:37. [PMID: 36959676 PMCID: PMC10035253 DOI: 10.1186/s13321-023-00704-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 02/18/2023] [Indexed: 03/25/2023]
Abstract
Glycans are important polysaccharides on cellular surfaces that are bound to glycoproteins and glycolipids. These are one of the most common post-translational modifications of proteins in eukaryotic cells. They play important roles in protein folding, cell-cell interactions, and other extracellular processes. Changes in glycan structures may influence the course of different diseases, such as infections or cancer. Glycans are commonly represented using the IUPAC-condensed notation. IUPAC-condensed is a textual representation of glycans operating on the same topological level as the Symbol Nomenclature for Glycans (SNFG) that assigns colored, geometrical shapes to the main monomers. These symbols are then connected in tree-like structures, visualizing the glycan structure on a topological level. Yet for a representation on the atomic level, notations such as SMILES should be used. To our knowledge, there is no easy-to-use, general, open-source, and offline tool to convert the IUPAC-condensed notation to SMILES. Here, we present the open-access Python package GlyLES for the generalizable generation of SMILES representations out of IUPAC-condensed representations. GlyLES uses a grammar to read in the monomer tree from the IUPAC-condensed notation. From this tree, the tool can compute the atomic structures of each monomer based on their IUPAC-condensed descriptions. In the last step, it merges all monomers into the atomic structure of a glycan in the SMILES notation. GlyLES is the first package that allows conversion from the IUPAC-condensed notation of glycans to SMILES strings. This may have multiple applications, including straightforward visualization, substructure search, molecular modeling and docking, and a new featurization strategy for machine-learning algorithms. GlyLES is available at https://github.com/kalininalab/GlyLES.
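To make the first step described above concrete (reading a monomer tree out of an IUPAC-condensed string), here is a toy sketch. It is not GlyLES and does not use its API or grammar: the tokenizer, the `parse_glycan` function, and the dictionary-based tree are hypothetical, and the sketch handles only plain monomer names, linkages such as `(a1-3)`, and square-bracket branches. It relies on the convention that the rightmost monomer is the root and that each linkage belongs to the monomer on its left.

```python
import re

# Tokens: branch brackets, linkages such as "(a1-3)", and monomer names.
TOKEN = re.compile(r"\[|\]|\([^()]*\)|[A-Za-z0-9]+")


def tokenize(glycan):
    tokens = TOKEN.findall(glycan)
    if "".join(tokens) != glycan:
        raise ValueError(f"unexpected characters in {glycan!r}")
    return tokens


def parse_node(tokens):
    """Consume tokens from the right; return the subtree rooted at the last monomer."""
    if not tokens or tokens[-1][0] in "[](":
        raise ValueError("expected a monomer name")
    node = {"name": tokens.pop(), "children": []}
    while tokens and tokens[-1] != "[":
        in_branch = tokens[-1] == "]"  # read right to left, "]" opens a branch
        if in_branch:
            tokens.pop()
        if not tokens or not tokens[-1].startswith("("):
            raise ValueError("expected a linkage such as (a1-3)")
        linkage = tokens.pop().strip("()")
        child = parse_node(tokens)
        if in_branch and (not tokens or tokens.pop() != "["):
            raise ValueError("unbalanced brackets")
        node["children"].append((linkage, child))
    return node


def parse_glycan(glycan):
    tokens = tokenize(glycan)
    root = parse_node(tokens)
    if tokens:
        raise ValueError("unbalanced brackets")
    return root


def count_monomers(node):
    return 1 + sum(count_monomers(child) for _, child in node["children"])


if __name__ == "__main__":
    tree = parse_glycan("Man(a1-3)[Man(a1-6)]Man(b1-4)GlcNAc(b1-4)GlcNAc")
    print(tree["name"], count_monomers(tree))
```

A real converter would additionally need per-monomer atomic templates, modifications, and stereochemistry in order to emit SMILES; this sketch stops at the topological tree.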
Affiliation(s)
- Roman Joeres
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbruecken, Germany (GRID: grid.7490.a; ISNI: 0000 0001 2238 295X)
  - Center for Bioinformatics, Saarland University, Saarbruecken, Germany (GRID: grid.11749.3a; ISNI: 0000 0001 2167 7588)
- Daniel Bojar
  - Department of Chemistry and Molecular Biology, University of Gothenburg, Gothenburg, Sweden (GRID: grid.8761.8; ISNI: 0000 0000 9919 9582)
  - Wallenberg Centre for Molecular and Translational Medicine, University of Gothenburg, Gothenburg, Sweden (GRID: grid.8761.8; ISNI: 0000 0000 9919 9582)
- Olga V. Kalinina
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Saarbruecken, Germany (GRID: grid.7490.a; ISNI: 0000 0001 2238 295X)
  - Center for Bioinformatics, Saarland University, Saarbruecken, Germany (GRID: grid.11749.3a; ISNI: 0000 0001 2167 7588)
  - Faculty of Medicine, Saarland University, Homburg, Germany (GRID: grid.11749.3a; ISNI: 0000 0001 2167 7588)
5
Design of short peptides and peptide amphiphiles as collagen mimics and an investigation of their interactions with collagen using molecular dynamics simulations and docking studies. J Mol Model 2022; 29:19. [PMID: 36565373 DOI: 10.1007/s00894-022-05419-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 08/03/2022] [Accepted: 12/13/2022] [Indexed: 12/25/2022]
Abstract
Short peptide sequences and bolaamphiphiles derived from natural proteins are gaining importance due to their ability to form unique nanoscale architectures for a variety of biological applications. In this work, we have designed six short peptides (triplet or monomeric forms) and two peptide bolaamphiphiles that either incorporate the bioactive collagen motif (Gly-X-Y) or sequences where Gly, Pro, or hydroxyproline (Hyp) are replaced by Ala or His. For the bolaamphiphiles, a malate moiety was used as the aliphatic linker for connecting His with Hyp to create collagen mimics. Stability of the assemblies was assessed through molecular dynamics simulations. The results indicated that (Pro-Ala-His)3 and (Ala-His-Hyp)3 formed the most stable structures, while the amphiphiles and the monomers showed some disintegration over the course of the 200 ns simulation. Most of these nevertheless regained structural integrity and formed fibrillar structures and micelles by the end of the simulation, likely due to the formation of more thermodynamically stable conformations. Replica exchange molecular dynamics (REMD) simulations were also conducted, in which the sequences were simulated at different temperatures. Our results showed excellent convergence in most cases compared to constant temperature molecular dynamics simulation. Furthermore, molecular docking and MD simulations of the sequences bound to collagen triple helix structure revealed that several of the sequences had a high binding affinity and formed stable complexes, particularly (Pro-Ala-His)3 and (Ala-His-Hyp)3. Thus, we have designed new hybrid-peptide-based sequences which may be developed for potential applications as biomaterials for tissue engineering or drug delivery.
6
Dolfus U, Briem H, Rarey M. Visualizing Generic Reaction Patterns. J Chem Inf Model 2022; 62:4680-4689. [PMID: 36169383 DOI: 10.1021/acs.jcim.2c00992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Abstract
Reaction schemes for organic molecules play a crucial role in modern in silico drug design processes. In contrast to the classical drawn reaction diagrams, computational chemists prefer SMARTS-based line notations due to a substantially increased expressiveness and precision. They are used to search databases, calculate synthesizability, generate new molecules, or simulate novel reactions. Working with computer-readable representations of reaction schemes can be challenging due to the complexity of the features to be represented. Line representations of reaction schemes can often be cryptic, even to experienced users. To simplify the work with Reaction SMARTS for synthetic, computational, and medicinal chemists, we introduce a visualization technique for reaction schemes and provide a respective tool, called ReactionViewer. ReactionViewer is able to convert reaction schemes encoded as Reaction SMILES, Reaction SMARTS, or SMIRKS into a visual representation. The visualization technique is based on the concept of structure diagrams and follows IUPAC's "Compendium of Chemical Terminology" definition of chemical reaction equations for the reaction symbols. We demonstrate the applicability of the method using two data sets of organic synthesis reaction schemes taken from recent publications. We discuss various properties of the visualization and highlight its readability and interpretability.
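As a small illustration of why reaction line notations can look cryptic, the sketch below takes a Reaction SMILES string apart using the standard `reactants>agents>products` layout, with `.` separating molecules and `:n` inside brackets marking atom maps. It is unrelated to ReactionViewer; the function names and example strings are hypothetical, and the sketch performs plain string handling without validating the chemistry.

```python
import re


def split_reaction_smiles(rxn):
    """Split 'reactants>agents>products' into three lists of molecule strings."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("a reaction SMILES needs exactly two '>' separators")
    # Molecules within each role are separated by '.'; empty roles give [].
    return tuple([mol for mol in part.split(".") if mol] for part in parts)


def atom_map_numbers(smiles):
    """Return atom-map numbers, i.e. the n in bracket atoms such as [CH3:1]."""
    return [int(n) for n in re.findall(r":(\d+)\]", smiles)]


if __name__ == "__main__":
    # Made-up esterification written as reactants>agents>products.
    reactants, agents, products = split_reaction_smiles(
        "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
    )
    print(reactants, agents, products)
    # Made-up mapped reaction with an empty agent field.
    print(atom_map_numbers("[CH3:1][OH:2]>>[CH3:1][Cl:3]"))
```

Even for this simple case, the roles of the molecules and the atom correspondence have to be decoded from punctuation alone, which is the readability problem a graphical depiction addresses.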
Affiliation(s)
- Uschi Dolfus
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
- Hans Briem
  - Bayer AG, Research and Development, Pharmaceuticals, Computational Molecular Design Berlin, Building S110, 711, 13342 Berlin, Germany
- Matthias Rarey
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
7
Molecular insights on ABL kinase activation using tree-based machine learning models and molecular docking. Mol Divers 2021; 25:1301-1314. [PMID: 34191245 PMCID: PMC8241884 DOI: 10.1007/s11030-021-10261-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/31/2021] [Accepted: 06/18/2021] [Indexed: 12/14/2022]
Abstract
Abelson kinase (c-Abl) is a non-receptor tyrosine kinase involved in several biological processes essential for cell differentiation, migration, proliferation, and survival. This enzyme's activation might be an alternative strategy for treating diseases such as neutropenia induced by chemotherapy, prostate, and breast cancer. Recently, a series of compounds that promote the activation of c-Abl has been identified, opening a promising ground for c-Abl drug development. Structure-based drug design (SBDD) and ligand-based drug design (LBDD) methodologies have significantly impacted recent drug development initiatives. Here, we combined SBDD and LBDD approaches to characterize critical chemical properties and interactions of identified c-Abl's activators. We used molecular docking simulations combined with tree-based machine learning models (decision tree, AdaBoost, and random forest) to understand the c-Abl activators' structural features required for binding to the myristoyl pocket and, consequently, to promote enzyme and cellular activation. We obtained predictive and robust models with Matthews correlation coefficient values higher than 0.4 for all endpoints and identified characteristics that led to constructing a structure–activity relationship model (SAR).
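The Matthews correlation coefficient (MCC) cited above summarizes a binary confusion matrix in a single value between -1 and 1, with 0 corresponding to chance-level prediction. The sketch below computes it from the four confusion-matrix counts; the function name and the example counts are made up for illustration and do not come from the paper.

```python
from math import sqrt


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Made-up counts: 6 true positives, 3 true negatives,
    # 1 false positive, 2 false negatives.
    print(round(matthews_corrcoef(tp=6, tn=3, fp=1, fn=2), 3))
```

Because every cell of the confusion matrix enters the formula, MCC stays low when a model does well on only one class, which makes it a common choice for small or imbalanced activity data sets.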
8
Ehrt C, Krause B, Schmidt R, Ehmki ESR, Rarey M. SMARTS.plus - A Toolbox for Chemical Pattern Design. Mol Inform 2020; 39:e2000216. [PMID: 32997890 PMCID: PMC7757167 DOI: 10.1002/minf.202000216] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/18/2020] [Accepted: 09/28/2020] [Indexed: 11/06/2022]
Abstract
The number of publications concerning Pan-Assay Interference Compounds and related problematic structural motifs in screening libraries is constantly growing. In consequence, filter collections are merged, extended but also critically discussed. Due to the complexity of the chemical pattern language SMARTS, an easy-to-use toolbox enabling every chemist to understand, design and modify chemical patterns is urgently needed. Over the past decade, we developed a series of software tools for visualizing, editing, creating, and analysing chemical patterns. Herein, we highlight how most of these tools can now be easily used as part of the novel SMARTS.plus web server (https://smarts.plus/). As a showcase, we demonstrate how researchers can apply the web server tools within minutes to derive novel SMARTS patterns for the filtering of frequent hitters from their screening libraries with only a little experience with the SMARTS language.
Affiliation(s)
- Christiane Ehrt
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146, Hamburg, Germany
- Bennet Krause
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146, Hamburg, Germany
- Robert Schmidt
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146, Hamburg, Germany
- Emanuel S R Ehmki
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146, Hamburg, Germany
- Matthias Rarey
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146, Hamburg, Germany