1
García-Ortegón M, Seal S, Rasmussen C, Bender A, Bacallado S. Graph neural processes for molecules: an evaluation on docking scores and strategies to improve generalization. J Cheminform 2024; 16:115. [PMID: 39443970 PMCID: PMC11515514 DOI: 10.1186/s13321-024-00904-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 09/13/2024] [Indexed: 10/25/2024]
Abstract
Neural processes (NPs) are models for meta-learning which output uncertainty estimates. So far, most studies of NPs have focused on low-dimensional datasets of highly correlated tasks. While these homogeneous datasets are useful for benchmarking, they may not be representative of realistic transfer learning. In particular, applications in scientific research may prove especially challenging due to the potential novelty of meta-testing tasks. Molecular property prediction is one such research area that is characterized by sparse datasets of many functions on a shared molecular space. In this paper, we study the application of graph NPs to molecular property prediction with DOCKSTRING, a diverse dataset of docking scores. Graph NPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in chemoinformatics, as well as alternative techniques for transfer learning and meta-learning. In order to increase meta-generalization to divergent test functions, we propose fine-tuning strategies that adapt the parameters of NPs. We find that adaptation can substantially increase NPs' regression performance while maintaining good calibration of uncertainty estimates. Finally, we present a Bayesian optimization experiment which showcases the potential advantages of NPs over Gaussian processes in iterative screening. Overall, our results suggest that NPs on molecular graphs hold great potential for molecular property prediction in the low-data setting.
SCIENTIFIC CONTRIBUTION: Neural processes are a family of meta-learning algorithms which deal with data scarcity by transferring information across tasks and making probabilistic predictions. We evaluate their performance on regression and optimization molecular tasks using docking scores, finding them to outperform classical single-task and transfer-learning models. We examine the issue of generalization to divergent test tasks, which is a general concern of meta-learning algorithms in science, and propose strategies to alleviate it.
Affiliation(s)
- Miguel García-Ortegón
- Statistical Laboratory, University of Cambridge, Wilberforce Rd, Cambridge, CB3 0WA, UK.
- Department of Engineering, University of Cambridge, Trumpington St, Cambridge, CB2 1PZ, UK.
- Department of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK.
- Srijit Seal
- Imaging Platform, Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, USA
- Carl Rasmussen
- Department of Engineering, University of Cambridge, Trumpington St, Cambridge, CB2 1PZ, UK
- Andreas Bender
- Department of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK
- Sergio Bacallado
- Statistical Laboratory, University of Cambridge, Wilberforce Rd, Cambridge, CB3 0WA, UK
2
Zhou W, Zhou Y, Zhang X, Huang T, Zhang R, Li D, Xie X, Wang Y, Xu M. Development and Validation of an Explainable Machine Learning Model for Identification of Hyper-Functioning Parathyroid Glands from High-Frequency Ultrasonographic Images. Ultrasound Med Biol 2024; 50:1506-1514. [PMID: 39054242 DOI: 10.1016/j.ultrasmedbio.2024.05.026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/06/2024] [Revised: 04/25/2024] [Accepted: 05/30/2024] [Indexed: 07/27/2024]
Abstract
OBJECTIVE: To develop and validate a machine learning (ML) model based on high-frequency ultrasound (HFUS) images to identify the functional status of parathyroid glands (PTGs) in secondary hyper-parathyroidism (SHPT) patients.
METHODS: This retrospective study enrolled 60 SHPT patients (27 female, 33 male; mean age: 51.2 years) with 184 PTGs detected from February 2016 to June 2022. All enrolled patients underwent single-photon emission computed tomography/computed tomography and contrast-enhanced ultrasound examinations. The PTGs were randomly divided into training (n = 147) and testing (n = 37) datasets. Four ML classifiers were evaluated, and a combined model incorporating multi-modal HFUS visual signs and radiomics features was constructed based on the optimal classifier. Model performance was compared in terms of discrimination, calibration and clinical utility. The Shapley additive explanation method was used to explain and visualize the main predictors of the optimal model.
RESULTS: The model using a random forest classifier outperformed the other classifiers. Based on the optimal classifier's features, the model constructed from ultrasound visual and ML features achieved favorable performance in the prediction of hyper-functioning PTGs. Compared with the traditional visual model, the ultrasound-based ML model achieved a significant (p = 0.03) improvement in area under the curve (0.859 vs. 0.629), as well as higher sensitivity (100.0% vs. 94.1%) and accuracy (86.5% vs. 67.6%). Among the predictors contributing to the model, large size and high echogenic heterogeneity of PTGs in ultrasonographic images were more often associated with a high risk of hyper-functioning PTGs.
CONCLUSION: The ultrasound-based ML model for identifying hyper-functioning PTGs in SHPT patients showed good performance and interpretability using high-frequency ultrasonographic images, which may facilitate clinical management.
Affiliation(s)
- Wenwen Zhou
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Yu Zhou
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, 518055, China
- Xiaoer Zhang
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Tongyi Huang
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Rui Zhang
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Di Li
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Xiaoyan Xie
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
- Yi Wang
- National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen, 518055, China.
- Ming Xu
- Department of Medical Ultrasound, Institute for Diagnostic and Interventional Ultrasound, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510080, China
3
Heyndrickx W, Mervin L, Morawietz T, Sturm N, Friedrich L, Zalewski A, Pentina A, Humbeck L, Oldenhof M, Niwayama R, Schmidtke P, Fechner N, Simm J, Arany A, Drizard N, Jabal R, Afanasyeva A, Loeb R, Verma S, Harnqvist S, Holmes M, Pejo B, Telenczuk M, Holway N, Dieckmann A, Rieke N, Zumsande F, Clevert DA, Krug M, Luscombe C, Green D, Ertl P, Antal P, Marcus D, Do Huu N, Fuji H, Pickett S, Acs G, Boniface E, Beck B, Sun Y, Gohier A, Rippmann F, Engkvist O, Göller AH, Moreau Y, Galtier MN, Schuffenhauer A, Ceulemans H. MELLODDY: Cross-pharma Federated Learning at Unprecedented Scale Unlocks Benefits in QSAR without Compromising Proprietary Information. J Chem Inf Model 2024; 64:2331-2344. [PMID: 37642660 PMCID: PMC11005050 DOI: 10.1021/acs.jcim.3c00799] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 05/25/2023] [Indexed: 08/31/2023]
Abstract
Federated multipartner machine learning has been touted as an appealing and efficient method to increase the effective training data volume and thereby the predictivity of models, particularly when the generation of training data is resource-intensive. In the landmark MELLODDY project, indeed, each of ten pharmaceutical companies realized aggregated improvements on its own classification or regression models through federated learning. To this end, they leveraged a novel implementation extending multitask learning across partners, on a platform audited for privacy and security. The experiments involved an unprecedented cross-pharma data set of 2.6+ billion confidential experimental activity data points, documenting 21+ million physical small molecules and 40+ thousand assays in on-target and secondary pharmacodynamics and pharmacokinetics. Appropriate complementary metrics were developed to evaluate the predictive performance in the federated setting. In addition to predictive performance increases in labeled space, the results point toward an extended applicability domain in federated learning. Increases in collective training data volume, including by means of auxiliary data resulting from single concentration high-throughput and imaging assays, continued to boost predictive performance, albeit with a saturating return. Markedly higher improvements were observed for the pharmacokinetics and safety panel assay-based task subsets.
Affiliation(s)
- Lewis Mervin
- AstraZeneca R&D, Biomedical Campus, 1 Francis Crick Ave, Cambridge CB2 0SL, U.K.
- Tobias Morawietz
- Bayer Pharma AG, Global Drug Discovery, Chemical Research, Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
- Noé Sturm
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Lukas Friedrich
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Adam Zalewski
- Amgen Research (Munich) GmbH, Staffelseestraße 2, Munich 81477, Germany
- Anastasia Pentina
- Bayer AG, Machine Learning Research, Research & Development, Pharmaceuticals, Berlin 10117, Germany
- Lina Humbeck
- BI Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
- Martijn Oldenhof
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Ritsuya Niwayama
- Institut de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
- Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Jaak Simm
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Adam Arany
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Rama Jabal
- Iktos, 65 rue de Prony, Paris 75017, France
- Arina Afanasyeva
- Modality Informatics Group, Digital Research Solutions, Advanced Informatics & Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Regis Loeb
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Shlok Verma
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Simon Harnqvist
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Matthew Holmes
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Balazs Pejo
- Budapest University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
- Nicholas Holway
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Arne Dieckmann
- Bayer AG, API Production, Product Supply, Pharmaceuticals, Ernst-Schering-Straße 14, Bergkamen 59192, Germany
- Nicola Rieke
- NVIDIA GmbH, Floessergasse 2, Munich 81369, Germany
- Djork-Arné Clevert
- Bayer AG, Machine Learning Research, Research & Development, Pharmaceuticals, Berlin 10117, Germany
- Michael Krug
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Christopher Luscombe
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Darren Green
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Peter Ertl
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Peter Antal
- Budapest University of Technology and Economics, Department of Measurement and Information Systems, Műegyetem rkp. 3, Budapest 1111, Hungary
- David Marcus
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Hideyoshi Fuji
- Modality Informatics Group, Digital Research Solutions, Advanced Informatics & Analytics, Astellas Pharma Inc., 21 Miyukigaoka, Tsukuba-shi, Ibaraki 305-8585, Japan
- Stephen Pickett
- GlaxoSmithKline, Computational Sciences, Gunnels Wood Road Stevenage, Herts SG1 2NY, U.K.
- Gergely Acs
- Budapest University of Technology and Economics, Department of Networked Systems and Services, Műegyetem rkp. 3, Budapest 1111, Hungary
- Eric Boniface
- Substra Foundation - Labelia Labs, 4 rue Voltaire, Nantes 44000, France
- Bernd Beck
- BI Medicinal Chemistry Department, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Str. 65, Biberach an der Riss 88397, Germany
- Yax Sun
- Amgen Research, 1 Amgen Center Drive, Thousand Oaks, California 92130, United States
- Arnaud Gohier
- Institut de recherches Servier, 125 chemin de ronde Croissy-sur-Seine, Île-de-France 78290, France
- Friedrich Rippmann
- Merck KGaA, Global Research & Development, Frankfurter Strasse 250, Darmstadt 64293, Germany
- Ola Engkvist
- AstraZeneca, Molecular AI, Discovery Sciences, R&D, Pepparedsleden 1, Mölndal 431 50, Sweden
- Andreas H. Göller
- Bayer Pharma AG, Global Drug Discovery, Chemical Research, Computational Chemistry, Aprather Weg 18 a, Wuppertal 42096, Germany
- Yves Moreau
- KU Leuven, ESAT-STADIUS, Kasteelpark Arenberg 10, Heverlee 3001, Belgium
- Ansgar Schuffenhauer
- Novartis Institutes for BioMedical Research, Novartis Campus, Basel 4002, Switzerland
- Hugo Ceulemans
- Janssen Pharmaceutica NV, Turnhoutseweg 30, Beerse 2340, Belgium
4
Lengauer T. Yves Moreau has received the 2023 Einstein Foundation Individual Award for Promoting Quality in Research. Bioinform Adv 2024; 4:vbae039. [PMID: 38566919 PMCID: PMC10985674 DOI: 10.1093/bioadv/vbae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 03/08/2024] [Indexed: 04/04/2024]
Affiliation(s)
- Thomas Lengauer
- Max Planck Institute for Informatics, Saarland Informatics Campus, 66123 Saarbrücken, Germany
5
Huang Z, Lou S, Wang H, Li W, Liu G, Tang Y. AttentiveSkin: To Predict Skin Corrosion/Irritation Potentials of Chemicals via Explainable Machine Learning Methods. Chem Res Toxicol 2024; 37:361-373. [PMID: 38294881 DOI: 10.1021/acs.chemrestox.3c00332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/02/2024]
Abstract
Skin Corrosion/Irritation (Corr./Irrit.) has long been a health hazard in the Globally Harmonized System (GHS). Several in silico models have been built to predict Skin Corr./Irrit. as an alternative to the increasingly restricted animal testing. However, current studies are limited by data amount/quality and model availability. To address these issues, we compiled a traceable consensus GHS data set comprising 731 Corr., 1283 Irrit., and 1205 negative (Neg.) samples from 6 governmental databases and 2 external data sets. Then, a series of binary classifiers were developed with five machine learning (ML) algorithms and six molecular representations. For 10-fold cross-validation, the best Corr. vs Neg. classifier achieved an Area Under the Receiver Operating Characteristic Curve (AUC) of 97.1%, while the best Irrit. vs Neg. classifier achieved an AUC of 84.7%. Compared with existing in silico tools on external validation, our Attentive FP classifiers showed the highest metrics on Corr. vs Neg. and the second highest accuracy on Irrit. vs Neg. The SHapley Additive exPlanation approach was further applied to figure out important molecular features, and the attention weights were visualized to perform interpretable prediction. Structural alerts associated with Skin Corr./Irrit. were also identified. The interpretable Attentive FP classifiers were integrated into the software AttentiveSkin at https://github.com/BeeBeeWong/AttentiveSkin. The conventional ML classifiers are also provided on our platform admetSAR at http://lmmd.ecust.edu.cn/admetsar2/. Considering the data deficiency and the limited model availability of Skin Corr./Irrit., we believe that our data set and models could facilitate chemical safety assessment and relevant studies.
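The AUC values quoted in this abstract can be read as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, assuming binary labels and real-valued classifier scores. The function name `roc_auc` and the toy data are illustrative only and are not taken from the AttentiveSkin software.

```python
def roc_auc(labels, scores):
    """Pairwise (rank-based) AUC: fraction of positive/negative pairs
    in which the positive sample is scored higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 2 positives, 3 negatives.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 5 of the 6 pairs are ordered correctly
```

The quadratic pair loop is chosen for clarity; sorting-based implementations give the same value more efficiently on large data sets.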
Affiliation(s)
- Zejun Huang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Shang Lou
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Haoqiang Wang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Weihua Li
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Guixia Liu
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
- Yun Tang
- Shanghai Frontiers Science Center of Optogenetic Techniques for Cell Metabolism, Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
6
Jaradat NJ, Hatmal M, Alqudah D, Taha MO. Computational workflow for discovering small molecular binders for shallow binding sites by integrating molecular dynamics simulation, pharmacophore modeling, and machine learning: STAT3 as case study. J Comput Aided Mol Des 2023; 37:659-678. [PMID: 37597062 DOI: 10.1007/s10822-023-00528-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 07/26/2023] [Indexed: 08/21/2023]
Abstract
STAT3 belongs to a family of seven transcription factors. It plays an important role in activating the transcription of various genes involved in a variety of cellular processes. High levels of STAT3 are detected in several types of cancer. Hence, STAT3 inhibition is considered a promising therapeutic anti-cancer strategy. However, since STAT3 inhibitors bind to the shallow SH2 domain of the protein, hydration water molecules are expected to play a significant role in ligand binding, complicating the discovery of potent binders. To remedy this issue, we herein propose to extract pharmacophores from molecular dynamics (MD) frames of a potent co-crystallized ligand complexed within the STAT3 SH2 domain. Subsequently, we employ a genetic function algorithm coupled with machine learning (GFA-ML) to explore the optimal combination of MD-derived pharmacophores that can account for the variations in bioactivity among a list of inhibitors. To enhance the dataset, the training and testing lists were augmented nearly 100-fold by considering multiple conformers of the ligands. A single significant pharmacophore emerged after 188 ns of MD simulation to represent STAT3-ligand binding. Screening the National Cancer Institute (NCI) database with this model identified one low-micromolar inhibitor that most likely binds to the SH2 domain of STAT3 and inhibits this pathway.
Affiliation(s)
- Nour Jamal Jaradat
- Department of Medical Laboratory Sciences, Faculty of Applied Health Sciences, The Hashemite University, P.O. Box 330127, Zarqa, 13133, Jordan
- Mamon Hatmal
- Department of Medical Laboratory Sciences, Faculty of Applied Health Sciences, The Hashemite University, P.O. Box 330127, Zarqa, 13133, Jordan
- Dana Alqudah
- Cell Therapy Center, the University of Jordan, Amman, 11942, Jordan
- Mutasem Omar Taha
- Department of Pharmaceutical Sciences, Faculty of Pharmacy, University of Jordan, Amman, Jordan.
7
Rodríguez-Belenguer P, March-Vila E, Pastor M, Mangas-Sanjuan V, Soria-Olivas E. Usage of model combination in computational toxicology. Toxicol Lett 2023; 389:34-44. [PMID: 37890682 DOI: 10.1016/j.toxlet.2023.10.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2023] [Revised: 10/17/2023] [Accepted: 10/24/2023] [Indexed: 10/29/2023]
Abstract
New Approach Methodologies (NAMs) have ushered in a new era in the field of toxicology, aiming to replace animal testing. However, despite these advancements, they are not exempt from the inherent complexities associated with the study's endpoint. In this review, we have identified three major groups of complexities: mechanistic, chemical space, and methodological. The mechanistic complexity arises from interconnected biological processes within a network that are challenging to model in a single step. In the second group, chemical space complexity exhibits significant dissimilarity between compounds in the training and test series. The third group encompasses algorithmic and molecular descriptor limitations and typical class imbalance problems. To address these complexities, this work provides a guide to the usage of a combination of predictive Quantitative Structure-Activity Relationship (QSAR) models, known as metamodels. This combination of low-level models (LLMs) enables a more precise approach to the problem by focusing on different sub-mechanisms or sub-processes. For mechanistic complexity, multiple Molecular Initiating Events (MIEs) or levels of information are combined to form a mechanistic-based metamodel. Regarding the complexity arising from chemical space, two types of approaches were reviewed to construct a fragment-based chemical space metamodel: those with and without structure sharing. Metamodels with structure sharing utilize unsupervised strategies to identify data patterns and build low-level models for each cluster, which are then combined. For situations without structure sharing due to pharmaceutical industry intellectual property, prediction sharing and federated learning approaches have been reviewed. Lastly, to tackle methodological complexity, various algorithms are combined to overcome their limitations, diverse descriptors are employed to enhance problem definition, and balanced dataset combinations are used to address class imbalance issues (methodological-based metamodels). Remarkably, metamodels consistently outperformed classical QSAR models across all cases, highlighting the importance of alternatives to classical QSAR models when faced with such complexities.
Affiliation(s)
- Pablo Rodríguez-Belenguer
- Research Programme on Biomedical Informatics (GRIB), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain; Department of Pharmacy and Pharmaceutical Technology and Parasitology, Universitat de València, 46100 Valencia, Spain
- Eric March-Vila
- Research Programme on Biomedical Informatics (GRIB), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain
- Manuel Pastor
- Research Programme on Biomedical Informatics (GRIB), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain
- Victor Mangas-Sanjuan
- Department of Pharmacy and Pharmaceutical Technology and Parasitology, Universitat de València, 46100 Valencia, Spain; Interuniversity Research Institute for Molecular Recognition and Technological Development, Universitat Politècnica de València, 46100 Valencia, Spain
- Emilio Soria-Olivas
- IDAL, Intelligent Data Analysis Laboratory, ETSE, Universitat de València, 46100 Valencia, Spain.
8
Beckers M, Sturm N, Sirockin F, Fechner N, Stiefl N. Prediction of Small-Molecule Developability Using Large-Scale In Silico ADMET Models. J Med Chem 2023; 66:14047-14060. [PMID: 37815201 DOI: 10.1021/acs.jmedchem.3c01083] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 10/11/2023]
Abstract
Early in silico assessment of the potential of a series of compounds to deliver a drug is one of the major challenges in computer-assisted drug design. The goal is to identify the right chemical series of compounds out of a large chemical space to then subsequently prioritize the molecules with the highest potential to become a drug. Although multiple approaches to assess compounds have been developed over decades, the quality of these predictors is often not good enough and compounds that agree with the respective estimates are not necessarily druglike. Here, we report a novel deep learning approach that leverages large-scale predictions of ∼100 ADMET assays to assess the potential of a compound to become a relevant drug candidate. The resulting score, which we termed bPK score, substantially outperforms previous approaches and showed strong discriminative performance on data sets where previous approaches did not.
Affiliation(s)
- Maximilian Beckers
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
- Noé Sturm
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
- Finton Sirockin
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
- Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
- Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Postfach, 4002 Basel, Switzerland
9
Schür C, Gasser L, Perez-Cruz F, Schirmer K, Baity-Jesi M. A benchmark dataset for machine learning in ecotoxicology. Sci Data 2023; 10:718. [PMID: 37853023 PMCID: PMC10584858 DOI: 10.1038/s41597-023-02612-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 05/31/2023] [Accepted: 09/28/2023] [Indexed: 10/20/2023]
Abstract
The use of machine learning for predicting ecotoxicological outcomes is promising, but underutilized. The curation of data with informative features requires both expertise in machine learning and a strong biological and ecotoxicological background, which we consider a barrier to entry for this kind of research. Additionally, model performances can only be compared across studies when the same dataset, cleaning, and splittings were used. Therefore, we provide ADORE, an extensive and well-described dataset on acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species as well as chemical properties and molecular representations. Apart from challenging other researchers to try and achieve the best model performances across the whole dataset, we propose specific relevant challenges on subsets of the data and include datasets and splittings corresponding to each of these challenges, as well as in-depth characterization and discussion of train-test splitting approaches.
Affiliation(s)
- Christoph Schür
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland.
- Lilian Gasser
- Swiss Data Science Center (SDSC), Zürich, Switzerland
- Fernando Perez-Cruz
- Swiss Data Science Center (SDSC), Zürich, Switzerland
- ETH Zürich: Department of Computer Science, Zürich, Switzerland
- Kristin Schirmer
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland
- ETH Zürich: Department of Environmental Systems Science, Zürich, Switzerland
- EPF Lausanne, School of Architecture, Civil and Environmental Engineering, Lausanne, Switzerland
- Marco Baity-Jesi
- Eawag, Swiss Federal Institute of Aquatic Science and Technology, Dübendorf, Switzerland
10
Feeney SV, Lui R, Guan D, Matthews S. Multiple Instance Learning Improves Ames Mutagenicity Prediction for Problematic Molecular Species. Chem Res Toxicol 2023; 36:1227-1237. [PMID: 37477941 DOI: 10.1021/acs.chemrestox.2c00372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023]
Abstract
The prediction of Ames mutagenicity continues to be a concern in both regulatory and pharmacological toxicology. Traditional quantitative structure-activity relationship (QSAR) models of mutagenicity make predictions based on molecular descriptors calculated on a chemical data set used in their training. However, it is known that molecules such as aromatic amines can be non-mutagenic themselves but metabolically activated by S9 rodent liver enzyme in Ames tests, forming molecules such as iminoquinones or amine substituents that better stabilize mutagenic nitrenium ions in known pathways of mutagenicity. Modern in silico modeling methods can implicitly model these metabolites through consideration of the structural elements relevant to their formation but do not include explicit modeling of these metabolites' potential activity. These metabolites do not have a known individual mutagenicity label and, in their current state, cannot be fitted into a traditional QSAR model. Multiple instance learning (MIL), however, can be applied to a group of metabolites and their parent under a single mutagenicity label. Here we trained MIL models on Ames data, first with an aromatic amines data set (n = 457), a class known to require metabolic activation, and subsequently on a larger data set (n = 6505) incorporating multiple molecular species. MIL was shown to be able to predict Ames mutagenicity with performance in line with previously established models (balanced accuracy = 0.778), suggesting its potential utility in Ames prediction applications. Furthermore, the MIL model predicted well on identified hard-to-predict molecule groups relative to the models in which these molecule groups were identified. These results are presumably due to the increased consideration of the metabolic contribution to the mutagenic outcome. Further exploration of MIL as a supplement to existing models could aid in the prediction of chemicals where implicit modeling of metabolites cannot fully grasp their characteristics. This paper demonstrates the potential of an MIL approach to modeling Ames tests with S9 and is particularly relevant to metabolically activated xenobiotic mutagens.
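As an illustration of the bag-level labelling idea in this abstract, the following is a minimal sketch only. It assumes the standard MIL convention that a bag (a parent compound plus its metabolites) is positive if any single instance is positive, implemented here as max pooling over per-instance scores. The abstract does not state which MIL algorithm or pooling rule the authors used, and the function names, compound names, scores, and threshold below are all hypothetical.

```python
from typing import Dict, List


def bag_score(instance_scores: List[float]) -> float:
    """Pool per-instance scores into one bag-level score.

    Max pooling encodes the standard MIL assumption (an assumption here,
    not taken from the paper): the bag is as mutagenic as its most
    mutagenic member, whether that is the parent or a metabolite.
    """
    if not instance_scores:
        raise ValueError("a bag needs at least one instance")
    return max(instance_scores)


def predict_bag(instance_scores: List[float], threshold: float = 0.5) -> int:
    """Return 1 (mutagenic) or 0 (non-mutagenic) for the whole bag."""
    return int(bag_score(instance_scores) >= threshold)


# Hypothetical bags: the first score in each list is the parent compound,
# the remaining scores are its predicted metabolites.
bags: Dict[str, List[float]] = {
    "aromatic_amine_A": [0.10, 0.85, 0.30],  # inactive parent, one active metabolite
    "compound_B": [0.20, 0.15],              # nothing in the bag looks active
}

predictions = {name: predict_bag(scores) for name, scores in bags.items()}
print(predictions)
```

In this toy setup the first bag is labelled positive because of a single metabolite, even though its parent scores low, which is the situation the abstract describes for metabolically activated aromatic amines.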
Affiliation(s)
- Samuel V Feeney
  - Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
- Raymond Lui
  - Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
- Davy Guan
  - Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
- Slade Matthews
  - Computational Pharmacology & Toxicology Laboratory, Discipline of Pharmacology, School of Pharmacy, Faculty of Medicine and Health, The University of Sydney, Sydney, NSW, 2006, Australia
|
11
|
Heid E, McGill CJ, Vermeire FH, Green WH. Characterizing Uncertainty in Machine Learning for Chemistry. J Chem Inf Model 2023; 63:4012-4029. [PMID: 37338239 PMCID: PMC10336963 DOI: 10.1021/acs.jcim.3c00373] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 03/08/2023] [Indexed: 06/21/2023]
Abstract
Characterizing uncertainty in machine learning models has recently gained interest in the context of machine learning reliability, robustness, safety, and active learning. Here, we separate the total uncertainty into contributions from noise in the data (aleatoric) and shortcomings of the model (epistemic), further dividing epistemic uncertainty into model bias and variance contributions. We systematically address the influence of noise, model bias, and model variance in the context of chemical property predictions, where the diverse nature of target properties and the vast chemical space give rise to many distinct sources of prediction error. We demonstrate that different sources of error can each be significant in different contexts and must be individually addressed during model development. Through controlled experiments on data sets of molecular properties, we show important trends in model performance associated with the level of noise in the data set, size of the data set, model architecture, molecule representation, ensemble size, and data set splitting. In particular, we show that 1) noise in the test set can limit a model's observed performance when the actual performance is much better, 2) using size-extensive model aggregation structures is crucial for extensive property prediction, and 3) ensembling is a reliable tool for uncertainty quantification and improvement specifically for the contribution of model variance. We develop general guidelines on how to improve an underperforming model when falling into different uncertainty contexts.
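The split between data noise and model variance described in this abstract can be sketched with a small ensemble. This is a minimal sketch, not the authors' procedure: it assumes each ensemble member outputs a predicted mean and a predicted noise variance for the same molecule, and it applies the law of total variance with equal member weights. The function name and all numbers are hypothetical. Model bias, the other epistemic component named in the abstract, is not captured by this calculation.

```python
from statistics import fmean, pvariance
from typing import Dict, List


def decompose_uncertainty(means: List[float], variances: List[float]) -> Dict[str, float]:
    """Split ensemble predictive variance into two parts.

    aleatoric      : average of the members' predicted noise variances
    model_variance : spread (population variance) of the members' means
    total          : their sum, by the law of total variance

    Assumes equally weighted members; model bias is not estimated here.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one mean and one variance per ensemble member")
    aleatoric = fmean(variances)
    model_variance = pvariance(means)
    return {
        "prediction": fmean(means),
        "aleatoric": aleatoric,
        "model_variance": model_variance,
        "total": aleatoric + model_variance,
    }


# Hypothetical outputs of a four-member ensemble for one molecule.
member_means = [1.0, 1.2, 0.8, 1.0]
member_variances = [0.04, 0.05, 0.03, 0.04]

result = decompose_uncertainty(member_means, member_variances)
print(result)
```

Adding members to the ensemble changes only how well the `model_variance` term is estimated; the `aleatoric` term reflects noise the members attribute to the data, which is consistent with the abstract's point that ensembling addresses the variance contribution specifically.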
Affiliation(s)
- Esther Heid
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Institute of Materials Chemistry, TU Wien, 1060 Vienna, Austria
- Charles J. McGill
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical and Life Science Engineering, Virginia Commonwealth University, Richmond, Virginia 23284, United States
- Florence H. Vermeire
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - Department of Chemical Engineering, KU Leuven, Celestijnenlaan 200F, B-3001 Leuven, Belgium
- William H. Green
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
|
12
|
Schaub J, Zander J, Zielesny A, Steinbeck C. Scaffold Generator: a Java library implementing molecular scaffold functionalities in the Chemistry Development Kit (CDK). J Cheminform 2022; 14:79. [PMID: 36357931 PMCID: PMC9650898 DOI: 10.1186/s13321-022-00656-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 02/24/2022] [Accepted: 10/30/2022] [Indexed: 11/12/2022] Open
Abstract
The concept of molecular scaffolds as defining core structures of organic molecules is utilised in many areas of chemistry and cheminformatics, e.g. drug design, chemical classification, or the analysis of high-throughput screening data. Here, we present Scaffold Generator, a comprehensive open library for the generation, handling, and display of molecular scaffolds, scaffold trees and networks. The new library is based on the Chemistry Development Kit (CDK) and highly customisable through multiple settings, e.g. five different structural framework definitions are available. For display of scaffold hierarchies, the open GraphStream Java library is utilised. Performance snapshots with natural products (NP) from the COCONUT (COlleCtion of Open Natural prodUcTs) database and drug molecules from DrugBank are reported. The generation of a scaffold network from more than 450,000 NP can be achieved within a single day.
Affiliation(s)
- Jonas Schaub
  - Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessing Strasse 8, 07743 Jena, Germany
- Julian Zander
  - Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665 Recklinghausen, Germany
- Achim Zielesny
  - Institute for Bioinformatics and Chemoinformatics, Westphalian University of Applied Sciences, August-Schmidt-Ring 10, 45665 Recklinghausen, Germany
- Christoph Steinbeck
  - Institute for Inorganic and Analytical Chemistry, Friedrich-Schiller-University Jena, Lessing Strasse 8, 07743 Jena, Germany
|