1
Özçelik R, Grisoni F. A hitchhiker's guide to deep chemical language processing for bioactivity prediction. DIGITAL DISCOVERY 2024:d4dd00311j. [PMID: 39726698 PMCID: PMC11667676 DOI: 10.1039/d4dd00311j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Accepted: 12/13/2024] [Indexed: 12/28/2024]
Abstract
Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (e.g., Simplified Molecular Input Line Entry Systems [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, three embedding strategies, across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological decisions, but it also equips researchers with practical recommendations on ideal choices, e.g., in terms of neural network architectures, molecular representations, and hyperparameter optimization.
Affiliation(s)
- Rıza Özçelik
  - Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. Biomedical Engineering, Eindhoven, Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands
- Francesca Grisoni
  - Eindhoven University of Technology, Institute for Complex Molecular Systems, Eindhoven AI Systems Institute, Dept. Biomedical Engineering, Eindhoven, Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands
2
Eugster R, Orsi M, Buttitta G, Serafini N, Tiboni M, Casettari L, Reymond JL, Aleandri S, Luciani P. Leveraging machine learning to streamline the development of liposomal drug delivery systems. J Control Release 2024; 376:1025-1038. [PMID: 39489466 DOI: 10.1016/j.jconrel.2024.10.065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 10/03/2024] [Accepted: 10/29/2024] [Indexed: 11/05/2024]
Abstract
Drug delivery systems efficiently and safely administer therapeutic agents to specific body sites. Liposomes, spherical vesicles made of phospholipid bilayers, have become a powerful tool in this field, especially with the rise of microfluidic manufacturing during the COVID-19 pandemic. Despite its efficiency, microfluidic liposomal production poses challenges, often requiring laborious optimization on a case-by-case basis. This is due to a lack of comprehensive understanding and robust methodologies, compounded by limited data on microfluidic production with varying lipids. Artificial intelligence offers promise in predicting lipid behaviour during microfluidic production, with the still unexploited potential of streamlining development. Herein we employ machine learning to predict critical quality attributes and process parameters for microfluidic-based liposome production. Validated models predict liposome formation, size, and production parameters, significantly advancing our understanding of lipid behaviour. Extensive model analysis enhanced interpretability and investigated underlying mechanisms, supporting the transition to microfluidic production. Unlocking the potential of machine learning in drug development can accelerate pharmaceutical innovation, making drug delivery systems more adaptable and accessible.
Affiliation(s)
- Remo Eugster
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
- Markus Orsi
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
- Giorgio Buttitta
  - Department of Chemistry and Technologies of Drugs, Sapienza University of Rome, Rome, Lazio, Italy
- Nicola Serafini
  - Department of Biomolecular Sciences, University of Urbino Carlo Bo, Urbino, PU, Italy
- Mattia Tiboni
  - Department of Biomolecular Sciences, University of Urbino Carlo Bo, Urbino, PU, Italy
- Luca Casettari
  - Department of Biomolecular Sciences, University of Urbino Carlo Bo, Urbino, PU, Italy
- Jean-Louis Reymond
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
- Simone Aleandri
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
- Paola Luciani
  - Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Bern, Switzerland
3
Torabi M, Haririan I, Foroumadi A, Ghanbari H, Ghasemi F. A deep learning model based on the BERT pre-trained model to predict the antiproliferative activity of anti-cancer chemical compounds. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2024; 35:971-992. [PMID: 39605280 DOI: 10.1080/1062936x.2024.2431486] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2024] [Accepted: 11/13/2024] [Indexed: 11/29/2024]
Abstract
Identifying new compounds with minimal side effects to enhance patients' quality of life is the ultimate goal of drug discovery. Due to the expensive and time-consuming nature of experimental investigations and the scarcity of data in traditional QSAR studies, deep transfer learning models, such as the BERT model, have recently been suggested. This study evaluated the model's performance in predicting the anti-proliferative activity of five cancer cell lines (HeLa, MCF7, MDA-MB231, PC3, and MDA-MB) using over 3,000 synthesized molecules from PubChem. The results indicated that the model could predict the class of designed small molecules with acceptable accuracy for most cell lines, except for PC3 and MDA-MB. The model's performance was further tested on an in-house dataset of approximately 25 small molecules per cell line, based on IC50 values. The model accurately predicted the biological activity class for HeLa with an accuracy of 0.77 ± 0.4 and demonstrated acceptable performance for MCF7 and MDA-MB231, with accuracy between 0.56 and 0.66. However, the results were less reliable for PC3 and HepG2. In conclusion, the ChemBERTa fine-tuned model shows potential for predicting outcomes on in-house datasets.
Affiliation(s)
- M Torabi
  - Biosensor Research Centre, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- I Haririan
  - Department of Pharmaceutics, Faculty of Pharmacy, Tehran University of Medical Sciences, Tehran, Iran
  - Department of Pharmaceutical Biomaterials and Medical Biomaterials Research Center (MBRC), Faculty of Pharmacy, Tehran University of Medical Sciences, Tehran, Iran
- A Foroumadi
  - Department of Medicinal Chemistry, Faculty of Pharmacy, Tehran University of Medical Sciences, Tehran, Iran
  - Drug Design and Development Research Center, The Institute of Pharmaceutical Sciences (TIPS), Tehran University of Medical Sciences, Tehran, Iran
- H Ghanbari
  - Department of Medical Nanotechnology, School of Advanced Technologies in Medicine, Tehran University of Medical Sciences, Tehran, Iran
- F Ghasemi
  - Department of Bioinformatics and Systems Biology, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
  - Bioinformatics Research Center, School of Pharmacy and Pharmaceutical Sciences, Isfahan University of Medical Sciences, Isfahan, Iran
4
Yada S, Nakamura Y, Wakamiya S, Aramaki E. Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop. Methods Inf Med 2024. [PMID: 39209296 DOI: 10.1055/a-2405-2489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
BACKGROUND: Textual datasets (corpora) are crucial for the application of natural language processing (NLP) models. However, corpus creation in the medical field is challenging, primarily because of privacy issues with raw clinical data such as health records. Thus, the existing clinical corpora are generally small and scarce. Medical NLP (MedNLP) methodologies perform well with limited data availability.
OBJECTIVES: We present the outcomes of the Real-MedNLP workshop, which was conducted using limited and parallel medical corpora. Real-MedNLP exhibits three distinct characteristics: (1) Limited annotated documents: the training data comprise only a small set (∼100) of case reports (CRs) and radiology reports (RRs) that have been annotated. (2) Bilingually parallel: the constructed corpora are parallel in Japanese and English. (3) Practical tasks: the workshop addresses fundamental tasks, such as named entity recognition (NER) and applied practical tasks.
METHODS: We propose three tasks: NER of ∼100 available documents (Task 1), NER based only on annotation guidelines for humans (Task 2), and clinical applications (Task 3) consisting of adverse drug effect (ADE) detection for CRs and identical case identification (CI) for RRs.
RESULTS: Nine teams participated in this study. The best systems achieved 0.65 and 0.89 F1-scores for CRs and RRs in Task 1, whereas the top scores in Task 2 decreased by 50 to 70%. In Task 3, ADE reports were detected by up to 0.64 F1-score, and CI scored up to 0.96 binary accuracy.
CONCLUSION: Most systems adopt medical-domain-specific pretrained language models using data augmentation methods. Despite the challenge of limited corpus size in Tasks 1 and 2, recent approaches are promising because the partial match scores reached ∼0.8-0.9 F1-scores. Task 3 applications revealed that the different availabilities of external language resources affected the performance per language.
Affiliation(s)
- Shuntaro Yada
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
- Yuta Nakamura
  - 22nd Century Medical and Research Center, The University of Tokyo Hospital, Tokyo, Japan
- Shoko Wakamiya
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
- Eiji Aramaki
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
5
Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024; 2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]
Abstract
Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
Affiliation(s)
- Cong-Phuoc Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Ben Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Jung-Hsien Chiang
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
6
Huang Y, Li S, Hu W, Shao S, Li Q, Zhang L. Language Model-Assisted Machine Learning, Photoelectrochemical, and First-Principles Investigation of Compatible Solvents for a CH3NH3PbI3 Film in Water. ACS APPLIED MATERIALS & INTERFACES 2024; 16:51595-51607. [PMID: 39283994 DOI: 10.1021/acsami.4c06276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/28/2024]
Abstract
Machine learning and data-driven methods have attracted a significant amount of attention for the acceleration of the design of molecules and materials. In this study, a material design protocol based on multimode modeling that combines literature modeling, numerical data collection, textual descriptor design, genetic modeling, experimental validation, first-principles calculation, and theoretical efficiency calculation is proposed, with a case study on designing compatible complex solvent molecules for a halide perovskite film, which is notorious for optoelectronic deactivation under hostile conditions, especially in water. In the multimode modeling design process, the textual descriptors play the central role and store rich literature scientific knowledge, which starts from the construction of a high-dimension literature model based on scientific articles and is realized by a genetic algorithm for materials predictions. The prediction is substantiated by follow-up experiments and first-principles calculations, leading to the successful identification of effective molecular combinations delivering an unprecedented large aqueous photocurrent (increasing by 3 orders of magnitude compared with that of CH3NH3PbI3) and remarkable aqueous stability (improving from 36% to 89% after immersion in water) under the hostile condition. This study provides a practical route via multimode modeling for accelerating the design of molecule-modified and solution-processed materials in a real scenario.
Affiliation(s)
- Yiru Huang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Shenyue Li
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Wenguang Hu
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Shaofeng Shao
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Qingfang Li
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Lei Zhang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
7
Huang Y, Zhang L, Deng H, Mao J. NJmat: Data-Driven Machine Learning Interface to Accelerate Material Design. J Chem Inf Model 2024; 64:6477-6491. [PMID: 39133673 DOI: 10.1021/acs.jcim.4c00493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Machine learning techniques have significantly transformed the way materials scientists conduct research. However, the widespread deployment of machine learning software in daily experimental and simulation research for materials and chemical design has been limited. This is partly due to the substantial time investment and learning curve associated with mastering the necessary codes and computational environments. In this paper, we introduce a user-friendly, data-driven machine learning interface featuring multiple "button-clicking" functionalities to streamline the design of materials and chemicals. This interface automates the processes of transforming materials and molecules, performing feature selection, constructing machine learning models, making virtual predictions, and visualizing results. Such automation accelerates materials prediction and analysis in the inverse design process, aligning with the time criteria outlined by the Materials Genome Initiative. With simple button clicks, researchers can build machine learning models and predict new materials once they have gathered experimental or simulation data. Beyond the ease of use, NJmat offers three additional features for data-driven materials design: (1) automatic feature generation for both inorganic materials (from chemical formulas) and organic molecules (from SMILES), (2) automatic generation of Shapley plots, and (3) automatic construction of "white-box" genetic models and decision trees to provide scientific insights. We present case studies on surface design for halide perovskite materials encompassing both inorganic and organic species. These case studies illustrate general machine learning models for virtual predictions as well as the automatic featurization and Shapley/genetic model construction capabilities. We anticipate that this software tool will expedite materials and molecular design within the scope of the Materials Genome Initiative, particularly benefiting experimentalists.
Affiliation(s)
- Yiru Huang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Lei Zhang
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Hangyuan Deng
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
- Junfei Mao
  - Department of Materials Physics, School of Chemistry and Materials Science, Nanjing University of Information Science & Technology, Nanjing 210044, China
8
Meijer D, Beniddir MA, Coley CW, Mejri YM, Öztürk M, van der Hooft JJJ, Medema MH, Skiredj A. Empowering natural product science with AI: leveraging multimodal data and knowledge graphs. Nat Prod Rep 2024. [PMID: 39148455 PMCID: PMC11327853 DOI: 10.1039/d4np00008k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Indexed: 08/17/2024]
Abstract
Artificial intelligence (AI) is accelerating how we conduct science, from folding proteins with AlphaFold and summarizing literature findings with large language models, to annotating genomes and prioritizing newly generated molecules for screening using specialized software. However, the application of AI to emulate human cognition in natural product research and its subsequent impact has so far been limited. One reason for this limited impact is that available natural product data is multimodal, unbalanced, unstandardized, and scattered across many data repositories. This makes natural product data challenging to use with existing deep learning architectures that consume fairly standardized, often non-relational, data. It also prevents models from learning overarching patterns in natural product science. In this Viewpoint, we address this challenge and support ongoing initiatives aimed at democratizing natural product data by collating our collective knowledge into a knowledge graph. By doing so, we believe there will be an opportunity to use such a knowledge graph to develop AI models that can truly mimic natural product scientists' decision-making.
Affiliation(s)
- David Meijer
  - Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
- Mehdi A Beniddir
  - Equipe "Chimie des Substances Naturelles", Université Paris-Saclay, CNRS, BioCIS, 17 Avenue des Sciences, 91400 Orsay, France
- Connor W Coley
  - Massachusetts Institute of Technology, Department of Chemical Engineering, USA
- Yassine M Mejri
  - Equipe "Chimie des Substances Naturelles", Université Paris-Saclay, CNRS, BioCIS, 17 Avenue des Sciences, 91400 Orsay, France
  - Université Paris Dauphine, PSL Research University, CNRS, Lamsade, 75016 Paris, France
- Meltem Öztürk
  - Université Paris Dauphine, PSL Research University, CNRS, Lamsade, 75016 Paris, France
- Justin J J van der Hooft
  - Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
- Marnix H Medema
  - Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands
- Adam Skiredj
  - Equipe "Chimie des Substances Naturelles", Université Paris-Saclay, CNRS, BioCIS, 17 Avenue des Sciences, 91400 Orsay, France
9
Bou A, Thomas M, Dittert S, Navarro C, Majewski M, Wang Y, Patel S, Tresadern G, Ahmad M, Moens V, Sherman W, Sciabola S, De Fabritiis G. ACEGEN: Reinforcement Learning of Generative Chemical Agents for Drug Discovery. J Chem Inf Model 2024; 64:5900-5911. [PMID: 39092857 PMCID: PMC11581341 DOI: 10.1021/acs.jcim.4c00895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 07/03/2024] [Accepted: 07/19/2024] [Indexed: 08/04/2024]
Abstract
In recent years, reinforcement learning (RL) has emerged as a valuable tool in drug design, offering the potential to propose and optimize molecules with desired properties. However, striking a balance between capabilities, flexibility, reliability, and efficiency remains challenging due to the complexity of advanced RL algorithms and the significant reliance on specialized code. In this work, we introduce ACEGEN, a comprehensive and streamlined toolkit tailored for generative drug design, built using TorchRL, a modern RL library that offers thoroughly tested reusable components. We validate ACEGEN by benchmarking against other published generative modeling algorithms and show comparable or improved performance. We also show examples of ACEGEN applied in multiple drug discovery case studies. ACEGEN is accessible at https://github.com/acellera/acegen-open and available for use under the MIT license.
Affiliation(s)
- Albert Bou
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), C Dr. Aiguader 88, 08003 Barcelona, Spain
  - Acellera Labs, C Dr. Trueta 183, 08005, Barcelona, Spain
- Morgan Thomas
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), C Dr. Aiguader 88, 08003 Barcelona, Spain
- Sebastian Dittert
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), C Dr. Aiguader 88, 08003 Barcelona, Spain
- Carles Navarro
  - Acellera Labs, C Dr. Trueta 183, 08005, Barcelona, Spain
- Ye Wang
  - Biogen Research and Development, 225 Binney Street, Cambridge, Massachusetts 02142, United States
- Shivam Patel
  - Psivant Therapeutics, 451 D Street, Boston, Massachusetts 02210, United States
- Gary Tresadern
  - In Silico Discovery, Janssen Research & Development, Janssen Pharmaceutica N. V., Turnhoutseweg 30, B-2340 Beerse, Belgium
- Mazen Ahmad
  - In Silico Discovery, Janssen Research & Development, Janssen Pharmaceutica N. V., Turnhoutseweg 30, B-2340 Beerse, Belgium
- Vincent Moens
  - PyTorch Team, Meta, 11-21 Canal Reach, London, N1C 4DB, United Kingdom
- Woody Sherman
  - Psivant Therapeutics, 451 D Street, Boston, Massachusetts 02210, United States
- Simone Sciabola
  - Biogen Research and Development, 225 Binney Street, Cambridge, Massachusetts 02142, United States
- Gianni De Fabritiis
  - Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), C Dr. Aiguader 88, 08003 Barcelona, Spain
  - Acellera Labs, C Dr. Trueta 183, 08005, Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain
10
Chen H, Yoshimori A, Bajorath J. Extension of multi-site analogue series with potent compounds using a bidirectional transformer-based chemical language model. RSC Med Chem 2024; 15:2527-2537. [PMID: 39026633 PMCID: PMC11253848 DOI: 10.1039/d4md00423j] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Accepted: 06/15/2024] [Indexed: 07/20/2024]
Abstract
Generating potent compounds for evolving analogue series (AS) is a key challenge in medicinal chemistry. The versatility of chemical language models (CLMs) makes it possible to formulate this challenge as an off-the-beaten-path prediction task. In this work, we have devised a coding and tokenization scheme for evolving AS with multiple substitution sites (multi-site AS) and implemented a bidirectional transformer to predict new potent analogues for such series. Scientific foundations of this approach are discussed and, as a benchmark, the transformer model is compared to a recurrent neural network (RNN) for the prediction of analogues of AS with single substitution sites. Furthermore, the transformer is shown to successfully predict potent analogues with varying R-group combinations for multi-site AS having activity against many different targets. Prediction of R-group combinations for extending AS with potent compounds represents a novel approach for compound optimization.
Affiliation(s)
- Hengwei Chen
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, University of Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany, +49 228 7369 100
- Atsushi Yoshimori
  - Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-chome, Fujisawa, Kanagawa 251-0012, Japan
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, University of Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany, +49 228 7369 100
  - Lamarr Institute for Machine Learning and Artificial Intelligence, University of Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
11
Vittoria Togo M, Mastrolorito F, Orfino A, Graps EA, Tondo AR, Altomare CD, Ciriaco F, Trisciuzzi D, Nicolotti O, Amoroso N. Where developmental toxicity meets explainable artificial intelligence: state-of-the-art and perspectives. Expert Opin Drug Metab Toxicol 2024; 20:561-577. [PMID: 38141160 DOI: 10.1080/17425255.2023.2298827] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/05/2023] [Accepted: 12/20/2023] [Indexed: 12/24/2023]
Abstract
INTRODUCTION: The application of Artificial Intelligence (AI) to predictive toxicology is rapidly increasing, particularly aiming to develop non-testing methods that effectively address ethical concerns and reduce economic costs. In this context, Developmental Toxicity (Dev Tox) stands as a key human health endpoint, especially significant for safeguarding maternal and child well-being.
AREAS COVERED: This review outlines the existing methods employed in Dev Tox predictions and underscores the benefits of utilizing New Approach Methodologies (NAMs), specifically focusing on eXplainable Artificial Intelligence (XAI), which proves highly efficient in constructing reliable and transparent models aligned with recommendations from international regulatory bodies.
EXPERT OPINION: The limited availability of high-quality data and the absence of dependable Dev Tox methodologies render XAI an appealing avenue for systematically developing interpretable and transparent models, which hold immense potential for both scientific evaluations and regulatory decision-making.
Affiliation(s)
- Maria Vittoria Togo
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Fabrizio Mastrolorito
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Angelica Orfino
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Elisabetta Anna Graps
  - ARESS Puglia - Agenzia Regionale strategica per la Salute ed il Sociale, Presidenza della Regione Puglia, Bari, Italy
- Anna Rita Tondo
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Cosimo Damiano Altomare
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Fulvio Ciriaco
  - Department of Chemistry, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Daniela Trisciuzzi
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Orazio Nicolotti
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
- Nicola Amoroso
  - Department of Pharmacy - Pharmaceutical Sciences, Università degli Studi di Bari "Aldo Moro", Bari, Italy
12
Zhang J, Zhao L, Wang W, Zhang Q, Wang XT, Xing DF, Ren NQ, Lee DJ, Chen C. Large language model for horizontal transfer of resistance gene: From resistance gene prevalence detection to plasmid conjugation rate evaluation. THE SCIENCE OF THE TOTAL ENVIRONMENT 2024; 931:172466. [PMID: 38626826 DOI: 10.1016/j.scitotenv.2024.172466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/17/2024] [Revised: 04/10/2024] [Accepted: 04/11/2024] [Indexed: 05/07/2024]
Abstract
The burgeoning dissemination of plasmid-mediated resistance genes (ARGs) poses a significant threat to environmental integrity. However, the prediction of ARG prevalence is overlooked, especially for emerging ARGs that are potentially evolving gene exchange hotspots. Here, we explored classifying plasmid or chromosome sequences and detecting resistance gene prevalence using DNABERT. Initially, DNABERT fine-tuned on plasmid and chromosome sequences, followed by a multilayer perceptron (MLP) classifier, achieved 0.764 AUC (area under the curve) on external datasets across 23 genera, outperforming a traditional statistic-based model by 0.02 AUC. Furthermore, single-genus models for Escherichia and Pseudomonas were also trained to explore their performance in detecting ARG prevalence. By integrating k-mer frequency attributes, our model improved the prediction of ARG prevalence on an external dataset by 0.0281-0.0615 AUC for Escherichia and by 0.0196-0.0928 AUC for Pseudomonas. Finally, we established a random forest model, drawing on data from existing literature, that forecasts the relative conjugation transfer rate of plasmids with 0.7956 AUC. It identifies the plasmid's repression status, cellular density, and temperature as the most important factors influencing transfer frequency. Combined, these two models provide a useful reference for quick and low-cost integrated evaluation of resistance gene transfer, accelerating computer-assisted quantitative risk assessment of ARG transfer in the environmental field.
Affiliation(s)
- Jiabin Zhang
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China
| | - Lei Zhao
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China
| | - Wei Wang
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China.
| | - Quan Zhang
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China
| | - Xue-Ting Wang
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China
| | - De-Feng Xing
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China
| | - Nan-Qi Ren
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China; Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, China
| | - Duu-Jong Lee
- Department of Mechanical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
| | - Chuan Chen
- State Key Laboratory of Urban Water Resource and Environment, School of Environment, Harbin Institute of Technology, Harbin, Heilongjiang Province 150090, China.
|
13
|
Keto A, Guo T, Underdue M, Stuyver T, Coley CW, Zhang X, Krenske EH, Wiest O. Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels-Alder Reaction Outcomes. J Am Chem Soc 2024; 146:16052-16061. [PMID: 38822795 DOI: 10.1021/jacs.4c03131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/03/2024]
Abstract
The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
Affiliation(s)
- Angus Keto
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Taicheng Guo
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Morgan Underdue
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Thijs Stuyver
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Xiangliang Zhang
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States
| | - Elizabeth H Krenske
- School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
| | - Olaf Wiest
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana 46556, United States
|
14
|
Wang H, Chen B, Sun H, Zhang Y. Carbon-based molecular properties efficiently predicted by deep learning-based quantum chemical simulation with large language models. Comput Biol Med 2024; 176:108531. [PMID: 38728991 DOI: 10.1016/j.compbiomed.2024.108531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2024] [Revised: 04/21/2024] [Accepted: 04/28/2024] [Indexed: 05/12/2024]
Abstract
The prediction of thermodynamic properties of carbon-based molecules based on their geometrical conformation using fluctuation and density functional theories has achieved great success in the field of energy chemistry, while the excessive computational cost provides both opportunities and challenges for the integration of machine learning. In this work, a deep learning-based quantum chemical prediction model was constructed for efficient prediction of thermodynamic properties of carbon-based molecules. We constructed a novel framework - encoding the 3D information into a large language model (LLM), which in turn generates a 2D SMILES string, while embedding a learnable encoding designed to preserve the integrity of the original 3D information, providing better structural information for the model. Additionally, we have designed an equivariant learning module to encompass representations of conformations and feature learning for conformational sampling. This framework aims to predict thermodynamic properties more accurately than learning from 2D topology alone, while providing faster computational speeds than conventional simulations. By combining machine learning and quantum chemistry, we pioneer efficient practical applications in the field of energy chemistry. Our model advances the integration of data-driven and physics-based modeling to unlock novel insights into carbon-based molecules.
Affiliation(s)
- Haoyu Wang
- University of Shanghai for Science and Technology, Shanghai, China; School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China.
| | - Bin Chen
- University of Shanghai for Science and Technology, Shanghai, China; School of Mechanical Engineering, Shanghai Jiao Tong University, Shanghai, China.
| | - Hangling Sun
- Hengtu Imalligent Technology (Shanghai) Co., Ltd., Shanghai, China
| | - Yuxuan Zhang
- University of Shanghai for Science and Technology, Shanghai, China
|
15
|
Li Y, Liu B, Deng J, Guo Y, Du H. Image-based molecular representation learning for drug development: a survey. Brief Bioinform 2024; 25:bbae294. [PMID: 38920347 PMCID: PMC11200195 DOI: 10.1093/bib/bbae294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/12/2024] [Revised: 05/19/2024] [Accepted: 06/08/2024] [Indexed: 06/27/2024] Open
Abstract
Artificial intelligence (AI) powered drug development has received remarkable attention in recent years. It addresses the limitations of traditional experimental methods that are costly and time-consuming. While there have been many surveys attempting to summarize related research, they only focus on general AI or specific aspects such as natural language processing and graph neural networks. Considering the rapid advances in computer vision, using the molecular image to enable AI appears to be a more intuitive and effective approach since each chemical substance has a unique visual representation. In this paper, we provide the first survey on image-based molecular representation for drug development. The survey proposes a taxonomy based on the learning paradigms in computer vision and reviews a large number of corresponding papers, highlighting the contributions of molecular visual representation in drug development. In addition, we discuss the applications, limitations and future directions in the field. We hope this survey could offer valuable insight into the use of image-based molecular representation learning in the context of drug development.
Affiliation(s)
- Yue Li
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Bingyan Liu
- School of Computer Science, Beijing University of Posts and Telecommunications, No.10 Xituchen Street, 100876, Beijing, China
| | - Jinyan Deng
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Yi Guo
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
| | - Hongbo Du
- Division of Gastroenterology, Dongzhimen Hospital, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
- Institute of Liver Disease, Beijing University of Chinese Medicine, No. 5 Haiyun Warehouse, 100700, Beijing, China
|
16
|
Rana D, Pflüger PM, Hölter NP, Tan G, Glorius F. Standardizing Substrate Selection: A Strategy toward Unbiased Evaluation of Reaction Generality. ACS CENTRAL SCIENCE 2024; 10:899-906. [PMID: 38680564 PMCID: PMC11046462 DOI: 10.1021/acscentsci.3c01638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 03/14/2024] [Accepted: 03/18/2024] [Indexed: 05/01/2024]
Abstract
With over 10,000 new reaction protocols arising every year, only a handful of these procedures transition from academia to application. A major reason for this gap stems from the lack of comprehensive knowledge about a reaction's scope, i.e., to which substrates the protocol can or cannot be applied. Even though chemists invest substantial effort to assess the scope of new protocols, the resulting scope tables involve significant biases, reducing their expressiveness. Herein we report a standardized substrate selection strategy designed to mitigate these biases and evaluate the applicability, as well as the limits, of any chemical reaction. Unsupervised learning is utilized to map the chemical space of industrially relevant molecules. Subsequently, potential substrate candidates are projected onto this universal map, enabling the selection of a structurally diverse set of substrates with optimal relevance and coverage. By testing our methodology on different chemical reactions, we were able to demonstrate its effectiveness in finding general reactivity trends by using a few highly representative examples. The developed methodology empowers chemists to showcase the unbiased applicability of novel methodologies, facilitating their practical applications. We hope that this work will trigger interdisciplinary discussions about biases in synthetic chemistry, leading to improved data quality.
Affiliation(s)
- Debanjan Rana
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Philipp M. Pflüger
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Niklas P. Hölter
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Guangying Tan
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
| | - Frank Glorius
- Universität Münster, Organisch-Chemisches Institut, Corrensstraße 36, 48149 Münster, Germany
|
17
|
Hartog PBR, Krüger F, Genheden S, Tetko IV. Using test-time augmentation to investigate explainable AI: inconsistencies between method, model and human intuition. J Cheminform 2024; 16:39. [PMID: 38576047 PMCID: PMC10993590 DOI: 10.1186/s13321-024-00824-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 03/09/2024] [Indexed: 04/06/2024] Open
Abstract
Stakeholders of machine learning models desire explainable artificial intelligence (XAI) to produce human-understandable and consistent interpretations. In computational toxicity, augmentation of text-based molecular representations has been used successfully for transfer learning on downstream tasks. Augmentations of molecular representations can also be used at inference to compare differences between multiple representations of the same ground-truth. In this study, we investigate the robustness of eight XAI methods using test-time augmentation for a molecular-representation model in the field of computational toxicity prediction. We report significant differences between explanations for different representations of the same ground-truth, and show that randomized models have similar variance. We hypothesize that text-based molecular representations in this and past research reflect tokenization more than learned parameters. Furthermore, we see a greater variance between in-domain predictions than out-of-domain predictions, indicating XAI measures something other than learned parameters. Finally, we investigate the relative importance given to expert-derived structural alerts and find similar importance given regardless of applicability domain, randomization and varying training procedures. We therefore caution future research against validating methods using a similar comparison to human intuition without further investigation. SCIENTIFIC CONTRIBUTION: In this research we critically investigate XAI through test-time augmentation, contrasting previous assumptions about using expert validation and showing inconsistencies within models for identical representations. SMILES augmentation has been used to increase model accuracy, but was here adapted from the field of image test-time augmentation to be used as an independent indication of the consistency within SMILES-based molecular representation models.
Affiliation(s)
- Peter B R Hartog
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, 431 83, Mölndal, Sweden.
- Institute of Structural Biology, Helmholtz Munich, Munich, 85764, Germany.
| | - Fabian Krüger
- Institute of Structural Biology, Helmholtz Munich, Munich, 85764, Germany
| | - Samuel Genheden
- Molecular AI, Discovery Sciences, R&D, AstraZeneca, 431 83, Mölndal, Sweden
| | - Igor V Tetko
- Institute of Structural Biology, Helmholtz Munich, Munich, 85764, Germany
|
18
|
Arora S, Chettri S, Percha V, Kumar D, Latwal M. Artifical intelligence: a virtual chemist for natural product drug discovery. J Biomol Struct Dyn 2024; 42:3826-3835. [PMID: 37232451 DOI: 10.1080/07391102.2023.2216295] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 05/12/2023] [Indexed: 05/27/2023]
Abstract
Nature offers a wealth of medicinal substances, and natural products are perceived as privileged structures for interacting with protein drug targets. The structural heterogeneity and unusual characteristics of natural products (NPs) have inspired scientists to work on natural product-inspired medicine. Artificial intelligence (AI) can be geared toward NP drug finding to confront and uncover unexplored opportunities. AI-based, natural product-inspired drug discovery acts as an innovative tool for molecular design and lead discovery. Various machine learning models produce quickly synthesizable mimetics of natural product templates. The invention of novel natural product mimetics by computer-assisted technology provides a feasible strategy for obtaining natural products with defined bioactivities. AI's hit rate makes it highly important for improving trial patterns such as dose selection, trial life span, efficacy parameters, and biomarkers. Along these lines, AI methods can be a successful, targeted tool for formulating advanced medicinal applications of natural products. 'Prediction of the future of natural product-based drug discovery is not magic; actually, it's artificial intelligence.' Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Shefali Arora
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
| | - Sukanya Chettri
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
| | - Versha Percha
- Department of Pharmaceutical Chemistry, Dolphin(PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
| | - Deepak Kumar
- Department of Pharmaceutical Chemistry, Dolphin(PG) Institute of Biomedical and Natural Sciences, Dehradun, Uttarakhand, India
| | - Mamta Latwal
- Department of Chemistry, University of Petroleum and Energy Studies, Dehradun, Uttarakhand, India
|
19
|
Temizer AB, Uludoğan G, Özçelik R, Koulani T, Ozkirimli E, Ulgen KO, Karali N, Özgür A. Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties. Mol Inform 2024; 43:e202300249. [PMID: 38196065 DOI: 10.1002/minf.202300249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 11/13/2023] [Accepted: 01/06/2024] [Indexed: 01/11/2024]
Abstract
Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
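The tf-idf step described in this abstract can be sketched as follows. This is an illustrative reading rather than the authors' implementation: each protein target's high-affinity ligands are assumed to be already tokenized into chemical words and pooled into one "document", and the function name `tfidf_key_words`, the length-normalized term frequency, the natural-log idf, and the toy target names and words are all assumptions.

```python
import math
from collections import Counter


def tfidf_key_words(documents: dict[str, list[str]], top_n: int = 2) -> dict[str, list[str]]:
    """Rank the chemical words of each target's document by tf-idf.

    Hypothetical helper: `documents` maps a target name to the pooled
    chemical words of its high-affinity ligands (tokenization done elsewhere).
    """
    n_docs = len(documents)
    # Document frequency: in how many targets' documents each word occurs.
    doc_freq = Counter(word for words in documents.values() for word in set(words))
    key_words = {}
    for target, words in documents.items():
        term_counts = Counter(words)
        scores = {
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in term_counts.items()
        }
        # Descending score; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        key_words[target] = ranked[:top_n]
    return key_words


# Toy documents with made-up chemical words (SMILES fragments).
toy_documents = {
    "kinase_A": ["c1ccccc1", "C(=O)N", "c1ccccc1", "N1CCNCC1"],
    "protease_B": ["c1ccccc1", "C(=O)O", "C(=O)O", "S(=O)(=O)N"],
}
print(tfidf_key_words(toy_documents))
```

In this toy example the benzene-ring word occurs in both documents, so its idf is zero and it never ranks as a key word, which mirrors the intuition that target-specific moieties should score highest.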
Affiliation(s)
- Asu Busra Temizer
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
- Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
| | - Gökçe Uludoğan
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Rıza Özçelik
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
| | - Taha Koulani
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
- Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
| | - Elif Ozkirimli
- Science and Research Informatics, F. Hoffmann-La Roche Ltd, Basel, Switzerland
| | - Kutlu O Ulgen
- Department of Chemical Engineering, Boğaziçi University, İstanbul, Turkey
| | - Nilgun Karali
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
|
20
|
Ma M, Zhang X, Zhou L, Han Z, Shi Y, Li J, Wu L, Xu Z, Zhu W. D3Rings: A Fast and Accurate Method for Ring System Identification and Deep Generation of Drug-Like Cyclic Compounds. J Chem Inf Model 2024; 64:724-736. [PMID: 38206320 DOI: 10.1021/acs.jcim.3c01657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2024]
Abstract
Continuous exploration of the chemical space of molecules to find ligands with high affinity and specificity for specific targets is an important topic in drug discovery. A focus on cyclic compounds, particularly natural compounds with diverse scaffolds, provides important insights into novel molecular structures for drug design. However, the complexity of their ring structures has hindered the applicability of widely accepted methods and software for the systematic identification and classification of cyclic compounds. Herein, we successfully developed a new method, D3Rings, to identify acyclic, monocyclic, spiro ring, fused and bridged ring, and cage ring compounds, as well as macrocyclic compounds. By using D3Rings, we completed the statistics of cyclic compounds in three different databases, e.g., ChEMBL, DrugBank, and COCONUT. The results demonstrated the richness of ring structures in natural products, especially spiro, macrocycles, and fused and bridged rings. Based on this, three deep generative models, namely, VAE, AAE, and CharRNN, were trained and used to construct two data sets similar to DrugBank and COCONUT but 10 times larger than them. The enlarged data sets were then used to explore the molecular chemical space, focusing on complex ring structures, for novel drug discovery and development. Docking experiments with the newly generated COCONUT-like data set against three SARS-CoV-2 target proteins revealed that an expanded compound database improves molecular docking results. Cyclic structures exhibited the best docking scores among the top-ranked docking molecules. These results suggest the importance of exploring the chemical space of structurally novel cyclic compounds and continuous expansion of the library of drug-like compounds to facilitate the discovery of potent ligands with high binding affinity to specific targets. D3Rings is now freely available at http://www.d3pharma.com/D3Rings/.
Affiliation(s)
- Minfei Ma
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Xinben Zhang
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Liping Zhou
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zijian Han
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yulong Shi
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Jintian Li
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Leyun Wu
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhijian Xu
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Weiliang Zhu
- State Key Laboratory of Drug Research; Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- School of Pharmacy, University of Chinese Academy of Sciences, Beijing 100049, China
|
21
|
Adebar N, Keupp J, Emenike VN, Kühlborn J, Vom Dahl L, Möckel R, Smiatek J. Scientific Deep Machine Learning Concepts for the Prediction of Concentration Profiles and Chemical Reaction Kinetics: Consideration of Reaction Conditions. J Phys Chem A 2024; 128:929-944. [PMID: 38271617 DOI: 10.1021/acs.jpca.3c06265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/27/2024]
Abstract
Emerging concepts from scientific deep machine learning such as physics-informed neural networks (PINNs) enable a data-driven approach for the study of complex kinetic problems. We present an extended framework that combines the advantages of PINNs with the detailed consideration of experimental parameter variations for the simulation and prediction of chemical reaction kinetics. The approach is based on truncated Taylor series expansions for the underlying fundamental equations, whereby the external variations can be interpreted as perturbations of the kinetic parameters. Accordingly, our method allows for an efficient consideration of experimental parameter settings and their influence on the concentration profiles and reaction kinetics. A particular advantage of our approach, in addition to the consideration of univariate and multivariate parameter variations, is the robust model-based exploration of the parameter space to determine optimal reaction conditions in combination with advanced reaction insights. The benefits of this concept are demonstrated for higher-order chemical reactions including catalytic and oscillatory systems in combination with small amounts of training data. All predicted values show a high level of accuracy, demonstrating the broad applicability and flexibility of our approach.
Affiliation(s)
- Niklas Adebar
- Development NCE, Chemical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Julian Keupp
- Development NCE, Chemical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Victor N Emenike
- HP BioP Launch and Innovation, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Jonas Kühlborn
- Development NCE, Chemical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Lisa Vom Dahl
- Development NCE, Analytical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Robert Möckel
- Development NCE, Chemical Development, Boehringer Ingelheim Pharma GmbH & Co. KG, D-55218 Ingelheim (Rhein), Germany
| | - Jens Smiatek
- Institute for Computational Physics, University of Stuttgart, D-70569 Stuttgart, Germany
- Development NCE, Strategy NCEs, Boehringer Ingelheim Pharma GmbH & Co. KG, D-88397 Biberach (Riss), Germany
|
22
|
Gangwal A, Ansari A, Ahmad I, Azad AK, Kumarasamy V, Subramaniyan V, Wong LS. Generative artificial intelligence in drug discovery: basic framework, recent advances, challenges, and opportunities. Front Pharmacol 2024; 15:1331062. [PMID: 38384298 PMCID: PMC10879372 DOI: 10.3389/fphar.2024.1331062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 01/17/2024] [Indexed: 02/23/2024] Open
Abstract
There are two main ways to discover or design small drug molecules. The first involves fine-tuning existing molecules or commercially successful drugs through quantitative structure-activity relationships and virtual screening. The second approach involves generating new molecules through de novo drug design or inverse quantitative structure-activity relationship. Both methods aim to get a drug molecule with the best pharmacokinetic and pharmacodynamic profiles. However, bringing a new drug to market is an expensive and time-consuming endeavor, with the average cost being estimated at around $2.5 billion. One of the biggest challenges is screening the vast number of potential drug candidates to find one that is both safe and effective. The development of artificial intelligence in recent years has been phenomenal, ushering in a revolution in many fields. The field of pharmaceutical sciences has also significantly benefited from multiple applications of artificial intelligence, especially drug discovery projects. Artificial intelligence models are finding use in molecular property prediction, molecule generation, virtual screening, synthesis planning, repurposing, among others. Lately, generative artificial intelligence has gained popularity across domains for its ability to generate entirely new data, such as images, sentences, audios, videos, novel chemical molecules, etc. Generative artificial intelligence has also delivered promising results in drug discovery and development. This review article delves into the fundamentals and framework of various generative artificial intelligence models in the context of drug discovery via de novo drug design approach. Various basic and advanced models have been discussed, along with their recent applications. 
The review also explores recent examples and advances in the generative artificial intelligence approach, as well as the challenges and ongoing efforts to fully harness the potential of generative artificial intelligence in generating novel drug molecules in a faster and more affordable manner. Some clinical-level assets generated from generative artificial intelligence have also been discussed in this review to show the ever-increasing application of artificial intelligence in drug discovery through commercial partnerships.
Affiliation(s)
- Amit Gangwal
- Department of Natural Product Chemistry, Shri Vile Parle Kelavani Mandal’s Institute of Pharmacy, Dhule, Maharashtra, India
| | - Azim Ansari
- Computer Aided Drug Design Center Shri Vile Parle Kelavani Mandal’s Institute of Pharmacy, Dhule, Maharashtra, India
| | - Iqrar Ahmad
- Department of Pharmaceutical Chemistry, Prof. Ravindra Nikam College of Pharmacy, Dhule, India
| | - Abul Kalam Azad
- Faculty of Pharmacy, University College of MAIWP International, Batu Caves, Malaysia
| | - Vinoth Kumarasamy
- Department of Parasitology and Medical Entomology, Faculty of Medicine, Universiti Kebangsaan Malaysia, Cheras, Malaysia
| | - Vetriselvan Subramaniyan
- Pharmacology Unit, Jeffrey Cheah School of Medicine and Health Sciences, Monash University Malaysia, Selangor, Malaysia
- School of Bioengineering and Biosciences, Lovely Professional University, Phagwara, Punjab, India
| | - Ling Shing Wong
- Faculty of Health and Life Sciences, INTI International University, Nilai, Malaysia
|
23
|
Abstract
Smart healthcare has achieved significant progress in recent years. Emerging artificial intelligence (AI) technologies enable various smart applications across various healthcare scenarios. As an essential technology powered by AI, natural language processing (NLP) plays a key role in smart healthcare due to its capability of analysing and understanding human language. In this work, we review existing studies that concern NLP for smart healthcare from the perspectives of technique and application. We first elaborate on different NLP approaches and the NLP pipeline for smart healthcare from the technical point of view. Then, in the context of smart healthcare applications employing NLP techniques, we introduce representative smart healthcare scenarios, including clinical practice, hospital management, personal care, public health, and drug development. We further discuss two specific medical issues, i.e., the coronavirus disease 2019 (COVID-19) pandemic and mental health, in which NLP-driven smart healthcare plays an important role. Finally, we discuss the limitations of current works and identify the directions for future works.
|
24
|
Pérez-Correa I, Giunta PD, Mariño FJ, Francesconi JA. Transformer-Based Representation of Organic Molecules for Potential Modeling of Physicochemical Properties. J Chem Inf Model 2023; 63:7676-7688. [PMID: 38062559 DOI: 10.1021/acs.jcim.3c01548] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/26/2023]
Abstract
In this work, we study the use of three configurations of an autoencoder neural network to process organic substances with the aim of generating meaningful molecular descriptors that can be employed to develop property prediction models. A total of 18,322,500 compounds represented as SMILES strings were used to train the model, demonstrating that a latent space of 24 units is able to adequately reconstruct the data. After AE training, an analysis of the latent space properties in terms of compound similarity was carried out, indicating that this space possesses desired properties for the potential development of models for forecasting physical properties of organic compounds. As a final step, a QSPR model was developed to predict the boiling point of chemical substances based on the AE descriptors. 5276 substances were used for the regression task, and the predictive ability was compared with models available in the literature evaluated on the same database. The final AE model has an overall error of 1.40% (1.39% with augmented SMILES) in the prediction of the boiling temperature, while other models have errors between 2.0 and 3.2%. This shows that the SMILES representation is comparable and even outperforms the state-of-the-art representations widely used in the literature.
Affiliation(s)
- Ignacio Pérez-Correa
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Pablo D Giunta
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Fernando J Mariño
- Instituto de Tecnologías del Hidrógeno y Energías Sostenibles (ITHES), UBA-CONICET, Ciudad Universitaria, Intendente Güiraldes 2160, Ciudad de Buenos Aires C1428EGA, Argentina
| | - Javier A Francesconi
- Centro de Investigación y Desarrollo en Tecnología de Alimentos (CIDTA), UTN-FRRo, Estanislao Zeballos 1341, Rosario S2000BQA, Argentina
|
25
|
Day EC, Chittari SS, Bogen MP, Knight AS. Navigating the Expansive Landscapes of Soft Materials: A User Guide for High-Throughput Workflows. ACS POLYMERS AU 2023; 3:406-427. [PMID: 38107416 PMCID: PMC10722570 DOI: 10.1021/acspolymersau.3c00025] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/15/2023] [Revised: 11/02/2023] [Accepted: 11/07/2023] [Indexed: 12/19/2023]
Abstract
Synthetic polymers are highly customizable with tailored structures and functionality, yet this versatility generates challenges in the design of advanced materials due to the size and complexity of the design space. Thus, exploration and optimization of polymer properties using combinatorial libraries has become increasingly common, which requires careful selection of synthetic strategies, characterization techniques, and rapid processing workflows to obtain fundamental principles from these large data sets. Herein, we provide guidelines for strategic design of macromolecule libraries and workflows to efficiently navigate these high-dimensional design spaces. We describe synthetic methods for multiple library sizes and structures as well as characterization methods to rapidly generate data sets, including tools that can be adapted from biological workflows. We further highlight relevant insights from statistics and machine learning to aid in data featurization, representation, and analysis. This Perspective acts as a "user guide" for researchers interested in leveraging high-throughput screening toward the design of multifunctional polymers and predictive modeling of structure-property relationships in soft materials.
Affiliation(s)
| | | | - Matthew P. Bogen
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Abigail S. Knight
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
|
26
|
Amoroso N, Gambacorta N, Mastrolorito F, Togo MV, Trisciuzzi D, Monaco A, Pantaleo E, Altomare CD, Ciriaco F, Nicolotti O. Making sense of chemical space network shows signs of criticality. Sci Rep 2023; 13:21335. [PMID: 38049451 PMCID: PMC10696027 DOI: 10.1038/s41598-023-48107-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/13/2023] [Accepted: 11/22/2023] [Indexed: 12/06/2023] Open
Abstract
Chemical space modelling has great importance in unveiling and visualising latent information, which is critical in predictive toxicology related to the drug discovery process. While the use of traditional molecular descriptors and fingerprints may suffer from the so-called curse of dimensionality, complex networks are devoid of the typical drawbacks of coordinate-based representations. Herein, we use chemical space networks (CSNs) to analyse the case of developmental toxicity (Dev Tox), which remains a challenging endpoint because of the difficulty of gathering enough reliable data, despite being very important for the protection of maternal and child health. Our study proved that the Dev Tox CSN has a complex non-random organisation and can thus provide a wealth of meaningful information also for predictive purposes. At a phase transition, chemical similarities highlight well-established toxicophores, such as aryl derivatives, mostly neurotoxic hydantoins, barbiturates and amino alcohols, steroids, and volatile organic compounds ether-like chemicals, which are strongly suspected of the Dev Tox onset and can thus be employed as effective alerts for prioritising chemicals before testing.
Affiliation(s)
- Nicola Amoroso
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy.
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, via E. Orabona, 4, 70125, Bari, Italy.
| | - Nicola Gambacorta
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
- Division of Medical Genetics, Fondazione IRCCS-Casa Sollievo della Sofferenza, San Giovanni Rotondo (Foggia), Italy
| | - Fabrizio Mastrolorito
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
| | - Maria Vittoria Togo
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
| | - Daniela Trisciuzzi
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
| | - Alfonso Monaco
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, via E. Orabona, 4, 70125, Bari, Italy
- Dipartimento Interateneo di Fisica "M. Merlin", Università degli studi di Bari Aldo Moro, Via Giovanni Amendola, 173, 70125, Bari, Italy
| | - Ester Pantaleo
- Istituto Nazionale di Fisica Nucleare, Sezione di Bari, via E. Orabona, 4, 70125, Bari, Italy
- Dipartimento Interateneo di Fisica "M. Merlin", Università degli studi di Bari Aldo Moro, Via Giovanni Amendola, 173, 70125, Bari, Italy
| | - Cosimo Damiano Altomare
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
| | - Fulvio Ciriaco
- Dipartimento di Chimica, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy.
| | - Orazio Nicolotti
- Dipartimento di Farmacia - Scienze del Farmaco, Università degli studi di Bari Aldo Moro, via E. Orabona, 4, 70125, Bari, Italy
|
27
|
Lecca P, Lecca M. Graph embedding and geometric deep learning relevance to network biology and structural chemistry. Front Artif Intell 2023; 6:1256352. [PMID: 38035201 PMCID: PMC10687447 DOI: 10.3389/frai.2023.1256352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Accepted: 10/16/2023] [Indexed: 12/02/2023] Open
Abstract
Graphs have been used as a model of complex relationships among data in biological science since the advent of systems biology in the early 2000s. In particular, graph data analysis and graph data mining play an important role in biological interaction networks, where recent artificial intelligence techniques, usually employed in other types of networks (e.g., social, citation, and trademark networks), aim to implement various data mining tasks including classification, clustering, recommendation, anomaly detection, and link prediction. The commitment and efforts of artificial intelligence research in network biology are motivated by the fact that machine learning techniques are often prohibitively computationally demanding, poorly parallelizable, and ultimately inapplicable, since a biological network of realistic size is a large system characterised by a high density of interactions and often by non-linear dynamics and a non-Euclidean latent geometry. Currently, graph embedding is emerging as the new learning paradigm that shifts the task of building complex models for classification, clustering, and link prediction to learning an informative representation of the graph data in a vector space, so that many graph mining and learning tasks can be performed more easily by employing efficient non-iterative traditional models (e.g., a linear support vector machine for the classification task). The great potential of graph embedding is the main reason for the flourishing of studies in this area and, in particular, of artificial intelligence learning techniques. In this mini review, we give a comprehensive summary of the main graph embedding algorithms in light of the recent burgeoning interest in geometric deep learning.
Affiliation(s)
- Paola Lecca
- Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy
| | - Michela Lecca
- Fondazione Bruno Kessler, Digital Industry Center, Technologies of Vision, Trento, Italy
|
28
|
Wu T, Tang Y, Sun Q, Xiong L. Molecular Joint Representation Learning via Multi-Modal Information of SMILES and Graphs. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:3044-3055. [PMID: 37028366 DOI: 10.1109/tcbb.2023.3253862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
In recent years, artificial intelligence has played an important role in accelerating the whole process of drug discovery. Various molecular representation schemes of different modalities (e.g., textual sequence or graph) have been developed. By digitally encoding them, different chemical information can be learned through corresponding network structures. Molecular graphs and the Simplified Molecular Input Line Entry System (SMILES) are currently popular means for molecular representation learning. Previous works have attempted to combine both of them to solve the problem of specific information loss in single-modal representation on various tasks. To further fuse such multi-modal information, the correspondence between learned chemical features from different representations should be considered. To realize this, we propose a novel framework of molecular joint representation learning via Multi-Modal information of SMILES and molecular Graphs, called MMSG. We improve the self-attention mechanism by introducing bond-level graph representation as attention bias in Transformer to reinforce feature correspondence between multi-modal information. We further propose a Bidirectional Message Communication Graph Neural Network (BMC GNN) to strengthen the information flow aggregated from graphs for further combination. Numerous experiments on public property prediction datasets have demonstrated the effectiveness of our model.
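The idea of adding a graph-derived bias to self-attention can be illustrated with a minimal NumPy sketch. This is not the MMSG implementation: the function name `biased_self_attention`, the single-head layout, and the way the bias matrix is built are all assumptions made for illustration. It only shows how an additive pairwise bias (for example, one computed from bond-level features) shifts the attention weights before the softmax.

```python
import numpy as np

def biased_self_attention(x, w_q, w_k, w_v, bias):
    """Single-head self-attention with an additive pairwise bias (illustrative).

    x:    (n_tokens, d_model) token embeddings
    w_*:  (d_model, d_head) projection matrices
    bias: (n_tokens, n_tokens) additive bias, e.g. derived from bond features
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Scaled dot-product scores, shifted by the pairwise bias before softmax.
    scores = q @ k.T / np.sqrt(k.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

A large negative bias entry suppresses attention between a token pair, while a positive entry strengthens it; how the bias should be derived from the molecular graph is left open here.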
|
29
|
Williams AH, Zhan CG. Staying Ahead of the Game: How SARS-CoV-2 has Accelerated the Application of Machine Learning in Pandemic Management. BioDrugs 2023; 37:649-674. [PMID: 37464099 DOI: 10.1007/s40259-023-00611-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/28/2023] [Indexed: 07/20/2023]
Abstract
In recent years, machine learning (ML) techniques have garnered considerable interest for their potential use in accelerating the rate of drug discovery. With the emergence of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, the utilization of ML has become even more crucial in the search for effective antiviral medications. The pandemic has presented the scientific community with a unique challenge, and the rapid identification of potential treatments has become an urgent priority. Researchers have been able to accelerate the process of identifying drug candidates, repurposing existing drugs, and designing new compounds with desirable properties using machine learning in drug discovery. To train predictive models, ML techniques in drug discovery rely on the analysis of large datasets, including both experimental and clinical data. These models can be used to predict the biological activities, potential side effects, and interactions with specific target proteins of drug candidates. This strategy has proven to be an effective method for identifying potential coronavirus disease 2019 (COVID-19) and other disease treatments. This paper offers a thorough analysis of the various ML techniques implemented to combat COVID-19, including supervised and unsupervised learning, deep learning, and natural language processing. The paper discusses the impact of these techniques on pandemic drug development, including the identification of potential treatments, the understanding of the disease mechanism, and the creation of effective and safe therapeutics. The lessons learned can be applied to future outbreaks and drug discovery initiatives.
Affiliation(s)
- Alexander H Williams
- Molecular Modeling and Biopharmaceutical Center, University of Kentucky, 789 South Limestone Street, Lexington, KY, 40536, USA
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Kentucky, 789 South Limestone Street, Lexington, KY, 40536, USA
- GSK Upper Providence, 1250 S. Collegeville Road, Collegeville, PA, 19426, USA
| | - Chang-Guo Zhan
- Molecular Modeling and Biopharmaceutical Center, University of Kentucky, 789 South Limestone Street, Lexington, KY, 40536, USA.
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Kentucky, 789 South Limestone Street, Lexington, KY, 40536, USA.
|
30
|
Grisoni F. Chemical language models for de novo drug design: Challenges and opportunities. Curr Opin Struct Biol 2023; 79:102527. [PMID: 36738564 DOI: 10.1016/j.sbi.2023.102527] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 10/21/2022] [Revised: 12/07/2022] [Accepted: 12/20/2022] [Indexed: 02/05/2023]
Abstract
Generative deep learning is accelerating de novo drug design, by allowing the generation of molecules with desired properties on demand. Chemical language models - which generate new molecules in the form of strings using deep learning - have been particularly successful in this endeavour. Thanks to advances in natural language processing methods and interdisciplinary collaborations, chemical language models are expected to become increasingly relevant in drug discovery. This minireview provides an overview of the current state-of-the-art of chemical language models for de novo design, and analyses current limitations, challenges, and advantages. Finally, a perspective on future opportunities is provided.
Affiliation(s)
- Francesca Grisoni
- Eindhoven University of Technology, Institute for Complex Molecular Systems and Dept. Biomedical Engineering, Eindhoven, Netherlands; Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Netherlands.
|
31
|
Schoenmaker L, Béquignon OJM, Jespers W, van Westen GJP. UnCorrupt SMILES: a novel approach to de novo design. J Cheminform 2023; 15:22. [PMID: 36788579 PMCID: PMC9926805 DOI: 10.1186/s13321-023-00696-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/13/2022] [Accepted: 02/06/2023] [Indexed: 02/16/2023] Open
Abstract
Generative deep learning models have emerged as a powerful approach for de novo drug design as they aid researchers in finding new molecules with desired properties. Despite continuous improvements in the field, a subset of the outputs that sequence-based de novo generators produce cannot be progressed due to errors. Here, we propose to fix these invalid outputs post hoc. In similar tasks, transformer models from the field of natural language processing have been shown to be very effective. Therefore, here this type of model was trained to translate invalid Simplified Molecular-Input Line-Entry System (SMILES) into valid representations. The performance of this SMILES corrector was evaluated on four representative methods of de novo generation: a recurrent neural network (RNN), a target-directed RNN, a generative adversarial network (GAN), and a variational autoencoder (VAE). This study has found that the percentage of invalid outputs from these specific generative models ranges between 4 and 89%, with different models having different error-type distributions. Post hoc correction of SMILES was shown to increase model validity. The SMILES corrector trained with one error per input alters 60-90% of invalid generator outputs and fixes 35-80% of them. However, a higher error detection and performance was obtained for transformer models trained with multiple errors per input. In this case, the best model was able to correct 60-95% of invalid generator outputs. Further analysis showed that these fixed molecules are comparable to the correct molecules from the de novo generators based on novelty and similarity. Additionally, the SMILES corrector can be used to expand the amount of interesting new molecules within the targeted chemical space. Introducing different errors into existing molecules yields novel analogs with a uniqueness of 39% and a novelty of approximately 20%. 
The results of this research demonstrate that SMILES correction is a viable post hoc extension and can enhance the search for better drug candidates.
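The notion of introducing errors into SMILES strings and detecting invalid outputs can be sketched in plain Python. The helpers below (`delete_char`, `looks_balanced`) are hypothetical names, not part of the published SMILES corrector, and the check is only a crude syntactic test (balanced parentheses, paired single-digit ring closures). It ignores bracket atoms, `%nn` ring closures, and valence rules, so a cheminformatics toolkit would be needed for a real validity check.

```python
def delete_char(smiles, index):
    """Corrupt a SMILES string by deleting the character at `index`."""
    return smiles[:index] + smiles[index + 1:]

def looks_balanced(smiles):
    """Crude syntax check: parentheses balanced, ring-closure digits paired.

    Assumption: digits only appear as single-digit ring closures, which is
    not true for bracket atoms such as [13C] or for %nn closures.
    """
    depth = 0
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}  # toggle: first digit opens, second closes
    return depth == 0 and not open_rings

benzoic_acid = "c1ccccc1C(=O)O"
corrupted = delete_char(benzoic_acid, 1)  # drops the first ring-closure digit
```

In this toy case the original string passes the check and the corrupted one fails it, which is the kind of input-output pair a correction model could be trained on.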
Affiliation(s)
- Linde Schoenmaker
- Computational Drug Discovery, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden, The Netherlands (GRID: grid.5132.5; ISNI: 0000 0001 2312 1970)
| | - Olivier J. M. Béquignon
- Computational Drug Discovery, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden, The Netherlands (GRID: grid.5132.5; ISNI: 0000 0001 2312 1970)
| | - Willem Jespers
- Computational Drug Discovery, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden, The Netherlands (GRID: grid.5132.5; ISNI: 0000 0001 2312 1970)
| | - Gerard J. P. van Westen
- Computational Drug Discovery, Drug Discovery and Safety, Leiden Academic Centre for Drug Research, Einsteinweg 55, Leiden, The Netherlands (GRID: grid.5132.5; ISNI: 0000 0001 2312 1970)
|
32
|
Vemula D, Jayasurya P, Sushmitha V, Kumar YN, Bhandari V. CADD, AI and ML in drug discovery: A comprehensive review. Eur J Pharm Sci 2023; 181:106324. [PMID: 36347444 DOI: 10.1016/j.ejps.2022.106324] [Citation(s) in RCA: 47] [Impact Index Per Article: 23.5] [Received: 08/24/2022] [Revised: 10/26/2022] [Accepted: 11/03/2022] [Indexed: 11/06/2022]
Abstract
Computer-aided drug design (CADD) is an emerging field that has drawn a lot of interest because of its potential to expedite and lower the cost of the drug development process. Drug discovery research is expensive and time-consuming, and it frequently takes 10-15 years for a drug to become commercially available. CADD has significantly impacted this area of research. Further, the combination of CADD with Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) technologies to handle enormous amounts of biological data has reduced the time and cost associated with the drug development process. This review will discuss how CADD, AI, ML, and DL approaches help identify drug candidates and support various other steps of the drug discovery process. It will also provide a detailed overview of the different in silico tools used and how these approaches interact.
Affiliation(s)
- Divya Vemula
- National Institute of Pharmaceutical Education and Research- Hyderabad, India
| | - Perka Jayasurya
- National Institute of Pharmaceutical Education and Research- Hyderabad, India
| | - Varthiya Sushmitha
- National Institute of Pharmaceutical Education and Research- Hyderabad, India
| | | | - Vasundhra Bhandari
- National Institute of Pharmaceutical Education and Research- Hyderabad, India.
|
33
|
Exploring Deep Learning for Metalloporphyrins: Databases, Molecular Representations, and Model Architectures. Catalysts 2022. [DOI: 10.3390/catal12111485] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022] Open
Abstract
Metalloporphyrins have been studied as biomimetic catalysts for more than 120 years and have accumulated a large amount of data, which provides a solid foundation for deep learning to discover chemical trends and structure–function relationships. In this study, key components of deep learning of metalloporphyrins, including databases, molecular representations, and model architectures, were systematically investigated. A protocol to construct canonical SMILES for metalloporphyrins was proposed, which was then used to represent the two-dimensional structures of over 10,000 metalloporphyrins in an existing computational database. Subsequently, several state-of-the-art chemical deep learning models, including graph neural network-based models and natural language processing-based models, were employed to predict the energy gaps of metalloporphyrins. Two models showed satisfactory predictive performance (R² 0.94) with canonical SMILES as the only source of structural information. In addition, an unsupervised visualization algorithm was used to interpret the molecular features learned by the deep learning models.
|
34
|
Tang Q, Nie F, Zhao Q, Chen W. A merged molecular representation deep learning method for blood-brain barrier permeability prediction. Brief Bioinform 2022; 23:6674486. [PMID: 36002937 DOI: 10.1093/bib/bbac357] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 06/04/2022] [Revised: 07/27/2022] [Accepted: 07/30/2022] [Indexed: 12/30/2022] Open
Abstract
The ability of a compound to permeate across the blood-brain barrier (BBB) is a significant factor for central nervous system drug development. Thus, for speeding up the drug discovery process, it is crucial to perform high-throughput screenings to predict the BBB permeability of the candidate compounds. Although experimental methods are capable of determining BBB permeability, they are still cost-ineffective and time-consuming. To complement the shortcomings of existing methods, we present a deep learning-based multi-model framework model, called Deep-B3, to predict the BBB permeability of candidate compounds. In Deep-B3, the samples are encoded in three kinds of features, namely molecular descriptors and fingerprints, molecular graph and simplified molecular input line entry system (SMILES) text notation. The pre-trained models were built to extract latent features from the molecular graph and SMILES. These features depicted the compounds in terms of tabular data, image and text, respectively. The validation results yielded from the independent dataset demonstrated that the performance of Deep-B3 is superior to that of the state-of-the-art models. Hence, Deep-B3 holds the potential to become a useful tool for drug development. A freely available online web-server for Deep-B3 was established at http://cbcb.cdutcm.edu.cn/deepb3/, and the source code and dataset of Deep-B3 are available at https://github.com/GreatChenLab/Deep-B3.
Affiliation(s)
- Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medical Science, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medical Science, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.,School of Public Health, North China University of Science and Technology, Tangshan 063210, China
|
35
|
Lim S, Lee S, Piao Y, Choi M, Bang D, Gu J, Kim S. On modeling and utilizing chemical compound information with deep learning technologies: A task-oriented approach. Comput Struct Biotechnol J 2022; 20:4288-4304. [PMID: 36051875 PMCID: PMC9399946 DOI: 10.1016/j.csbj.2022.07.049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2022] [Revised: 07/29/2022] [Accepted: 07/29/2022] [Indexed: 11/22/2022] Open
Abstract
A large number of chemical compounds are available in databases such as PubChem and ZINC. However, currently known compounds, though numerous, represent only a fraction of possible compounds, the full set of which is known as chemical space. Many of the compounds in the databases are annotated with properties and assay data that can be used for drug discovery efforts. For this goal, a number of machine learning algorithms have been developed, and recent deep learning technologies can be effectively used to navigate chemical space, especially for unknown chemical compounds, in terms of drug-related tasks. In this article, we survey how deep learning technologies can model and utilize chemical compound information in a task-oriented way by exploiting annotated properties and assay data in chemical compound databases. We first compile the kinds of tasks that machine learning methods aim to accomplish. Then, we survey deep learning technologies to show their modeling power and current applications for accomplishing drug-related tasks. Next, we survey deep learning techniques that address the insufficiency of annotated data for more effective navigation of chemical space. Chemical compound information alone may not be powerful enough for drug-related tasks; thus, we survey what kinds of information, such as assay and gene expression data, can be used to improve the prediction power of deep learning models. Finally, we conclude this survey with four important newly developed technologies that are yet to be fully incorporated into computational analysis of chemical information.
Affiliation(s)
- Sangsoo Lim
- Bioinformatics Institute, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - Sangseon Lee
- Institute of Computer Technology, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - Yinhua Piao
- Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - MinGyu Choi
- Department of Chemistry, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - Dongmin Bang
- Interdisciplinary Program in Bioinformatics, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - Jeonghyeon Gu
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
| | - Sun Kim
- Department of Computer Science and Engineering, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- Interdisciplinary Program in Artificial Intelligence, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
- MOGAM Institute for Biomedical Research, Yong-in 16924, South Korea
- AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, Seoul 08826, South Korea
|
36
|
Yoshimori A, Bajorath J. DeepAS - Chemical language model for the extension of active analogue series. Bioorg Med Chem 2022; 66:116808. [PMID: 35567984 DOI: 10.1016/j.bmc.2022.116808] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/23/2022] [Revised: 04/28/2022] [Accepted: 05/04/2022] [Indexed: 11/30/2022]
Abstract
In medicinal chemistry, hit-to-lead and lead optimization efforts produce analogue series (ASs), the analysis of which is of central relevance for the exploration and exploitation of structure-activity relationships (SARs) and generation of candidate compounds. The key question in any chemical optimization effort is which analogue(s) to generate next, for which computational support is typically provided through QSAR analysis and compound potency predictions. In this study, we introduce a new chemical language model for analogue design via deep learning. For this purpose, ASs comprising active compounds are ordered according to increasing potency and the chemical language model predicts preferred R-groups for new analogues on the basis of ordered R-group sequences. Hence, consistent with the principles of deep models for natural language processing, analogues with new R-groups are predicted based upon conditional probabilities taking preceding groups into account. This implicitly accounts for the potency gradient captured by an AS and detectable SAR trends, providing a new concept for analogue design. Herein, we report the AS-based chemical language model, its initial evaluation, and exemplary applications.
Affiliation(s)
- Atsushi Yoshimori
- Institute for Theoretical Medicine, Inc., 26-1 Muraoka-Higashi 2-chome, Fujisawa, Kanagawa 251-0012, Japan
| | - Jürgen Bajorath
- Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany.
|
37
|
Moshawih S, Goh HP, Kifli N, Idris AC, Yassin H, Kotra V, Goh KW, Liew KB, Ming LC. Synergy between machine learning and natural products cheminformatics: Application to the lead discovery of anthraquinone derivatives. Chem Biol Drug Des 2022; 100:185-217. [PMID: 35490393 DOI: 10.1111/cbdd.14062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2022] [Revised: 04/15/2022] [Accepted: 04/23/2022] [Indexed: 11/28/2022]
Abstract
Cheminformatics utilizing machine learning (ML) techniques have opened up a new horizon in drug discovery. This is owing to vast chemical space expansion with rocketing numbers of expected hits and lead compounds that match druggable macromolecular targets, in particular from natural compounds. Due to the natural products' (NP) structural complexity, uniqueness, and diversity, they could occupy a bigger space in pharmaceuticals, allowing the industry to pursue more selective leads in the nanomolar range of binding affinity. ML is an essential part of each step of the drug design pipeline, such as target prediction, compound library preparation, and lead optimization. Notably, molecular mechanic and dynamic simulations, induced docking, and free energy perturbations are essential in predicting best binding poses, binding free energy values, and molecular mechanics force fields. Those applications have leveraged from artificial intelligence (AI), which decreases the computational costs required for such costly simulations. This review aimed to describe chemical space and compound libraries related to NPs. High-throughput screening utilized for fractionating NPs and high-throughput virtual screening and their strategies, and significance, are reviewed. Particular emphasis was given to AI approaches, ML tools, algorithms, and techniques, especially in drug discovery of macrocyclic compounds and approaches in computer-aided and ML-based drug discovery. Anthraquinone derivatives were discussed as a source of new lead compounds that can be developed using ML tools for diverse medicinal uses such as cancer, infectious diseases, and metabolic disorders. Furthermore, the power of principal component analysis in understanding relevant protein conformations, and molecular modeling of protein-ligand interaction were also presented. 
Apart from being a concise reference for cheminformatics, this review is a useful text to understand the application of ML-based algorithms to molecular dynamics simulation and in silico absorption, distribution, metabolism, excretion, and toxicity prediction.
Affiliation(s)
- Said Moshawih
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Hui Poh Goh
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Nurolaini Kifli
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Azam Che Idris
- Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Hayati Yassin
- Faculty of Integrated Technologies, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
| | - Vijay Kotra
- Faculty of Pharmacy, Quest International University, Perak, Malaysia
| | - Khang Wen Goh
- Faculty of Data Science and Information Technology, INTI International University, Nilai, Malaysia
| | - Kai Bin Liew
- Faculty of Pharmacy, University of Cyberjaya, Cyberjaya, Malaysia
| | - Long Chiau Ming
- PAP Rashidah Sa'adatul Bolkiah Institute of Health Sciences, Universiti Brunei Darussalam, Gadong, Brunei Darussalam
|
38
|
Bhatnagar R, Sardar S, Beheshti M, Podichetty JT. How can natural language processing help model informed drug development?: a review. JAMIA Open 2022; 5:ooac043. [PMID: 35702625 PMCID: PMC9188322 DOI: 10.1093/jamiaopen/ooac043] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 03/23/2022] [Revised: 04/28/2022] [Accepted: 05/26/2022] [Indexed: 01/20/2023]
Abstract
Objective To summarize applications of natural language processing (NLP) in model informed drug development (MIDD) and identify potential areas of improvement. Materials and Methods Publications found on PubMed and Google Scholar, websites and GitHub repositories for NLP libraries and models. Publications describing applications of NLP in MIDD were reviewed. The applications were stratified into 3 stages: drug discovery, clinical trials, and pharmacovigilance. Key NLP functionalities used for these applications were assessed. Programming libraries and open-source resources for the implementation of NLP functionalities in MIDD were identified. Results NLP has been utilized to aid various processes in drug development lifecycle such as gene-disease mapping, biomarker discovery, patient-trial matching, adverse drug events detection, etc. These applications commonly use NLP functionalities of named entity recognition, word embeddings, entity resolution, assertion status detection, relation extraction, and topic modeling. The current state-of-the-art for implementing these functionalities in MIDD applications are transformer models that utilize transfer learning for enhanced performance. Various libraries in python, R, and Java like huggingface, sparkNLP, and KoRpus as well as open-source platforms such as DisGeNet, DeepEnroll, and Transmol have enabled convenient implementation of NLP models to MIDD applications. Discussion Challenges such as reproducibility, explainability, fairness, limited data, limited language-support, and security need to be overcome to ensure wider adoption of NLP in MIDD landscape. There are opportunities to improve the performance of existing models and expand the use of NLP in newer areas of MIDD. Conclusions This review provides an overview of the potential and pitfalls of current NLP approaches in MIDD.
Affiliation(s)
- Roopal Bhatnagar
- Data Science, Data Collaboration Center, Critical Path Institute, Tucson, Arizona, USA
| | - Sakshi Sardar
- Quantitative Medicine, Critical Path Institute, Tucson, Arizona, USA
| | - Maedeh Beheshti
- Quantitative Medicine, Critical Path Institute, Tucson, Arizona, USA
|
39
|
Shi W, Singha M, Srivastava G, Pu L, Ramanujam J, Brylinski M. Pocket2Drug: An Encoder-Decoder Deep Neural Network for the Target-Based Drug Design. Front Pharmacol 2022; 13:837715. [PMID: 35359869 PMCID: PMC8962739 DOI: 10.3389/fphar.2022.837715] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/17/2021] [Accepted: 02/10/2022] [Indexed: 11/13/2022]
Abstract
Computational modeling is an essential component of modern drug discovery. One of its most important applications is to select promising drug candidates for pharmacologically relevant target proteins. Because of continuing advances in structural biology, putative binding sites for small organic molecules are being discovered in numerous proteins linked to various diseases. These valuable data offer new opportunities to build efficient computational models predicting binding molecules for target sites through the application of data mining and machine learning. In particular, deep neural networks are powerful techniques capable of learning from complex data in order to make informed drug binding predictions. In this communication, we describe Pocket2Drug, a deep graph neural network model to predict binding molecules for a given ligand binding site. This approach first learns the conditional probability distribution of small molecules from a large dataset of pocket structures with supervised training, followed by the sampling of drug candidates from the trained model. Comprehensive benchmarking simulations show that using Pocket2Drug significantly improves the chances of finding molecules binding to target pockets compared to traditional drug selection procedures. Specifically, known binders are generated for as many as 80.5% of targets present in the testing set consisting of dissimilar data from that used to train the deep graph neural network model. Overall, Pocket2Drug is a promising computational approach to inform the discovery of novel biopharmaceuticals.
Affiliation(s)
- Wentao Shi
- Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, United States
| | - Manali Singha
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States
| | - Gopal Srivastava
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States
| | - Limeng Pu
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, United States
| | - J. Ramanujam
- Division of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA, United States
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, United States
| | - Michal Brylinski
- Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, United States
- Center for Computation and Technology, Louisiana State University, Baton Rouge, LA, United States
- *Correspondence: Michal Brylinski,
|
40
|
Saldívar-González FI, Aldas-Bulos VD, Medina-Franco JL, Plisson F. Natural product drug discovery in the artificial intelligence era. Chem Sci 2022; 13:1526-1546. [PMID: 35282622 PMCID: PMC8827052 DOI: 10.1039/d1sc04471k] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 08/13/2021] [Accepted: 12/10/2021] [Indexed: 12/19/2022]
Abstract
Natural products (NPs) are primarily recognized as privileged structures to interact with protein drug targets. Their unique characteristics and structural diversity continue to marvel scientists for developing NP-inspired medicines, even though the pharmaceutical industry has largely given up. High-performance computer hardware, extensive storage, accessible software and affordable online education have democratized the use of artificial intelligence (AI) in many sectors and research areas. The last decades have introduced natural language processing and machine learning algorithms, two subfields of AI, to tackle NP drug discovery challenges and open up opportunities. In this article, we review and discuss the rational applications of AI approaches developed to assist in discovering bioactive NPs and capturing the molecular "patterns" of these privileged structures for combinatorial design or target selectivity.
Affiliation(s)
- F I Saldívar-González
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - V D Aldas-Bulos
- Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
| | - J L Medina-Franco
- DIFACQUIM Research Group, School of Chemistry, Department of Pharmacy, Universidad Nacional Autónoma de México Avenida Universidad 3000 04510 Mexico Mexico
| | - F Plisson
- CONACYT - Unidad de Genómica Avanzada, Laboratorio Nacional de Genómica para la Biodiversidad (Langebio), Centro de Investigación y de Estudios Avanzados del IPN Irapuato Guanajuato Mexico
|
41
|
Xu J, Zhang Y, Han J, Su A, Qiao H, Zhang C, Tang J, Shen X, Sun B, Yu W, Zhai S, Wang X, Wu Y, Su W, Duan H. Providing direction for mechanistic inferences in radical cascade cyclization using Transformer model. Org Chem Front 2022. [DOI: 10.1039/d2qo00188h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Even in modern organic chemistry, predicting or proposing a reaction mechanism and speculating on reaction intermediates remains challenging. For example, it is challenging to predict the regioselectivity of radical attraction...
|
42
|
Yin Z, Wong STC. Artificial intelligence unifies knowledge and actions in drug repositioning. Emerg Top Life Sci 2021; 5:803-813. [PMID: 34881780 PMCID: PMC8923082 DOI: 10.1042/etls20210223] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/11/2021] [Revised: 11/08/2021] [Accepted: 11/09/2021] [Indexed: 11/17/2022]
Abstract
Drug repositioning aims to reuse existing drugs, shelved drugs, or drug candidates that failed clinical trials for other medical indications. Its attraction is sprung from the reduction in risk associated with safety testing of new medications and the time to get a known drug into the clinics. Artificial Intelligence (AI) has been recently pursued to speed up drug repositioning and discovery. The essence of AI in drug repositioning is to unify the knowledge and actions, i.e. incorporating real-world and experimental data to map out the best way forward to identify effective therapeutics against a disease. In this review, we share positive expectations for the evolution of AI and drug repositioning and summarize the role of AI in several methods of drug repositioning.
Affiliation(s)
- Zheng Yin
- Department of Systems Medicine and Bioengineering, Houston Methodist Cancer Center and Ting Tsung & Wei Fong Chao Center for BRAIN, Houston Methodist Research Institute, Weill Cornell Medicine, Houston, TX 77030, U.S.A
| | - Stephen T C Wong
- Department of Systems Medicine and Bioengineering, Houston Methodist Cancer Center and Ting Tsung & Wei Fong Chao Center for BRAIN, Houston Methodist Research Institute, Weill Cornell Medicine, Houston, TX 77030, U.S.A
|
43
|
|
44
|
Krishnan SR, Bung N, Vangala SR, Srinivasan R, Bulusu G, Roy A. De Novo Structure-Based Drug Design Using Deep Learning. J Chem Inf Model 2021; 62:5100-5109. [PMID: 34792338 DOI: 10.1021/acs.jcim.1c01319] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/30/2022]
Abstract
In recent years, deep learning-based methods have emerged as promising tools for de novo drug design. Most of these methods are ligand-based, where an initial target-specific ligand data set is necessary to design potent molecules with optimized properties. Although there have been attempts to develop alternative ways to design target-specific ligand data sets, availability of such data sets remains a challenge while designing molecules against novel target proteins. In this work, we propose a deep learning-based method, where the knowledge of the active site structure of the target protein is sufficient to design new molecules. First, a graph attention model was used to learn the structure and features of the amino acids in the active site of proteins that are experimentally known to form protein-ligand complexes. Next, the learned active site features were used along with a pretrained generative model for conditional generation of new molecules. A bioactivity prediction model was then used in a reinforcement learning framework to optimize the conditional generative model. We validated our method against two well-studied proteins, Janus kinase 2 (JAK2) and dopamine receptor D2 (DRD2), where we produce molecules similar to the known inhibitors. The graph attention model could identify the probable key active site residues, which influenced the conditional molecule generator to design new molecules with pharmacophoric features similar to the known inhibitors.
Affiliation(s)
| | - Navneet Bung
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad 500081, India
| | - Sarveswara Rao Vangala
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad 500081, India
| | - Rajgopal Srinivasan
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad 500081, India
| | - Gopalakrishnan Bulusu
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad 500081, India
| | - Arijit Roy
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Hyderabad 500081, India
|
45
|
Deng J, Yang Z, Ojima I, Samaras D, Wang F. Artificial intelligence in drug discovery: applications and techniques. Brief Bioinform 2021; 23:6420092. [PMID: 34734228 DOI: 10.1093/bib/bbab430] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 08/02/2021] [Revised: 08/02/2021] [Accepted: 09/18/2021] [Indexed: 12/23/2022]
Abstract
Artificial intelligence (AI) has been transforming the practice of drug discovery in the past decade. Various AI techniques have been used in many drug discovery applications, such as virtual screening and drug design. In this survey, we first give an overview on drug discovery and discuss related applications, which can be reduced to two major tasks, i.e. molecular property prediction and molecule generation. We then present common data resources, molecule representations and benchmark platforms. As a major part of the survey, AI techniques are dissected into model architectures and learning paradigms. To reflect the technical development of AI in drug discovery over the years, the surveyed works are organized chronologically. We expect that this survey provides a comprehensive review on AI in drug discovery. We also provide a GitHub repository with a collection of papers (and codes, if applicable) as a learning resource, which is regularly updated.
Affiliation(s)
- Jianyuan Deng
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11790, USA
| | - Zhibo Yang
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11790, USA
| | - Iwao Ojima
- Department of Chemistry, Stony Brook University, Stony Brook, NY 11790, USA
| | - Dimitris Samaras
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11790, USA
| | - Fusheng Wang
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY 11790, USA
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11790, USA
|
46
|
Williams W, Zeng L, Gensch T, Sigman MS, Doyle AG, Anslyn EV. The Evolution of Data-Driven Modeling in Organic Chemistry. ACS CENTRAL SCIENCE 2021; 7:1622-1637. [PMID: 34729406 PMCID: PMC8554870 DOI: 10.1021/acscentsci.1c00535] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Received: 05/03/2021] [Indexed: 05/14/2023]
Abstract
Organic chemistry is replete with complex relationships: for example, how a reactant's structure relates to the resulting product formed; how reaction conditions relate to yield; how a catalyst's structure relates to enantioselectivity. Questions like these are at the foundation of understanding reactivity and developing novel and improved reactions. An approach to probing these questions that is both longstanding and contemporary is data-driven modeling. Here, we provide a synopsis of the history of data-driven modeling in organic chemistry and the terms used to describe these endeavors. We include a timeline of the steps that led to its current state. The case studies included highlight how, as a community, we have advanced physical organic chemistry tools with the aid of computers and data to augment the intuition of expert chemists and to facilitate the prediction of structure-activity and structure-property relationships.
Affiliation(s)
- Wendy L. Williams
- Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Lingyu Zeng
- Department of Chemistry, The University of Texas at Austin, Austin, Texas 78712, United States
| | - Tobias Gensch
- Department of Chemistry, TU Berlin, Straße des 17. Juni 135, Sekr. C2, 10623 Berlin, Germany
| | - Matthew S. Sigman
- Department of Chemistry, University of Utah, Salt Lake City, Utah 84112, United States
| | - Abigail G. Doyle
- Department of Chemistry and Biochemistry, University of California, Los Angeles, California 90095, United States
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Eric V. Anslyn
- Department of Chemistry, The University of Texas at Austin, Austin, Texas 78712, United States
|
47
|
Zhang S. Language Processing Model Construction and Simulation Based on Hybrid CNN and LSTM. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2021; 2021:2578422. [PMID: 34306049 PMCID: PMC8279871 DOI: 10.1155/2021/2578422] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/09/2021] [Accepted: 06/29/2021] [Indexed: 11/20/2022]
Abstract
Deep learning is the latest trend of machine learning and artificial intelligence research. As a new field with rapid development over the past decade, it has attracted more and more researchers' attention. Convolutional Neural Network (CNN) model is one of the most important classical structures in deep learning models, and its performance has been gradually improved in deep learning tasks in recent years. Convolutional neural networks have been widely used in image classification, target detection, semantic segmentation, and natural language processing because they can automatically learn the feature representation of sample data. Firstly, this paper analyzes the model structure of a typical convolutional neural network model to increase the network depth and width in order to improve its performance, analyzes the network structure that further improves the model performance by using the attention mechanism, and then summarizes and analyzes the current special model structure. In order to further improve the text language processing effect, a convolutional neural network model, Hybrid convolutional neural network (CNN), and Long Short-Term Memory (LSTM) based on the fusion of text features and language knowledge are proposed. The text features and language knowledge are integrated into the language processing model, and the accuracy of the text language processing model is improved by parameter optimization. Experimental results on data sets show that the accuracy of the proposed model reaches 93.0%, which is better than the reference model in the literature.
Affiliation(s)
- Shujing Zhang
- Faculty of International Studies, Henan Normal University, Xinxiang, Henan 453000, China
|
48
|
Bitterman DS, Miller TA, Mak RH, Savova GK. Clinical Natural Language Processing for Radiation Oncology: A Review and Practical Primer. Int J Radiat Oncol Biol Phys 2021; 110:641-655. [PMID: 33545300 DOI: 10.1016/j.ijrobp.2021.01.044] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 06/25/2020] [Revised: 12/22/2020] [Accepted: 01/23/2021] [Indexed: 02/07/2023]
Abstract
Natural language processing (NLP), which aims to convert human language into expressions that can be analyzed by computers, is one of the most rapidly developing and widely used technologies in the field of artificial intelligence. Natural language processing algorithms convert unstructured free text data into structured data that can be extracted and analyzed at scale. In medicine, this unlocking of the rich, expressive data within clinical free text in electronic medical records will help untap the full potential of big data for research and clinical purposes. Recent major NLP algorithmic advances have significantly improved the performance of these algorithms, leading to a surge in academic and industry interest in developing tools to automate information extraction and phenotyping from clinical texts. Thus, these technologies are poised to transform medical research and alter clinical practices in the future. Radiation oncology stands to benefit from NLP algorithms if they are appropriately developed and deployed, as they may enable advances such as automated inclusion of radiation therapy details into cancer registries, discovery of novel insights about cancer care, and improved patient data curation and presentation at the point of care. However, challenges remain before the full value of NLP is realized, such as the plethora of jargon specific to radiation oncology, nonstandard nomenclature, a lack of publicly available labeled data for model development, and interoperability limitations between radiation oncology data silos. Successful development and implementation of high quality and high value NLP models for radiation oncology will require close collaboration between computer scientists and the radiation oncology community. 
Here, we present a primer on artificial intelligence algorithms in general and NLP algorithms in particular; provide guidance on how to assess the performance of such algorithms; review prior research on NLP algorithms for oncology; and describe future avenues for NLP in radiation oncology research and clinics.
Affiliation(s)
- Danielle S Bitterman
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts; Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts; Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts.
| | - Timothy A Miller
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
| | - Raymond H Mak
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, Massachusetts; Artificial Intelligence in Medicine Program, Brigham and Women's Hospital, Boston, Massachusetts
| | - Guergana K Savova
- Computational Health Informatics Program, Boston Children's Hospital, Boston, Massachusetts
|
49
|
Wu Y, Zhang C, Wang L, Duan H. A graph-convolutional neural network for addressing small-scale reaction prediction. Chem Commun (Camb) 2021; 57:4114-4117. [PMID: 33908460 DOI: 10.1039/d1cc00586c] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/21/2022]
Abstract
We describe a graph-convolutional neural network (GCN) model, the reaction prediction capabilities of which are as potent as those of the transformer model based on sufficient data, and we adopt the Baeyer-Villiger oxidation reaction to explore their performance differences based on limited data. The top-1 accuracy of the GCN model (90.4%) is higher than that of the transformer model (58.4%).
Affiliation(s)
- Yejian Wu
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China.
| | - Chengyun Zhang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China.
| | - Ling Wang
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China.
| | - Hongliang Duan
- Artificial Intelligence Aided Drug Discovery Institute, College of Pharmaceutical Sciences, Zhejiang University of Technology, Hangzhou 310014, China.
|
50
|
Schwaller P, Hoover B, Reymond JL, Strobelt H, Laino T. Extraction of organic chemistry grammar from unsupervised learning of chemical reactions. SCIENCE ADVANCES 2021; 7:7/15/eabe4166. [PMID: 33827815 PMCID: PMC8026122 DOI: 10.1126/sciadv.abe4166] [Citation(s) in RCA: 87] [Impact Index Per Article: 21.8] [Received: 08/20/2020] [Accepted: 02/03/2021] [Indexed: 05/07/2023]
Abstract
Humans use different domain languages to represent, explore, and communicate scientific concepts. During the last few hundred years, chemists compiled the language of chemical synthesis inferring a series of "reaction rules" from knowing how atoms rearrange during a chemical transformation, a process called atom-mapping. Atom-mapping is a laborious experimental task and, when tackled with computational methods, requires continuous annotation of chemical reactions and the extension of logically consistent directives. Here, we demonstrate that Transformer Neural Networks learn atom-mapping information between products and reactants without supervision or human labeling. Using the Transformer attention weights, we build a chemically agnostic, attention-guided reaction mapper and extract coherent chemical grammar from unannotated sets of reactions. Our method shows remarkable performance in terms of accuracy and speed, even for strongly imbalanced and chemically complex reactions with nontrivial atom-mapping. It provides the missing link between data-driven and rule-based approaches for numerous chemical reaction tasks.
Affiliation(s)
- Philippe Schwaller
- IBM Research Europe, CH-8803 Rüschlikon, Switzerland.
- Department of Chemistry and Biochemistry, University of Bern, Switzerland
| | - Benjamin Hoover
- MIT-IBM Watson AI Lab, IBM Research Cambridge, Cambridge, MA 02142, USA
| | - Jean-Louis Reymond
- Department of Chemistry and Biochemistry, University of Bern, Switzerland
| | - Hendrik Strobelt
- MIT-IBM Watson AI Lab, IBM Research Cambridge, Cambridge, MA 02142, USA
| | - Teodoro Laino
- IBM Research Europe, CH-8803 Rüschlikon, Switzerland
|