1
Lu X, Xie L, Xu L, Mao R, Xu X, Chang S. Multimodal fused deep learning for drug property prediction: Integrating chemical language and molecular graph. Comput Struct Biotechnol J 2024; 23:1666-1679. [PMID: 38680871 PMCID: PMC11046066 DOI: 10.1016/j.csbj.2024.04.030] [Received: 01/24/2024] [Revised: 04/01/2024] [Accepted: 04/10/2024] [Indexed: 05/01/2024]
Abstract
Accurately predicting molecular properties is a challenging but essential task in drug discovery. Recently, many mono-modal deep learning methods have been successfully applied to molecular property prediction. However, mono-modal learning is inherently limited as it relies solely on a single modality of molecular representation, which restricts a comprehensive understanding of drug molecules. To overcome the limitations, we propose a multimodal fused deep learning (MMFDL) model to leverage information from different molecular representations. Specifically, we construct a triple-modal learning model by employing Transformer-Encoder, Bidirectional Gated Recurrent Unit (BiGRU), and graph convolutional network (GCN) to process three modalities of information from chemical language and molecular graph: SMILES-encoded vectors, ECFP fingerprints, and molecular graphs, respectively. We evaluate the proposed triple-modal model using five fusion approaches on six molecule datasets, including Delaney, Llinas2020, Lipophilicity, SAMPL, BACE, and pKa from DataWarrior. The results show that the MMFDL model achieves the highest Pearson coefficients, and stable distribution of Pearson coefficients in the random splitting test, outperforming mono-modal models in accuracy and reliability. Furthermore, we validate the generalization ability of our model in the prediction of binding constants for protein-ligand complex molecules, and assess the resilience capability against noise. Through analysis of feature distributions in chemical space and the assigned contribution of each modal model, we demonstrate that the MMFDL model shows the ability to acquire complementary information by using proper models and suitable fusion approaches. By leveraging diverse sources of bioinformatics information, multimodal deep learning models hold the potential for successful drug discovery.
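The abstract does not say which five fusion approaches were used, so the following is only a minimal sketch of one plausible late-fusion scheme: a weighted average of the per-modality predictions, scored with the Pearson coefficient the abstract reports. The function names, weights, and toy values are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def fuse(predictions, weights):
    """Weighted average of per-modality predictions (one list per modality)."""
    total = sum(weights)
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(len(predictions[0]))
    ]


# Hypothetical toy values standing in for the outputs of the three branches.
y_true = [1.0, 2.0, 3.0, 4.0]
smiles_pred = [1.2, 1.8, 3.3, 3.9]   # e.g. Transformer-Encoder branch
ecfp_pred = [0.7, 2.4, 2.6, 4.3]     # e.g. BiGRU branch
graph_pred = [1.1, 2.1, 2.9, 4.2]    # e.g. GCN branch

# Equal weights are an assumption; learned weights would replace them.
fused = fuse([smiles_pred, ecfp_pred, graph_pred], weights=[1.0, 1.0, 1.0])
print(round(pearson(y_true, fused), 3))
```

Learned or validation-tuned weights would take the place of the equal weights used here.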
Affiliation(s)
- Xiaohua Lu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Liangxu Xie
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Lei Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Rongzhi Mao
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Xiaojun Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
- Shan Chang
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou 213001, China
2
Odugbemi AI, Nyirenda C, Christoffels A, Egieyeh SA. Artificial intelligence in antidiabetic drug discovery: The advances in QSAR and the prediction of α-glucosidase inhibitors. Comput Struct Biotechnol J 2024; 23:2964-2977. [PMID: 39148608 PMCID: PMC11326494 DOI: 10.1016/j.csbj.2024.07.003] [Received: 04/16/2024] [Revised: 07/03/2024] [Accepted: 07/03/2024] [Indexed: 08/17/2024]
Abstract
Artificial Intelligence is transforming drug discovery, particularly in the hit identification phase of therapeutic compounds. One tool that has been instrumental in this transformation is Quantitative Structure-Activity Relationship (QSAR) analysis. This computer-aided drug design tool uses machine learning to predict the biological activity of new compounds based on the numerical representation of chemical structures against various biological targets. With diabetes mellitus becoming a significant health challenge in recent times, there is intense research interest in modulating antidiabetic drug targets. α-Glucosidase is an antidiabetic target that has gained attention due to its ability to suppress postprandial hyperglycaemia, a key contributor to diabetic complications. This review explored a detailed approach to developing QSAR models, focusing on strategies for generating input variables (molecular descriptors) and computational approaches ranging from classical machine learning algorithms to modern deep learning algorithms. We also highlighted studies that have used these approaches to develop predictive models for α-glucosidase inhibitors to modulate this critical antidiabetic drug target.
Affiliation(s)
- Adeshina I Odugbemi
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, Cape Town 7535, South Africa
  - School of Pharmacy, University of the Western Cape, Bellville, Cape Town 7535, South Africa
  - National Institute for Theoretical and Computational Sciences (NITheCS), South Africa
- Clement Nyirenda
  - Department of Computer Science, University of the Western Cape, Cape Town 7535, South Africa
- Alan Christoffels
  - South African Medical Research Council Bioinformatics Unit, South African National Bioinformatics Institute, University of the Western Cape, Bellville, Cape Town 7535, South Africa
  - Africa Centres for Disease Control and Prevention, African Union, Addis Ababa, Ethiopia
- Samuel A Egieyeh
  - School of Pharmacy, University of the Western Cape, Bellville, Cape Town 7535, South Africa
  - National Institute for Theoretical and Computational Sciences (NITheCS), South Africa
3
Fried ZTP, McGuire BA. Automated Mixture Analysis via Structural Evaluation. J Phys Chem A 2024; 128:8254-8264. [PMID: 39264124 DOI: 10.1021/acs.jpca.4c03580] [Indexed: 09/13/2024]
Abstract
The determination of chemical mixture components is vital to a multitude of scientific fields. Oftentimes spectroscopic methods are employed to decipher the composition of these mixtures. However, the sheer density of spectral features present in spectroscopic databases can make unambiguous assignment to individual species challenging. Yet, components of a mixture are commonly chemically related due to environmental processes or shared precursor molecules. Therefore, analysis of the chemical relevance of a molecule is important when determining which species are present in a mixture. In this paper, we combine machine-learning molecular embedding methods with a graph-based ranking system to determine the likelihood of a molecule being present in a mixture based on the other known species and/or chemical priors. By incorporating this metric in a rotational spectroscopy mixture analysis algorithm, we demonstrate that the mixture components can be identified with extremely high accuracy (≥97%) in an efficient manner.
Affiliation(s)
- Zachary T P Fried
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
- Brett A McGuire
  - Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
  - National Radio Astronomy Observatory, Charlottesville, Virginia 22903, United States
4
Guichaoua G, Pinel P, Hoffmann B, Azencott CA, Stoven V. Drug-Target Interactions Prediction at Scale: The Komet Algorithm with the LCIdb Dataset. J Chem Inf Model 2024; 64:6938-6956. [PMID: 39237105 DOI: 10.1021/acs.jcim.4c00422] [Indexed: 09/07/2024]
Abstract
Drug-target interaction (DTI) prediction algorithms are used at various stages of the drug discovery process. In this context, specific problems such as deorphanization of a new therapeutic target, or target identification for a drug candidate arising from phenotypic screens, require large-scale predictions across the protein and molecule spaces. DTI prediction relies heavily on supervised learning algorithms that use known DTIs to learn associations between molecule and protein features, allowing new interactions to be predicted from learned patterns. The algorithms must be broadly applicable to enable reliable predictions, even in regions of the protein or molecule spaces where data may be scarce. In this paper, we address two key challenges in meeting these goals: building large, high-quality training datasets and designing prediction methods that scale well enough to be trained on such datasets. First, we introduce LCIdb, a large, curated dataset of DTIs offering extensive coverage of both the molecule and druggable protein spaces. Notably, LCIdb contains far more molecules than publicly available benchmarks, expanding coverage of the molecule space. Second, we propose Komet (Kronecker Optimized METhod), a DTI prediction pipeline designed for scalability without compromising performance. Komet uses a three-step framework that incorporates efficient computational choices tailored to large datasets, including the Nyström approximation. Specifically, Komet employs a Kronecker interaction module for (molecule, protein) pairs, which efficiently captures determinants of DTIs and whose structure allows for reduced computational complexity and quasi-Newton optimization, so that the model can handle large training sets. Our method is implemented in open-source software that leverages GPU parallel computation for efficiency.
We demonstrate the value of our pipeline on various datasets, showing that Komet displays superior scalability and prediction performance compared with state-of-the-art deep learning approaches. Additionally, we illustrate the generalization properties of Komet by showing its performance on an external dataset and on the publicly available LH benchmark designed for scaffold hopping problems. Komet is available open source at https://komet.readthedocs.io and all datasets, including LCIdb, can be found at https://zenodo.org/records/10731712.
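A Kronecker interaction module can be cheap because of a standard identity: a linear score on the Kronecker product of molecule and protein features equals a bilinear form, so the product features never have to be materialised. The sketch below illustrates that identity only. It is not Komet's implementation, and the helper names and toy feature values are hypothetical.

```python
def kron(u, v):
    """Kronecker product of two vectors, flattened in row-major order."""
    return [a * b for a in u for b in v]


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bilinear(u, W, v):
    """Compute u^T W v without building the Kronecker feature vector."""
    return sum(u[i] * W[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


# Hypothetical toy features for one (molecule, protein) pair.
mol = [1.0, 2.0]
prot = [0.5, -1.0, 3.0]
W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]    # weights arranged as a matrix
w_flat = [x for row in W for x in row]     # the same weights, flattened row-major

explicit = dot(kron(mol, prot), w_flat)    # builds len(mol) * len(prot) features
implicit = bilinear(mol, W, prot)          # never builds them
assert abs(explicit - implicit) < 1e-12
```

With feature dimensions in the hundreds or thousands, avoiding the explicit product is what keeps memory use manageable for very large sets of pairs.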
Affiliation(s)
- Gwenn Guichaoua
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
- Philippe Pinel
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
  - Iktos SAS, 75017 Paris, France
- Chloé-Agathe Azencott
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
- Véronique Stoven
  - Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
  - Institut Curie, Université PSL, 75005 Paris, France
  - INSERM U900, 75005 Paris, France
5
Fu X, Cheng W, Wan G, Yang Z, Tee BCK. Toward an AI Era: Advances in Electronic Skins. Chem Rev 2024; 124:9899-9948. [PMID: 39198214 PMCID: PMC11397144 DOI: 10.1021/acs.chemrev.4c00049] [Indexed: 09/01/2024]
Abstract
Electronic skins (e-skins) have seen intense research and rapid development in the past two decades. To mimic the capabilities of human skin, a multitude of flexible/stretchable sensors that detect physiological and environmental signals have been designed and integrated into functional systems. Recently, researchers have increasingly deployed machine learning and other artificial intelligence (AI) technologies to mimic the human neural system for the processing and analysis of sensory data collected by e-skins. Integrating AI has the potential to enable advanced applications in robotics, healthcare, and human-machine interfaces but also presents challenges such as data diversity and AI model robustness. In this review, we first summarize the functions and features of e-skins, followed by feature extraction of sensory data and different AI models. Next, we discuss the utilization of AI in the design of e-skin sensors and address the key topic of AI implementation in data processing and analysis of e-skins to accomplish a range of different tasks. Subsequently, we explore hardware-layer in-skin intelligence before concluding with an analysis of the challenges and opportunities in the various aspects of AI-enabled e-skins.
Affiliation(s)
- Xuemei Fu
  - Department of Materials Science and Engineering, National University of Singapore, Singapore 117575, Singapore
  - Institute for Health Innovation & Technology, National University of Singapore, Singapore 119276, Singapore
- Wen Cheng
  - Department of Materials Science and Engineering, National University of Singapore, Singapore 117575, Singapore
  - Institute for Health Innovation & Technology, National University of Singapore, Singapore 119276, Singapore
  - The N.1 Institute for Health, National University of Singapore, Singapore 117456, Singapore
- Guanxiang Wan
  - Department of Materials Science and Engineering, National University of Singapore, Singapore 117575, Singapore
  - Institute for Health Innovation & Technology, National University of Singapore, Singapore 119276, Singapore
- Zijie Yang
  - Department of Materials Science and Engineering, National University of Singapore, Singapore 117575, Singapore
  - Institute for Health Innovation & Technology, National University of Singapore, Singapore 119276, Singapore
- Benjamin C K Tee
  - Department of Materials Science and Engineering, National University of Singapore, Singapore 117575, Singapore
  - Institute for Health Innovation & Technology, National University of Singapore, Singapore 119276, Singapore
  - Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583, Singapore
  - The N.1 Institute for Health, National University of Singapore, Singapore 117456, Singapore
  - Institute of Materials Research and Engineering, Agency for Science Technology and Research, Singapore 138634, Singapore
6
Schmid SP, Schlosser L, Glorius F, Jorner K. Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis. Beilstein J Org Chem 2024; 20:2280-2304. [PMID: 39290209 PMCID: PMC11406055 DOI: 10.3762/bjoc.20.196] [Received: 05/09/2024] [Accepted: 08/09/2024] [Indexed: 09/19/2024]
Abstract
Organocatalysis has established itself as a third pillar of homogeneous catalysis, alongside transition metal catalysis and biocatalysis, as its use for enantioselective reactions has gathered significant interest over the last decades. Concurrent with this development, machine learning (ML) has been increasingly applied in the chemical domain to efficiently uncover hidden patterns in data and accelerate scientific discovery. While the uptake of ML in organocatalysis has been comparatively slow, the last two decades have shown increased interest from the community. This review gives an overview of the work in the field of ML in organocatalysis. The review starts with a short primer on ML for experimental chemists, before discussing its application to predicting the selectivity of organocatalytic transformations. Subsequently, we review ML employed for privileged catalysts, before focusing on its application to catalyst and reaction design. We conclude with our view on current challenges and future directions for this field, drawing inspiration from the application of ML to other scientific domains.
Affiliation(s)
- Stefan P Schmid
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
- Leon Schlosser
  - Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
- Frank Glorius
  - Organisch-Chemisches Institut, Universität Münster, 48149 Münster, Germany
- Kjell Jorner
  - Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich CH-8093, Switzerland
  - National Centre of Competence in Research (NCCR) Catalysis, ETH Zurich, Zurich CH-8093, Switzerland
7
Bhattacharya D, Cassady HJ, Hickner MA, Reinhart WF. Large Language Models as Molecular Design Engines. J Chem Inf Model 2024. [PMID: 39231030 DOI: 10.1021/acs.jcim.4c01396] [Indexed: 09/06/2024]
Abstract
The design of small molecules is crucial for technological applications ranging from drug discovery to energy storage. Due to the vast design space available to modern synthetic chemistry, the community has increasingly sought to use data-driven and machine learning approaches to navigate this space. Although generative machine learning methods have recently shown potential for computational molecular design, their use is hindered by complex training procedures, and they often fail to generate valid and unique molecules. In this context, pretrained Large Language Models (LLMs) have emerged as potential tools for molecular design, as they appear to be capable of creating and modifying molecules based on simple instructions provided through natural language prompts. In this work, we show that the Claude 3 Opus LLM can read, write, and modify molecules according to prompts, with an impressive 97% of the generated molecules being valid and unique. By quantifying these modifications in a low-dimensional latent space, we systematically evaluate the model's behavior under different prompting conditions. Notably, the model is able to perform guided molecular generation when asked to manipulate the electronic structure of molecules using simple, natural-language prompts. Our findings highlight the potential of LLMs as powerful and versatile molecular design engines.
Affiliation(s)
- Debjyoti Bhattacharya
  - Materials Science and Engineering, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Harrison J Cassady
  - Department of Chemical Engineering and Material Science, Michigan State University, East Lansing, Michigan 48824, United States
- Michael A Hickner
  - Department of Chemical Engineering and Material Science, Michigan State University, East Lansing, Michigan 48824, United States
- Wesley F Reinhart
  - Materials Science and Engineering, Pennsylvania State University, University Park, Pennsylvania 16802, United States
  - Institute for Computational and Data Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
8
Xi R, Liu H, Liu X, Zhao X. Predicting and screening high-performance polyimide membranes using negative correlation based deep ensemble methods. Analytical Methods: Advancing Methods and Applications 2024; 16:5845-5863. [PMID: 39145470 DOI: 10.1039/d4ay01160k] [Indexed: 08/16/2024]
Abstract
Polyimide polymer membranes have become critical materials in gas separation and storage applications due to their high selectivity and excellent permeability. However, with over 10^7 known types of polyimides, relying solely on experimental research means that potential high-performance candidates are likely to be overlooked. This study employs a deep learning method optimized by negative correlation ensemble techniques to predict the gas permeability and selectivity of polyimide structures, enabling rapid and efficient material screening. We propose a deep neural network model based on negative correlation deep ensemble methods (DNN-NCL), using Morgan molecular fingerprints as input. The DNN-NCL model achieves an R² value of approximately 0.95 on the test set, a 4% improvement over recent model performance, and effectively mitigates overfitting, with a maximum discrepancy of less than 0.03 between the training and test sets. High-throughput screening of over 8 million hypothetical polymers identified hundreds of promising candidates for gas separation membranes, with 14 structures exceeding the Robeson upper bound for CO₂/N₂ separation. Visualization of the high-throughput predictions shows that, although the Robeson upper bound was never explicitly used as a model constraint, the majority of predictions fall below this limit, demonstrating the deep learning model's ability to reflect real-world physical conditions. SHAP analysis of the model predictions made them interpretable and identified three functional groups that the deep neural network deems important for gas permeability: carbonyl, thiophene, and ester groups. This establishes a bridge between the structure and properties of polyimide materials.
Additionally, we confirmed that two polyimide structures predicted by the model to have excellent CO₂/N₂ selectivity, namely 6-methylpyrimidin-5-amine and 1,4,5,6-tetrahydropyrimidin-2-amine, have been experimentally validated in previous studies. This research demonstrates the feasibility of using deep learning methods to explore the vast chemical space of polyimides, providing a powerful tool for discovering high-performance gas separation membranes.
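The abstract does not give the loss used by DNN-NCL. As a rough illustration of the general idea, the sketch below uses the classic negative correlation learning objective, in which each ensemble member is charged its own squared error minus a reward for deviating from the ensemble mean. The function name `ncl_losses`, the penalty strength `lam`, and the toy numbers are assumptions for illustration.

```python
def ncl_losses(member_preds, target, lam):
    """Per-member negative correlation learning loss for a single sample.

    Each member i pays 0.5 * (f_i - y)^2 and is rewarded by
    lam * (f_i - f_mean)^2 for disagreeing with the ensemble mean,
    which pushes the members' errors to be negatively correlated.
    """
    mean = sum(member_preds) / len(member_preds)
    return [0.5 * (f - target) ** 2 - lam * (f - mean) ** 2 for f in member_preds]


# Hypothetical predictions from three ensemble members for one polymer.
preds = [1.8, 2.1, 2.4]
print(ncl_losses(preds, target=2.0, lam=0.0))  # plain squared error, no diversity term
print(ncl_losses(preds, target=2.0, lam=0.5))  # diversity term lowers loss for outlying members
```

With `lam = 0` the members train independently; larger values trade individual accuracy for diversity, which is the lever such ensembles use against overfitting.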
Affiliation(s)
- Ruochen Xi
  - School of Petrochemical Engineering, Shenyang University of Technology, Liaoyang, China
- Hongjing Liu
  - School of Petrochemical Engineering, Shenyang University of Technology, Liaoyang, China
- Xueli Liu
  - School of Petrochemical Engineering, Shenyang University of Technology, Liaoyang, China
- Xu Zhao
  - School of Petrochemical Engineering, Shenyang University of Technology, Liaoyang, China
9
Khan MZI, Ren JN, Cao C, Ye HYX, Wang H, Guo YM, Yang JR, Chen JZ. Comprehensive hepatotoxicity prediction: ensemble model integrating machine learning and deep learning. Front Pharmacol 2024; 15:1441587. [PMID: 39234116 PMCID: PMC11373136 DOI: 10.3389/fphar.2024.1441587] [Received: 05/31/2024] [Accepted: 07/24/2024] [Indexed: 09/06/2024]
Abstract
Background: Chemicals may cause acute liver injury, posing a serious threat to human health. Establishing the precise safety profile of a compound is challenging because testing procedures are complex and expensive. In silico approaches can help identify the potential risk of drug candidates at the initial stage of drug development and thus reduce development costs. Methods: In the current study, QSAR models for hepatotoxicity prediction were developed using an ensemble strategy that integrates machine learning (ML) and deep learning (DL) algorithms with various molecular features. A large dataset of 2588 chemicals and drugs was randomly divided into training (80%) and test (20%) sets, and individual base models were then trained with diverse machine learning or deep learning algorithms on three different kinds of descriptors and fingerprints. Feature selection approaches were employed to optimize the models based on their performance. Hybrid ensemble approaches were further used to determine the best-performing method. Results: The voting ensemble classifier emerged as the optimal model, achieving a prediction accuracy of 80.26%, an AUC of 82.84%, and a recall of over 93%, followed by the bagging and stacking ensemble classifiers. The model was further verified with an external test set, internal 10-fold cross-validation, and rigorous benchmark training, exhibiting much better reliability than the published models. Conclusion: The proposed ensemble model offers a dependable, well-performing assessment of the risk that chemicals and drugs induce liver damage.
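The abstract does not say whether the voting classifier uses hard or soft voting, so the following is only a minimal sketch assuming hard (majority) voting over binary hepatotoxicity labels, together with the accuracy and recall metrics quoted above. The base-model outputs are hypothetical toy values, not data from the study.

```python
def majority_vote(votes_per_model):
    """Hard voting: predict 1 when more than half of the models predict 1."""
    n_models = len(votes_per_model)
    return [1 if sum(col) * 2 > n_models else 0 for col in zip(*votes_per_model)]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """True positives divided by all actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn)


# Hypothetical labels (1 = hepatotoxic) and outputs of three base models.
y_true = [1, 1, 0, 0, 1]
model_a = [1, 1, 0, 1, 1]
model_b = [1, 0, 0, 0, 1]
model_c = [0, 1, 1, 0, 1]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble, accuracy(y_true, ensemble), recall(y_true, ensemble))
```

Each toy base model makes one mistake, but on a different sample, so the majority corrects all three. That is the effect a voting ensemble relies on when its base models are diverse.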
Affiliation(s)
- Jia-Nan Ren
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Cheng Cao
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Polytechnic Institute, Zhejiang University, Hangzhou, China
- Hong-Yu-Xiang Ye
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Hao Wang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Ya-Min Guo
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Jin-Rong Yang
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Polytechnic Institute, Zhejiang University, Hangzhou, China
- Jian-Zhong Chen
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
10
Akgüller Ö, Balcı MA, Cioca G. Clustering Molecules at a Large Scale: Integrating Spectral Geometry with Deep Learning. Molecules 2024; 29:3902. [PMID: 39202980 PMCID: PMC11357287 DOI: 10.3390/molecules29163902] [Received: 07/20/2024] [Revised: 08/14/2024] [Accepted: 08/14/2024] [Indexed: 09/03/2024]
Abstract
This study conducts an in-depth analysis of clustering small molecules using spectral geometry and deep learning techniques. We applied a spectral geometric approach to convert molecular structures into triangulated meshes and used the Laplace-Beltrami operator to derive significant geometric features. By examining the eigenvectors of these operators, we captured the intrinsic geometric properties of the molecules, aiding their classification and clustering. The research utilized four deep learning methods: Deep Belief Network, Convolutional Autoencoder, Variational Autoencoder, and Adversarial Autoencoder, each paired with k-means clustering at different cluster sizes. Clustering quality was evaluated using the Calinski-Harabasz and Davies-Bouldin indices, Silhouette Score, and standard deviation. Nonparametric tests were used to assess the impact of topological descriptors on clustering outcomes. Our results show that the DBN + k-means combination is the most effective, particularly at lower cluster counts, demonstrating significant sensitivity to structural variations. This study highlights the potential of integrating spectral geometry with deep learning for precise and efficient molecular clustering.
Affiliation(s)
- Ömer Akgüller
  - Faculty of Science, Department of Mathematics, Mugla Sitki Kocman University, Muğla 48000, Turkey
- Mehmet Ali Balcı
  - Faculty of Science, Department of Mathematics, Mugla Sitki Kocman University, Muğla 48000, Turkey
- Gabriela Cioca
  - Faculty of Medicine, Preclinical Department, Lucian Blaga University of Sibiu, 550024 Sibiu, Romania
11
Gricourt G, Meyer P, Duigou T, Faulon JL. Artificial Intelligence Methods and Models for Retro-Biosynthesis: A Scoping Review. ACS Synth Biol 2024; 13:2276-2294. [PMID: 39047143 PMCID: PMC11334239 DOI: 10.1021/acssynbio.4c00091] [Received: 02/09/2024] [Revised: 06/14/2024] [Accepted: 06/14/2024] [Indexed: 07/27/2024]
Abstract
Retrosynthesis aims to efficiently plan the synthesis of desirable chemicals by strategically breaking down molecules into readily available building block compounds. Having a long history in chemistry, retro-biosynthesis has also been used in the fields of biocatalysis and synthetic biology. Artificial intelligence (AI) is driving us toward new frontiers in synthesis planning and the exploration of chemical spaces, arriving at an opportune moment for promoting bioproduction that would better align with green chemistry, enhancing environmental practices. In this review, we summarize the recent advancements in the application of AI methods and models for retrosynthetic and retro-biosynthetic pathway design. These techniques can be based either on reaction templates or generative models and require scoring functions and planning strategies to navigate through the retrosynthetic graph of possibilities. We finally discuss limitations and promising research directions in this field.
Affiliation(s)
- Guillaume Gricourt
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
- Philippe Meyer
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
- Thomas Duigou
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
- Jean-Loup Faulon
  - Université Paris-Saclay, INRAE, AgroParisTech, Micalis Institute, 78350 Jouy-en-Josas, France
  - The University of Manchester, Manchester Institute of Biotechnology, Manchester M1 7DN, U.K.
12
Sun YY, Hsieh CY, Wen JH, Tseng TY, Huang JH, Oyang YJ, Huang HC, Juan HF. scDrug+: predicting drug-responses using single-cell transcriptomics and molecular structure. Biomed Pharmacother 2024; 177:117070. [PMID: 38964180 DOI: 10.1016/j.biopha.2024.117070] [Received: 04/19/2024] [Revised: 06/18/2024] [Accepted: 06/29/2024] [Indexed: 07/06/2024]
Abstract
Predicting drug responses based on individual transcriptomic profiles holds promise for refining prognosis and advancing precision medicine. Although many studies have endeavored to predict the responses of known drugs to novel transcriptomic profiles, research into predicting responses for newly discovered drugs remains sparse. In this study, we introduce scDrug+, a comprehensive pipeline that seamlessly integrates single-cell analysis with drug-response prediction. Importantly, scDrug+ is equipped to predict the response of new drugs by analyzing their molecular structures. The open-source tool is available as a Docker container, ensuring ease of deployment and reproducibility. It can be accessed at https://github.com/ailabstw/scDrugplus.
Affiliation(s)
- Yih-Yun Sun
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Taiwan AI Labs, Taipei 10351, Taiwan
| | | | - Jian-Hung Wen
- Taiwan AI Labs, Taipei 10351, Taiwan; Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 11221, Taiwan
| | - Tzu-Yang Tseng
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Department of Life Science, National Taiwan University, Taipei 106, Taiwan
- Yen-Jen Oyang
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan
- Hsuan-Cheng Huang
- Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 11221, Taiwan.
- Hsueh-Fen Juan
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Taiwan AI Labs, Taipei 10351, Taiwan; Department of Life Science, National Taiwan University, Taipei 106, Taiwan; Center for Computational and Systems Biology, National Taiwan University, Taipei 106, Taiwan; Center for Advanced Computing and Imaging in Biomedicine, National Taiwan University, Taipei 106, Taiwan.
13
|
Hauben M. A Pharmacovigilance Florilegium. Clin Ther 2024; 46:520-523. [PMID: 39030077 DOI: 10.1016/j.clinthera.2024.06.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Accepted: 06/11/2024] [Indexed: 07/21/2024]
Affiliation(s)
- Manfred Hauben
- Department of Family and Community Medicine, New York Medical College, Valhalla, New York; Truliant Consulting, Baltimore, Maryland.
14
|
Kalikadien AV, Mirza A, Hossaini AN, Sreenithya A, Pidko EA. Paving the road towards automated homogeneous catalyst design. Chempluschem 2024; 89:e202300702. [PMID: 38279609 DOI: 10.1002/cplu.202300702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 12/20/2023] [Indexed: 01/28/2024]
Abstract
In the past decade, computational tools have become integral to catalyst design. They continue to offer significant support to experimental organic synthesis and catalysis researchers aiming for optimal reaction outcomes. More recently, data-driven approaches utilizing machine learning have garnered considerable attention for their expansive capabilities. This Perspective provides an overview of diverse initiatives in the realm of computational catalyst design and introduces our automated tools tailored for high-throughput in silico exploration of the chemical space. While valuable insights are gained through methods for high-throughput in silico exploration and analysis of chemical space, their degree of automation and modularity are key. We argue that the integration of data-driven, automated and modular workflows is key to enhancing homogeneous catalyst design on an unprecedented scale, contributing to the advancement of catalysis research.
Affiliation(s)
- Adarsh V Kalikadien
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- Adrian Mirza
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- Aydin Najl Hossaini
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- Avadakkam Sreenithya
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
- Evgeny A Pidko
- Inorganic Systems Engineering, Department of Chemical Engineering, Faculty of Applied Sciences, Delft University of Technology, Van der Maasweg 9, 2629 HZ, Delft, The Netherlands
15
|
Tong X, Qu N, Kong X, Ni S, Zhou J, Wang K, Zhang L, Wen Y, Shi J, Zhang S, Li X, Zheng M. Deep representation learning of chemical-induced transcriptional profile for phenotype-based drug discovery. Nat Commun 2024; 15:5378. [PMID: 38918369 PMCID: PMC11199551 DOI: 10.1038/s41467-024-49620-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Accepted: 06/10/2024] [Indexed: 06/27/2024] Open
Abstract
Artificial intelligence transforms drug discovery, with phenotype-based approaches emerging as a promising alternative to target-based methods, overcoming limitations like lack of well-defined targets. While chemical-induced transcriptional profiles offer a comprehensive view of drug mechanisms, inherent noise often obscures the true signal, hindering their potential for meaningful insights. Here, we highlight the development of TranSiGen, a deep generative model employing self-supervised representation learning. TranSiGen analyzes basal cell gene expression and molecular structures to reconstruct chemical-induced transcriptional profiles with high accuracy. By capturing both cellular and compound information, TranSiGen-derived representations demonstrate efficacy in diverse downstream tasks like ligand-based virtual screening, drug response prediction, and phenotype-based drug repurposing. Notably, in vitro validation of TranSiGen's application in pancreatic cancer drug discovery highlights its potential for identifying effective compounds. We envisage that integrating TranSiGen into the drug discovery and mechanism research holds significant promise for advancing biomedicine.
Affiliation(s)
- Xiaochu Tong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Ning Qu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Xiangtai Kong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Shengkun Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Jingyi Zhou
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, 201210, China
- Lingang Laboratory, Shanghai, 200031, China
- Kun Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, 230026, China
- Lehan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Yiming Wen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China
- Jiangshan Shi
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
- Sulin Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
- Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
- School of Pharmaceutical Science and Technology, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China.
16
|
Sirocchi C, Biancucci F, Donati M, Bogliolo A, Magnani M, Menotta M, Montagna S. Exploring machine learning for untargeted metabolomics using molecular fingerprints. Comput Methods Programs Biomed 2024; 250:108163. [PMID: 38626559 DOI: 10.1016/j.cmpb.2024.108163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 03/15/2024] [Accepted: 04/03/2024] [Indexed: 04/18/2024]
Abstract
BACKGROUND: Metabolomics, the study of substrates and products of cellular metabolism, offers valuable insights into an organism's state under specific conditions and has the potential to revolutionise preventive healthcare and pharmaceutical research. However, analysing large metabolomics datasets remains challenging, with available methods relying on limited and incompletely annotated metabolic pathways. METHODS: This study, inspired by well-established methods in drug discovery, employs machine learning on metabolite fingerprints to explore the relationship of their structure with responses in experimental conditions beyond known pathways, shedding light on metabolic processes. It evaluates fingerprinting effectiveness in representing metabolites, addressing challenges like class imbalance, data sparsity, high dimensionality, duplicate structural encoding, and interpretable features. Feature importance analysis is then applied to reveal key chemical configurations affecting classification, identifying related metabolite groups. RESULTS: The approach is tested on two datasets: one on Ataxia Telangiectasia and another on endothelial cells under low oxygen. Machine learning on molecular fingerprints predicts metabolite responses effectively, and feature importance analysis aligns with known metabolic pathways, unveiling new affected metabolite groups for further study. CONCLUSION: The presented approach leverages the strengths of drug discovery to address critical issues in metabolomics research and aims to bridge the gap between these two disciplines. This work lays the foundation for future research in this direction, possibly exploring alternative structural encodings and machine learning models.
Affiliation(s)
- Christel Sirocchi
- Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica, 13, Urbino, 61029, Italy.
- Federica Biancucci
- Department of Biomolecular Sciences, University of Urbino, Via Saffi 2, Urbino, 61029, Italy
- Matteo Donati
- Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica, 13, Urbino, 61029, Italy
- Alessandro Bogliolo
- Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica, 13, Urbino, 61029, Italy
- Mauro Magnani
- Department of Biomolecular Sciences, University of Urbino, Via Saffi 2, Urbino, 61029, Italy
- Michele Menotta
- Department of Biomolecular Sciences, University of Urbino, Via Saffi 2, Urbino, 61029, Italy
- Sara Montagna
- Department of Pure and Applied Sciences, University of Urbino, Piazza della Repubblica, 13, Urbino, 61029, Italy
17
|
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a smaller fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of the domain of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules is synthesized, preferably through fewer empirical attempts instead of a larger library, to realize an active candidate. On this front, predictive endeavors using machine learning (ML) models built on available data acquire high and timely significance. Prediction of molecular property and reaction outcome remains one of the burgeoning applications of ML in chemical science. Among several methods of encoding molecular samples for ML models, the ones that employ language-like representations are gaining steady popularity. Such representations would additionally help adopt well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP focusing on molecular property and reaction outcome predictions. From relatively simpler recurrent neural networks (RNNs) to complex models like transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retro-synthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we showcase an optimistic outlook of CLMs, attention is also placed on the persisting challenges in the reaction domain, which would optimistically be addressed by advanced algorithms tailored to chemical language and by increased availability of high-quality datasets.
Affiliation(s)
- Manajit Das
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
18
|
Xiang W, Zhong F, Ni L, Zheng M, Li X, Shi Q, Wang D. Gram matrix: an efficient representation of molecular conformation and learning objective for molecular pretraining. Brief Bioinform 2024; 25:bbae340. [PMID: 38990515 PMCID: PMC11238115 DOI: 10.1093/bib/bbae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 06/05/2024] [Accepted: 06/28/2024] [Indexed: 07/12/2024] Open
Abstract
Accurate prediction of molecular properties is fundamental in drug discovery and development, providing crucial guidance for effective drug design. A critical factor in achieving accurate molecular property prediction lies in the appropriate representation of molecular structures. Presently, prevalent deep learning-based molecular representations rely on 2D structure information as the primary molecular representation, often overlooking essential three-dimensional (3D) conformational information due to the inherent limitations of 2D structures in conveying atomic spatial relationships. In this study, we propose employing the Gram matrix as a condensed representation of 3D molecular structures and for efficient pretraining objectives. Subsequently, we leverage this matrix to construct a novel molecular representation model, Pre-GTM, which inherently encapsulates 3D information. The model accurately predicts the 3D structure of a molecule by estimating the Gram matrix. Our findings demonstrate that Pre-GTM model outperforms the baseline Graphormer model and other pretrained models in the QM9 and MoleculeNet quantitative property prediction task. The integration of the Gram matrix as a condensed representation of 3D molecular structure, incorporated into the Pre-GTM model, opens up promising avenues for its potential application across various domains of molecular research, including drug design, materials science, and chemical engineering.
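As a rough illustration of why a Gram matrix can act as a condensed 3D representation, the sketch below builds the Gram matrix of centered atomic coordinates and recovers all interatomic distances from it using d_ij^2 = G_ii + G_jj - 2 G_ij. This is a minimal sketch under stated assumptions, not the Pre-GTM implementation: the function names, the centering step, and the toy coordinates are hypothetical and do not come from the paper.

```python
import math


def gram_matrix(coords):
    """Gram matrix G[i][j] = <x_i - c, x_j - c> for 3D points with centroid c.

    Centering is an assumption made here so that G does not change
    when the whole molecule is translated.
    """
    n = len(coords)
    centroid = [sum(p[k] for p in coords) / n for k in range(3)]
    centered = [[p[k] - centroid[k] for k in range(3)] for p in coords]
    return [[sum(a * b for a, b in zip(u, v)) for v in centered] for u in centered]


def distances_from_gram(G):
    """Recover pairwise distances via d_ij^2 = G_ii + G_jj - 2 * G_ij."""
    n = len(G)
    return [
        # max(..., 0.0) guards against tiny negative values from rounding
        [math.sqrt(max(G[i][i] + G[j][j] - 2.0 * G[i][j], 0.0)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical three-atom geometry (arbitrary units), for illustration only.
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
toy_distances = distances_from_gram(gram_matrix(toy_coords))
```

Because inner products are unchanged by rotation, and centering removes translation, two poses of the same conformer give the same matrix, which is one plausible reason such a matrix is convenient as a learning target.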
Affiliation(s)
- Feisheng Zhong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Fujian Key Laboratory of Drug Target Discovery and Structural and Functional Research, School of Pharmacy, Fujian Medical University, Fuzhou 350122, China
- Lin Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
- Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Qian Shi
- Lingang Laboratory, Shanghai 200031, China
19
|
Oniani D, Hilsman J, Zang C, Wang J, Cai L, Zawala J, Wang Y. Emerging opportunities of using large language models for translation between drug molecules and indications. Sci Rep 2024; 14:10738. [PMID: 38730226 PMCID: PMC11087469 DOI: 10.1038/s41598-024-61124-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 05/02/2024] [Indexed: 05/12/2024] Open
Abstract
A drug molecule is a substance that changes an organism's mental or physical state. Every approved drug has an indication, which refers to the therapeutic use of that drug for treating a particular medical condition. While the Large Language Model (LLM), a generative Artificial Intelligence (AI) technique, has recently demonstrated effectiveness in translating between molecules and their textual descriptions, there remains a gap in research regarding their application in facilitating the translation between drug molecules and indications (which describes the disease, condition or symptoms for which the drug is used), or vice versa. Addressing this challenge could greatly benefit the drug discovery process. The capability of generating a drug from a given indication would allow for the discovery of drugs targeting specific diseases or targets and ultimately provide patients with better treatments. In this paper, we first propose a new task, the translation between drug molecules and corresponding indications, and then test existing LLMs on this new task. Specifically, we consider nine variations of the T5 LLM and evaluate them on two public datasets obtained from ChEMBL and DrugBank. Our experiments show the early results of using LLMs for this task and provide a perspective on the state-of-the-art. We also emphasize the current limitations and discuss future work that has the potential to improve the performance on this task. The creation of molecules from indications, or vice versa, will allow for more efficient targeting of diseases and significantly reduce the cost of drug discovery, with the potential to revolutionize the field of drug discovery in the era of generative AI.
Affiliation(s)
- David Oniani
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
- Jordan Hilsman
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
- Chengxi Zang
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Junmei Wang
- Department of Pharmaceutical Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Lianjin Cai
- Department of Pharmaceutical Sciences, University of Pittsburgh, Pittsburgh, PA, USA
- Jan Zawala
- Jerzy Haber Institute of Catalysis and Surface Chemistry, Polish Academy of Sciences, Kraków, Poland
- Yanshan Wang
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA.
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA.
- Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA.
20
|
Pang C, Qiao J, Zeng X, Zou Q, Wei L. Deep Generative Models in De Novo Drug Molecule Generation. J Chem Inf Model 2024; 64:2174-2194. [PMID: 37934070 DOI: 10.1021/acs.jcim.3c01496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2023]
Abstract
The discovery of new drugs has important implications for human health. Traditional methods for drug discovery rely on experiments to optimize the structure of lead molecules, which are time-consuming and high-cost. Recently, artificial intelligence has exhibited promising and efficient performance for drug-like molecule generation. In particular, deep generative models achieve great success in de novo generation of drug-like molecules with desired properties, showing massive potential for novel drug discovery. In this study, we review the recent progress of molecule generation using deep generative models, mainly focusing on molecule representations, public databases, data processing tools, and advanced artificial intelligence based molecule generation frameworks. In particular, we present a comprehensive comparison of state-of-the-art deep generative models for molecule generation and a summary of commonly used molecular design strategies. We identify research gaps and challenges of molecule generation such as the need for better databases, missing 3D information in molecular representation, and the lack of high-precision evaluation metrics. We suggest future directions for molecular generation and drug discovery.
Affiliation(s)
- Chao Pang
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
- Jianbo Qiao
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
- Xiangxiang Zeng
- College of Information Science and Engineering, Hunan University, Changsha 410082, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Leyi Wei
- School of Software, Shandong University, Jinan 250100, China
- Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250100, China
21
|
Bi H, Jiang J, Chen J, Kuang X, Zhang J. Machine Learning Prediction of Quantum Yields and Wavelengths of Aggregation-Induced Emission Molecules. Materials (Basel) 2024; 17:1664. [PMID: 38612177 PMCID: PMC11012915 DOI: 10.3390/ma17071664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 03/27/2024] [Accepted: 04/02/2024] [Indexed: 04/14/2024]
Abstract
The aggregation-induced emission (AIE) effect exhibits a significant influence on the development of luminescent materials and has made remarkable progress over the past decades. The advancement of high-performance AIE materials requires fast and accurate predictions of their photophysical properties, which is impeded by the inherent limitations of quantum chemical calculations. In this work, we present an accurate machine learning approach for the fast predictions of quantum yields and wavelengths to screen out AIE molecules. A database of about 563 organic luminescent molecules with quantum yields and wavelengths in the monomeric/aggregated states was established. Individual/combined molecular fingerprints were selected and compared elaborately to attain appropriate molecular descriptors. Different machine learning algorithms combined with favorable molecular fingerprints were further screened to achieve more accurate prediction models. The simulation results indicate that combined molecular fingerprints yield more accurate predictions in the aggregated states, and random forest and gradient boosting regression algorithms show the best predictions in quantum yields and wavelengths, respectively. Given the successful applications of machine learning in quantum yields and wavelengths, it is reasonable to anticipate that machine learning can serve as a complementary strategy to traditional experimental/theoretical methods in the investigation of aggregation-induced luminescent molecules to facilitate the discovery of luminescent materials.
Affiliation(s)
- Jinxiao Zhang
- College of Chemistry and Bioengineering, Guilin University of Technology, Guilin 541006, China; (H.B.)
22
|
Gouveia GJ, Head T, Cheng LL, Clendinen CS, Cort JR, Du X, Edison AS, Fleischer CC, Hoch J, Mercaldo N, Pathmasiri W, Raftery D, Schock TB, Sumner LW, Takis PG, Copié V, Eghbalnia HR, Powers R. Perspective: use and reuse of NMR-based metabolomics data: what works and what remains challenging. Metabolomics 2024; 20:41. [PMID: 38480600 DOI: 10.1007/s11306-024-02090-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 01/12/2024] [Indexed: 04/20/2024]
Abstract
BACKGROUND: The National Cancer Institute issued a Request for Information (RFI; NOT-CA-23-007) in October 2022, soliciting input on using and reusing metabolomics data. This RFI aimed to gather input on best practices for metabolomics data storage, management, and use/reuse. AIM OF REVIEW: The nuclear magnetic resonance (NMR) Interest Group within the Metabolomics Association of North America (MANA) prepared a set of recommendations regarding the deposition, archiving, use, and reuse of NMR-based and, to a lesser extent, mass spectrometry (MS)-based metabolomics datasets. These recommendations were built on the collective experiences of metabolomics researchers within MANA who are generating, handling, and analyzing diverse metabolomics datasets spanning experimental (sample handling and preparation, NMR/MS metabolomics data acquisition, processing, and spectral analyses) to computational (automation of spectral processing, univariate and multivariate statistical analysis, metabolite prediction and identification, multi-omics data integration, etc.) studies. KEY SCIENTIFIC CONCEPTS OF REVIEW: We provide a synopsis of our collective view regarding the use and reuse of metabolomics data and articulate several recommendations regarding best practices, which are aimed at encouraging researchers to strengthen efforts toward maximizing the utility of metabolomics data, multi-omics data integration, and enhancing the overall scientific impact of metabolomics studies.
Affiliation(s)
- Goncalo Jorge Gouveia
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Institute for Bioscience and Biotechnology Research, National Institute of Standards and Technology, University of Maryland, Gudelsky Drive, Rockville, MD, 20850, USA
- Thomas Head
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- University of British Columbia, Kelowna, BC, V1V 1V7, Canada
- Leo L Cheng
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Pathology and Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Chaevien S Clendinen
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Earth and Biological Sciences Directorate, Environmental Molecular Sciences Laboratory, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- John R Cort
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Earth and Biological Sciences Directorate, Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, 99352, USA
- Xiuxia Du
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, 9291 University City Blvd, Charlotte, NC, 28223, USA
- Arthur S Edison
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Biochemistry, University of Georgia, Athens, GA, USA
- Candace C Fleischer
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Radiology and Imaging Sciences, Emory University School of Medicine, Atlanta, GA, 30322, USA
- Jeffrey Hoch
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Molecular Biology and Biophysics, UConn Health, Farmington, CT, 06030-3305, USA
- Nathaniel Mercaldo
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Radiology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Wimal Pathmasiri
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Nutrition, School of Public Health, Nutrition Research Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
| | - Daniel Raftery
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Anesthesia and Pain Medicine, University of Washington, Seattle, WA, 98109, USA
| | - Tracey B Schock
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Chemical Sciences Division, National Institute of Standards and Technology (NIST), Charleston, SC, 29412, USA
| | - Lloyd W Sumner
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Biochemistry, MU Metabolomics Center, Bond Life Sciences Center, Interdisciplinary Plant Group, University of Missouri, Columbia, MO, 65211, USA
| | - Panteleimon G Takis
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Section of Bioanalytical Chemistry, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London, SW7 2AZ, UK
- Department of Metabolism, Digestion and Reproduction, National Phenome Centre, Imperial College London, London, W12 0NN, UK
| | - Valérie Copié
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Chemistry and Biochemistry, Montana State University, Bozeman, MT, 59717-3400, USA
| | - Hamid R Eghbalnia
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada
- Department of Molecular Biology and Biophysics, UConn Health, Farmington, CT, 06030-3305, USA
| | - Robert Powers
- Metabolomics Association of North America (MANA), NMR Special Interest Group, Edmonton, Canada.
- Department of Chemistry, Nebraska Center for Integrated Biomolecular Communication, University of Nebraska-Lincoln, 722 Hamilton Hall, Lincoln, NE, 68588-0304, USA.
23
Kirchoff KE, Wellnitz J, Hochuli JE, Maxfield T, Popov KI, Gomez S, Tropsha A. Utilizing Low-Dimensional Molecular Embeddings for Rapid Chemical Similarity Search. ADVANCES IN INFORMATION RETRIEVAL : ... EUROPEAN CONFERENCE ON IR RESEARCH, ECIR ... PROCEEDINGS. EUROPEAN CONFERENCE ON IR RESEARCH 2024; 14609:34-49. [PMID: 38585224 PMCID: PMC10998712 DOI: 10.1007/978-3-031-56060-6_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Nearest neighbor-based similarity searching is a common task in chemistry, with notable use cases in drug discovery. Yet, some of the most commonly used approaches for this task still leverage a brute-force approach. In practice this can be computationally costly and overly time-consuming, due in part to the sheer size of modern chemical databases. Previous computational advancements for this task have generally relied on improvements to hardware or dataset-specific tricks that lack generalizability. Approaches that leverage lower-complexity searching algorithms remain relatively underexplored. However, many of these algorithms are approximate solutions and/or struggle with typical high-dimensional chemical embeddings. Here we evaluate whether a combination of low-dimensional chemical embeddings and a k-d tree data structure can achieve fast nearest neighbor queries while maintaining performance on standard chemical similarity search benchmarks. We examine different dimensionality reductions of standard chemical embeddings as well as a learned, structurally-aware embedding, SmallSA, for this task. With this framework, searches on over one billion chemicals execute in less than a second on a single CPU core, five orders of magnitude faster than the brute-force approach. We also demonstrate that SmallSA achieves competitive performance on chemical similarity benchmarks.
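The pairing of low-dimensional embeddings with a k-d tree described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the random 4-dimensional vectors merely stand in for chemical embeddings, and the names build_kdtree, nearest and sq_dist are hypothetical. The sketch builds a tree, runs one exact nearest-neighbor query, and computes a brute-force answer for comparison.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kdtree(points, depth=0):
    """Build a k-d tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])          # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                 # median point splits the space
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))


def nearest(node, query, depth=0, best=None):
    """Return the stored point closest to query (exact search)."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(query, point) < sq_dist(query, best):
        best = point
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Visit the far branch only if the splitting plane is closer
    # than the best match found so far.
    if diff ** 2 < sq_dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best


# Assumption: random vectors stand in for low-dimensional embeddings.
random.seed(0)
embeddings = [tuple(random.random() for _ in range(4)) for _ in range(500)]
tree = build_kdtree(embeddings)
query = tuple(random.random() for _ in range(4))

hit = nearest(tree, query)
brute = min(embeddings, key=lambda p: sq_dist(p, query))
```

The pruning test in nearest is what lets a query skip most of the tree when the dimensionality is low; in high-dimensional spaces the test rarely prunes, which is consistent with the abstract's emphasis on low-dimensional embeddings.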
Affiliation(s)
- Shawn Gomez
- Department of Pharmacology, UNC Chapel Hill
- Joint Department of Biomedical Engineering at UNC Chapel Hill and NCSU
24
Han J, Kwon Y, Choi YS, Kang S. Improving chemical reaction yield prediction using pre-trained graph neural networks. J Cheminform 2024; 16:25. [PMID: 38429787 PMCID: PMC10905905 DOI: 10.1186/s13321-024-00818-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Accepted: 02/19/2024] [Indexed: 03/03/2024] Open
Abstract
Graph neural networks (GNNs) have proven to be effective in the prediction of chemical reaction yields. However, their performance tends to deteriorate when they are trained using an insufficient training dataset in terms of quantity or diversity. A promising solution to alleviate this issue is to pre-train a GNN on a large-scale molecular database. In this study, we investigate the effectiveness of GNN pre-training in chemical reaction yield prediction. We present a novel GNN pre-training method for performance improvement. Given a molecular database consisting of a large number of molecules, we calculate molecular descriptors for each molecule and reduce the dimensionality of these descriptors by applying principal component analysis. We define a pre-text task by assigning a vector of principal component scores as the pseudo-label to each molecule in the database. A GNN is then pre-trained to perform the pre-text task of predicting the pseudo-label for the input molecule. For chemical reaction yield prediction, a prediction model is initialized using the pre-trained GNN and then fine-tuned with the training dataset containing chemical reactions and their yields. We demonstrate the effectiveness of the proposed method through experimental evaluation on benchmark datasets.
Affiliation(s)
- Jongmin Han
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea
| | - Youngchun Kwon
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea
| | - Youn-Suk Choi
- Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea.
| | - Seokho Kang
- Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea.
25
Kutsal M, Ucar F, Kati N. Computational drug discovery on human immunodeficiency virus with a customized long short-term memory variational autoencoder deep-learning architecture. CPT Pharmacometrics Syst Pharmacol 2024; 13:308-316. [PMID: 38010989 PMCID: PMC10864928 DOI: 10.1002/psp4.13085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/17/2023] [Revised: 11/01/2023] [Accepted: 11/07/2023] [Indexed: 11/29/2023] Open
Abstract
Despite attempts to control the spread of human immunodeficiency virus (HIV) through the use of anti-HIV medications, the absence of an effective vaccine continues to present a significant obstacle. In addition, the development of drug resistance by HIV underscores the necessity for computational drug discovery methods to identify novel therapies. This investigation specifically focused on employing a long short-term memory (LSTM) variational autoencoder deep-learning architecture for computational drug discovery in relation to HIV. Our data set comprised simplified molecular input line entry system (SMILES)-encoded compounds, which were used to train the LSTM autoencoder. Remarkably, our model achieved a training accuracy of 91% with a data set containing 1377 compounds. Leveraging the generative model derived from the training phase, we generated potential new drugs for combating HIV and assessed their interaction with the virus using a previously developed artificial intelligence model. Lastly, we verified the drug-likeness of our computationally generated compounds in accordance with Lipinski's rule of five. Overall, our study presents a promising approach to computational drug discovery in the ongoing battle against HIV.
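The Lipinski rule-of-five check mentioned at the end of this abstract can be sketched as below. This is an illustrative sketch, not the authors' code: the Descriptors class, the function names and the two example descriptor sets are hypothetical, and the descriptor values are assumed to have been computed elsewhere (for example with a cheminformatics toolkit). The thresholds are the standard ones: molecular weight at most 500 Da, logP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. Allowing one violation is a common convention and is exposed here as a parameter, since the abstract does not say which variant was used.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in Da
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five criteria are violated."""
    return sum([
        d.mol_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """True if the molecule has at most max_violations violations."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small = Descriptors(mol_weight=350.0, log_p=2.5, h_bond_donors=2, h_bond_acceptors=5)
large = Descriptors(mol_weight=720.0, log_p=6.3, h_bond_donors=6, h_bond_acceptors=12)
```

With these inputs, small violates none of the criteria and large violates all four, so only small passes under either the strict (max_violations=0) or the lenient (max_violations=1) convention.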
Affiliation(s)
- Mucahit Kutsal
- Institute of Theoretical Physics and Astrophysics, Quantum Information Technology, University of Gdańsk, Gdańsk, Poland
| | - Ferhat Ucar
- Faculty of Technology, Software Engineering, Fırat University, Elazig, Turkey
| | - Nida Kati
- Faculty of Technology, Materials and Metallurgical Engineering, Fırat University, Elazig, Turkey
26
Xiao F, Ding X, Shi Y, Wang D, Wang Y, Cui C, Zhu T, Chen K, Xiang P, Luo X. Application of ensemble learning for predicting GABAA receptor agonists. Comput Biol Med 2024; 169:107958. [PMID: 38194778 DOI: 10.1016/j.compbiomed.2024.107958] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 12/29/2023] [Accepted: 01/01/2024] [Indexed: 01/11/2024]
Abstract
BACKGROUND Over the past few decades, agonists binding to the benzodiazepine site of the GABAA receptor have been successfully developed as clinical drugs. Different modulators (agonist, antagonist, and reverse agonist) bound to benzodiazepine sites exhibit different or even opposite pharmacological effects; however, their structures are so similar that it is difficult to distinguish them based solely on molecular skeleton. This study aims to develop classification models for predicting the agonists. METHODS A total of 306 agonists or non-agonists were collected from the literature. Six machine learning algorithms (RF, XGBoost, AdaBoost, GBoost, SVM, and ANN) were employed for model development. Six descriptor types (1D/2D descriptors, ECFP4, 2D-Pharmacophore, MACCS, PubChem, and Estate fingerprints) were used to characterize chemical structures. Model interpretability was explored with the SHAP method. RESULTS The best model demonstrated an AUC value of 0.905 and an MCC value of 0.808 for the test set. The PubMac-based model (PubMac-GB) achieved the best AUC value of 0.935 for the test set. The SHAP analysis results emphasized that MaccsFP62, ECFP_624, ECFP_724, and PubchemFP213 were the crucial molecular features. Applicability domain analysis was also performed to determine reliable prediction boundaries for the model. The PubMac-GB model was applied to virtual screening for potential GABAA agonists and the top 100 compounds were given. CONCLUSION Overall, our ensemble learning-based model (PubMac-GB) achieved comparable performance and would be helpful in effectively identifying agonists of GABAA receptors.
Affiliation(s)
- Fu Xiao
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing, 210023, China; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China
| | - Xiaoyu Ding
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Yan Shi
- Academy of Forensic Science, Shanghai Key Laboratory of Forensic Medicine, Shanghai Forensic Service Platform, Key Laboratory of Forensic Science, Ministry of Justice, Shanghai, 200063, China
| | - Dingyan Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Yitian Wang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Chen Cui
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Tingfei Zhu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Kaixian Chen
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing, 210023, China; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China
| | - Ping Xiang
- Academy of Forensic Science, Shanghai Key Laboratory of Forensic Medicine, Shanghai Forensic Service Platform, Key Laboratory of Forensic Science, Ministry of Justice, Shanghai, 200063, China.
| | - Xiaomin Luo
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing, 210023, China; Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai, 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing, 100049, China.
27
Song Z, Chen J, Cheng J, Chen G, Qi Z. Computer-Aided Molecular Design of Ionic Liquids as Advanced Process Media: A Review from Fundamentals to Applications. Chem Rev 2024; 124:248-317. [PMID: 38108629 DOI: 10.1021/acs.chemrev.3c00223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The unique physicochemical properties, flexible structural tunability, and giant chemical space of ionic liquids (ILs) provide them a great opportunity to match different target properties to work as advanced process media. The crux of the matter is how to efficiently and reliably tailor suitable ILs toward a specific application. In this regard, the computer-aided molecular design (CAMD) approach has been widely adapted to cover this family of high-profile chemicals, that is, to perform computer-aided IL design (CAILD). This review discusses the past developments that have contributed to the state-of-the-art of CAILD and provides a perspective about how future works could pursue the acceleration of the practical application of ILs. In a broad context of CAILD, key aspects related to the forward structure-property modeling and reverse molecular design of ILs are overviewed. For the former forward task, diverse IL molecular representations, modeling algorithms, as well as representative models on physical properties, thermodynamic properties, among others of ILs are introduced. For the latter reverse task, representative works formulating different molecular design scenarios are summarized. Beyond the substantial progress made, some future perspectives to move CAILD a step forward are finally provided.
Affiliation(s)
- Zhen Song
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Jiahui Chen
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Jie Cheng
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Guzhong Chen
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Zhiwen Qi
- State Key Laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
28
Zhu J, Che C, Jiang H, Xu J, Yin J, Zhong Z. SSF-DDI: a deep learning method utilizing drug sequence and substructure features for drug-drug interaction prediction. BMC Bioinformatics 2024; 25:39. [PMID: 38262923 PMCID: PMC10810255 DOI: 10.1186/s12859-024-05654-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024] Open
Abstract
BACKGROUND Drug-drug interactions (DDI) are prevalent in combination therapy, which makes it important to identify and predict potential DDI. While various artificial intelligence methods can predict and identify potential DDI, they often overlook the sequence information of drug molecules and fail to comprehensively consider the contribution of molecular substructures to DDI. RESULTS In this paper, we propose a novel model for DDI prediction based on sequence and substructure features (SSF-DDI) to address these issues. Our model integrates drug sequence features and structural features from the drug molecule graph, providing enhanced information for DDI prediction and enabling a more comprehensive and accurate representation of drug molecules. CONCLUSION The results of experiments and case studies demonstrate that SSF-DDI significantly outperforms state-of-the-art DDI prediction models across multiple real datasets and settings. SSF-DDI performs better in predicting DDI involving unknown drugs, resulting in a 5.67% improvement in accuracy compared to state-of-the-art methods.
Affiliation(s)
- Jing Zhu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Chao Che
- School of Software Engineering, Dalian University, Dalian, 116000, China
| | - Hao Jiang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Jian Xu
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Jiajun Yin
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Zhaoqian Zhong
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China.
29
Zdrazil B, Guha R, Martinez-Mayorga K, Jeliazkova N. Are new ideas harder to find? A note on incremental research and Journal of Cheminformatics' Scientific Contribution Statement. J Cheminform 2024; 16:6. [PMID: 38221625 PMCID: PMC10789001 DOI: 10.1186/s13321-023-00798-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/16/2024] Open
Affiliation(s)
- Barbara Zdrazil
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK.
| | - Rajarshi Guha
- Vertex Pharmaceuticals, 50 Northern Ave, 02210, Boston, MA, USA
| | - Karina Martinez-Mayorga
- Institute of Chemistry, National Autonomous University of Mexico, Campus Merida, Merida-Tetiz Highway, Km. 4.5, Ucu, Yucatan, Mexico
30
Bi X, Lin L, Chen Z, Ye J. Artificial Intelligence for Surface-Enhanced Raman Spectroscopy. SMALL METHODS 2024; 8:e2301243. [PMID: 37888799 DOI: 10.1002/smtd.202301243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Revised: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Surface-enhanced Raman spectroscopy (SERS), well acknowledged as a sensitive fingerprinting analytical technique, has shown high practical value in a broad range of fields, including biomedicine, environmental protection, and food safety, among others. In the endless pursuit of ever more sensitive, robust, and comprehensive sensing and imaging, advancements keep emerging across the whole SERS pipeline, from the design of SERS substrates and reporter molecules, synthetic route planning, and instrument refinement to data preprocessing and analysis methods. Artificial intelligence (AI), which is created to imitate and eventually exceed human behaviors, has exhibited its power in learning high-level representations and recognizing complicated patterns with exceptional automaticity. Therefore, faced with intertwined influential factors and explosive data size, AI has been increasingly leveraged in all the above-mentioned aspects of SERS, showing high efficiency in accelerating systematic optimization and deepening understanding of the fundamental physics and spectral data, which far exceeds human labor and conventional computation. In this review, recent progress in SERS through the integration of AI is summarized, and new insights into the challenges and perspectives are provided with the aim of putting SERS on a fast track.
Affiliation(s)
- Xinyuan Bi
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - Li Lin
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - Zhou Chen
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
| | - Jian Ye
- State Key Laboratory of Systems Medicine for Cancer, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030, P. R. China
- Institute of Medical Robotics, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
- Shanghai Key Laboratory of Gynecologic Oncology, Ren Ji Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, 200127, P. R. China
31
Qin R, Zhang H, Huang W, Shao Z, Lei J. Deep learning-based design and screening of benzimidazole-pyrazine derivatives as adenosine A2B receptor antagonists. J Biomol Struct Dyn 2023:1-17. [PMID: 38133953 DOI: 10.1080/07391102.2023.2295974] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Accepted: 12/11/2023] [Indexed: 12/24/2023]
Abstract
The adenosine A2B receptor (A2BAR) is considered a novel potential target for the immunotherapy of cancer, and A2BAR antagonists have an inhibitory effect on tumor growth, proliferation, and metastasis. In our previous studies, we identified a class of benzimidazole-pyrazine scaffolds whose derivatives exhibited the antagonistic effect but lacked subtype selectivity towards A2BAR. In this work, we developed a scaffold-based protocol that incorporates a deep generative model and multilayer virtual screening to design benzimidazole-pyrazine derivatives as potential selective A2BAR antagonists. By utilizing a generative model with reported A2BAR antagonists as the training set, we built up a scaffold-focused library of benzimidazole-pyrazine derivatives and processed a virtual screening protocol to discover potential A2BAR antagonists. Finally, five molecules with different Bemis-Murcko scaffolds were identified and exhibited higher binding free energies than the reference molecule 12o. Further computational analysis revealed that the 3-benzyl derivative ABA-1266 presented high selectivity toward A2BAR and showed preferred druggability, providing future potent development of selective A2BAR antagonists. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Rui Qin
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Hao Zhang
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Weifeng Huang
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Zhenglin Shao
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
| | - Jinping Lei
- School of Pharmaceutical Sciences, Sun Yat-sen University, Guangzhou, China
32
Day EC, Chittari SS, Bogen MP, Knight AS. Navigating the Expansive Landscapes of Soft Materials: A User Guide for High-Throughput Workflows. ACS POLYMERS AU 2023; 3:406-427. [PMID: 38107416 PMCID: PMC10722570 DOI: 10.1021/acspolymersau.3c00025] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/15/2023] [Revised: 11/02/2023] [Accepted: 11/07/2023] [Indexed: 12/19/2023]
Abstract
Synthetic polymers are highly customizable with tailored structures and functionality, yet this versatility generates challenges in the design of advanced materials due to the size and complexity of the design space. Thus, exploration and optimization of polymer properties using combinatorial libraries has become increasingly common, which requires careful selection of synthetic strategies, characterization techniques, and rapid processing workflows to obtain fundamental principles from these large data sets. Herein, we provide guidelines for strategic design of macromolecule libraries and workflows to efficiently navigate these high-dimensional design spaces. We describe synthetic methods for multiple library sizes and structures as well as characterization methods to rapidly generate data sets, including tools that can be adapted from biological workflows. We further highlight relevant insights from statistics and machine learning to aid in data featurization, representation, and analysis. This Perspective acts as a "user guide" for researchers interested in leveraging high-throughput screening toward the design of multifunctional polymers and predictive modeling of structure-property relationships in soft materials.
Affiliation(s)
- Matthew P. Bogen
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Abigail S. Knight
- Department of Chemistry, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
33
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] Open
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Steven Shave
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jie Dong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
| | - Yumiao Gao
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jiancong Xie
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Vincent Blay
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
34
Essen CV, Luedeker D. In silico co-crystal design: Assessment of the latest advances. Drug Discov Today 2023; 28:103763. [PMID: 37689178 DOI: 10.1016/j.drudis.2023.103763] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/11/2022] [Revised: 08/18/2023] [Accepted: 08/31/2023] [Indexed: 09/11/2023]
Abstract
Pharmaceutical co-crystals represent a growing class of crystal forms in the context of pharmaceutical science. They are attractive to pharmaceutical scientists because they significantly expand the number of crystal forms that exist for an active pharmaceutical ingredient and can lead to improvements in physicochemical properties of clinical relevance. At the same time, machine learning is finding its way into all areas of drug discovery and delivers impressive results. In this review, we attempt to provide an overview of machine learning, deep learning and network-based recommendation approaches applied to pharmaceutical co-crystallization. We also present crystal structure prediction as an alternative to machine learning approaches.
35
John L, Nagamani S, Mahanta HJ, Vaikundamani S, Kumar N, Kumar A, Jamir E, Priyadarsinee L, Sastry GN. Molecular Property Diagnostic Suite Compound Library (MPDS-CL): a structure-based classification of the chemical space. Mol Divers 2023:10.1007/s11030-023-10752-1. [PMID: 37902900 DOI: 10.1007/s11030-023-10752-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2023] [Accepted: 10/17/2023] [Indexed: 11/01/2023]
Abstract
Molecular Property Diagnostic Suite Compound Library (MPDS-CL) is an open-source Galaxy-based cheminformatics web portal which presents a structure-based classification of molecules. A structure-based classification has been carried out on nearly 150 million unique compounds, obtained from 42 publicly available databases and curated for redundancy removal through 97 hierarchically well-defined atom composition-based portions. These were further subjected to a 56-bit fingerprint-based classification algorithm, which led to the formation of 56 structurally well-defined classes. The classes thus obtained were further divided into clusters based on their molecular weight. Thus, the entire set of molecules was put into 56 different classes and 625 clusters. This led to the assignment of a unique ID, named MPDS-AadharID, to each of these 149,169,443 molecules. MPDS-AadharID is akin to the unique number given to citizens in India (similar to the SSN in the US and the NINO in the UK). The unique features of MPDS-CL are (a) several search options, such as exact structure search, substructure search, property-based search and fingerprint-based search, using SMILES, InChIKey and key-in; (b) automatic generation of information for processing with MPDS and other Galaxy tools; (c) providing the class and cluster of a molecule, which makes it easier and faster to search for similar molecules; and (d) information related to the presence of the molecules in multiple databases. The MPDS-CL can be accessed at https://mpds.neist.res.in:8086/.
Affiliation(s)
- Lijo John
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Selvaraman Nagamani
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Hridoy Jyoti Mahanta
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - S Vaikundamani
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
| | - Nandan Kumar
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Asheesh Kumar
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
| | - Esther Jamir
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Lipsa Priyadarsinee
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - G Narahari Sastry
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India.
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India.
| |
|
36
|
Kichev I, Borislavov L, Tadjer A, Stoyanova R. Machine Learning Prediction of the Redox Activity of Quinones. Materials (Basel) 2023; 16:6687. [PMID: 37895669 PMCID: PMC10608659 DOI: 10.3390/ma16206687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 10/09/2023] [Accepted: 10/11/2023] [Indexed: 10/29/2023]
Abstract
The redox properties of quinones underlie their unique characteristics as organic battery components that outperform the conventional inorganic ones. Furthermore, these redox properties could be precisely tuned by using different substituent groups. Machine learning and statistics, on the other hand, have proven to be very powerful approaches for the efficient in silico design of novel materials. Herein, we demonstrated the machine learning approach for the prediction of the redox activity of quinones that potentially can serve as organic battery components. For the needs of the present study, a database of small quinone-derived molecules was created. A large number of quantum chemical and chemometric descriptors were generated for each molecule and, subsequently, different statistical approaches were applied to select the descriptors that most prominently characterized the relationship between the structure and the redox potential. Various machine learning methods for the screening of prospective organic battery electrode materials were deployed to select the most trustworthy strategy for the machine learning-aided design of organic redox materials. It was found that Ridge regression models perform better than Regression decision trees and Decision tree-based ensemble algorithms.
Affiliation(s)
- Ilia Kichev
- Institute of General and Inorganic Chemistry, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria
- Faculty of Chemistry and Pharmacy, University of Sofia, 1164 Sofia, Bulgaria
| | - Lyuben Borislavov
- Institute of General and Inorganic Chemistry, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria
| | - Alia Tadjer
- Faculty of Chemistry and Pharmacy, University of Sofia, 1164 Sofia, Bulgaria
| | - Radostina Stoyanova
- Institute of General and Inorganic Chemistry, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria
| |
|
37
|
Li J, Wu N, Zhang J, Wu HH, Pan K, Wang Y, Liu G, Liu X, Yao Z, Zhang Q. Machine Learning-Assisted Low-Dimensional Electrocatalysts Design for Hydrogen Evolution Reaction. Nanomicro Lett 2023; 15:227. [PMID: 37831203 PMCID: PMC10575847 DOI: 10.1007/s40820-023-01192-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/26/2023] [Accepted: 08/10/2023] [Indexed: 10/14/2023]
Abstract
Efficient electrocatalysts are crucial for hydrogen generation from electrolyzing water. Nevertheless, the conventional "trial and error" method for producing advanced electrocatalysts is not only cost-ineffective but also time-consuming and labor-intensive. Fortunately, the advancement of machine learning brings new opportunities for electrocatalyst discovery and design. By analyzing experimental and theoretical data, machine learning can effectively predict the hydrogen evolution reaction (HER) performance of electrocatalysts. This review summarizes recent developments in machine learning for low-dimensional electrocatalysts, including zero-dimensional nanoparticles and nanoclusters, one-dimensional nanotubes and nanowires, two-dimensional nanosheets, as well as other electrocatalysts. In particular, the effects of descriptors and algorithms on screening low-dimensional electrocatalysts and investigating their HER performance are highlighted. Finally, the future directions and perspectives for machine learning in electrocatalysis are discussed, emphasizing the potential for machine learning to accelerate electrocatalyst discovery, optimize their performance, and provide new insights into electrocatalytic mechanisms. Overall, this work offers an in-depth understanding of the current state of machine learning in electrocatalysis and its potential for future research.
Affiliation(s)
- Jin Li
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Naiteng Wu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Jian Zhang
- New Energy Technology Engineering Lab of Jiangsu Province, College of Science, Nanjing University of Posts and Telecommunications (NUPT), Nanjing, 210023, People's Republic of China
| | - Hong-Hui Wu
- School of Materials Science and Engineering, University of Science and Technology Beijing, Beijing, 100083, People's Republic of China.
- Department of Chemistry, University of Nebraska-Lincoln, Lincoln, NE, 8588, USA.
| | - Kunming Pan
- Henan Key Laboratory of High-Temperature Structural and Functional Materials, National Joint Engineering Research Center for Abrasion Control and Molding of Metal Materials, Henan University of Science and Technology, Luoyang, 471003, People's Republic of China
| | - Yingxue Wang
- National Engineering Laboratory for Risk Perception and Prevention, Beijing, 100041, People's Republic of China.
| | - Guilong Liu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China
| | - Xianming Liu
- College of Chemistry and Chemical Engineering, and Henan Key Laboratory of Function-Oriented Porous Materials, Luoyang Normal University, Luoyang, 471934, People's Republic of China.
| | - Zhenpeng Yao
- Center of Hydrogen Science, Shanghai Jiao Tong University, Shanghai, 200000, People's Republic of China
- State Key Laboratory of Metal Matrix Composites, School of Materials Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200000, People's Republic of China
| | - Qiaobao Zhang
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Materials, Xiamen University, Xiamen, 361005, People's Republic of China.
| |
|
38
|
Shilpa S, Kashyap G, Sunoj RB. Recent Applications of Machine Learning in Molecular Property and Chemical Reaction Outcome Predictions. J Phys Chem A 2023; 127:8253-8271. [PMID: 37769193 DOI: 10.1021/acs.jpca.3c04779] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 09/30/2023]
Abstract
Burgeoning developments in machine learning (ML) and its rapidly growing adaptations in chemistry are noteworthy. Motivated by the successful deployments of ML in the realm of molecular property prediction (MPP) and chemical reaction prediction (CRP), herein we highlight some of its most recent applications in predictive chemistry. We present a nonmathematical and concise overview of the progression of ML implementations, ranging from an ensemble-based random forest model to advanced graph neural network algorithms. Similarly, the prospects of various feature engineering and feature learning approaches that work in conjunction with ML models are described. Highly accurate predictions reported in MPP tasks (e.g., lipophilicity, solubility, distribution coefficient), using methods such as D-MPNN, MolCLR, SMILES-BERT, and MolBERT, offer promising avenues in molecular design and drug discovery. Whereas MPP pertains to a given molecule, ML applications in chemical reactions present a different level of challenge, primarily arising from the simultaneous involvement of multiple molecules and their diverse roles in a reaction setting. The reported RMSEs in MPP tasks range from 0.287 to 2.20, while those for yield predictions are well over 4.9 in the lower end, reaching thresholds of >10.0 in several examples. Our Review concludes with a set of persisting challenges in dealing with reaction data sets and an overall optimistic outlook on benefits of ML-driven workflows for various MPP as well as CRP tasks.
Affiliation(s)
- Shilpa Shilpa
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Gargee Kashyap
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| | - Raghavan B Sunoj
- Department of Chemistry, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
| |
|
39
|
Gao J, Shen Z, Xie Y, Lu J, Lu Y, Chen S, Bian Q, Guo Y, Shen L, Wu J, Zhou B, Hou T, He Q, Che J, Dong X. TransFoxMol: predicting molecular property with focused attention. Brief Bioinform 2023; 24:bbad306. [PMID: 37605947 DOI: 10.1093/bib/bbad306] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/07/2023] [Revised: 07/17/2023] [Accepted: 08/04/2023] [Indexed: 08/23/2023] Open
Abstract
Predicting the biological properties of molecules is crucial in computer-aided drug development, yet it is often impeded by data scarcity and imbalance in many practical applications. Existing approaches are based on self-supervised learning or 3D data and use an increasing number of parameters to improve performance. These approaches may not take full advantage of established chemical knowledge and could inadvertently introduce noise into the respective model. In this study, we introduce a more elegant transformer-based framework with focused attention for molecular representation (TransFoxMol) to improve the understanding that artificial intelligence (AI) has of molecular structure-property relationships. TransFoxMol incorporates a multi-scale 2D molecular environment into a graph neural network + Transformer module and uses prior chemical maps to obtain a more focused attention landscape than that obtained using existing approaches. Experimental results show that TransFoxMol achieves state-of-the-art performance on MoleculeNet benchmarks and surpasses the performance of baselines that use self-supervised learning or geometry-enhanced strategies on small-scale datasets. Subsequent analyses indicate that TransFoxMol's predictions are highly interpretable and that the clever use of chemical knowledge enables AI to perceive molecules in a simple but rational way, enhancing performance.
Affiliation(s)
- Jian Gao
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Zheyuan Shen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yufeng Xie
- School of Software Technology, Zhejiang University, Hangzhou, China
| | - Jialiang Lu
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yang Lu
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Sikang Chen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Qingyu Bian
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yue Guo
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
| | - Liteng Shen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jian Wu
- School of Software Technology, Zhejiang University, Hangzhou, China
| | - Binbin Zhou
- Department of Computer Science and Computing, Zhejiang University City College, Hangzhou, China
| | - Tingjun Hou
- State Key Lab of CAD&CG, College of Pharmaceutical Sciences, Zhejiang University, Zhejiang, China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
| | - Qiaojun He
- Institute of Pharmacology & Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, PR China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
- Centre for Drug Safety Evaluation and Research of ZJU, Hangzhou, 310058, PR China
- Cancer Center of Zhejiang University, Hangzhou, China
| | - Jinxin Che
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Xiaowu Dong
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
- Cancer Center of Zhejiang University, Hangzhou, China
- Department of Pharmacy, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| |
|
40
|
Zhang Y, Yu J, Song H, Yang M. Structure-Based Reaction Descriptors for Predicting Rate Constants by Machine Learning: Application to Hydrogen Abstraction from Alkanes by CH3/H/O Radicals. J Chem Inf Model 2023; 63:5097-5106. [PMID: 37561569 DOI: 10.1021/acs.jcim.3c00892] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 08/11/2023]
Abstract
Accurate determination of the thermal rate constants for combustion reactions is a highly challenging task, both experimentally and theoretically. Machine learning has proven to be a powerful tool for predicting reaction rate constants in recent years. In this work, three supervised machine learning algorithms, XGB, FNN, and XGB-FNN, are used to develop quantitative structure-property relationship models for the estimation of the rate constants of hydrogen abstraction reactions from alkanes by the free radicals CH3, H, and O. Molecular similarity based on Morgan molecular fingerprints, combined with topological indices, is proposed to represent chemical reactions in the machine learning models. Using the newly constructed descriptors, the hybrid XGB-FNN algorithm yields average deviations of 65.4%, 12.1%, and 64.5% on the prediction sets of alkanes + CH3, H, and O, respectively, a performance comparable or even superior to that obtained using the activation energy as a descriptor. The use of activation energy as a descriptor has previously been shown to significantly improve prediction accuracy (Fuel 2022, 322, 124150) but typically requires cumbersome ab initio calculations. In addition, the XGB-FNN models could reasonably predict reaction rate constants of hydrogen abstractions from different sites of alkanes and their isomers, indicating a good generalization ability. It is expected that the reaction descriptors proposed in this work can be applied to build machine learning models for other reactions.
Affiliation(s)
- Yu Zhang
- College of Physical Science and Technology, Huazhong Normal University, Wuhan 430079, China
- Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan 430071, China
| | - Jinhui Yu
- Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan 430071, China
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Hongwei Song
- Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan 430071, China
| | - Minghui Yang
- Key Laboratory of Magnetic Resonance in Biological Systems, State Key Laboratory of Magnetic Resonance and Atomic and Molecular Physics, National Center for Magnetic Resonance in Wuhan, Wuhan Institute of Physics and Mathematics, Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences, Wuhan 430071, China
- Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
| |
|
41
|
Hagg A, Kirschner KN. Open-Source Machine Learning in Computational Chemistry. J Chem Inf Model 2023; 63:4505-4532. [PMID: 37466636 PMCID: PMC10430767 DOI: 10.1021/acs.jcim.3c00643] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/04/2023] [Indexed: 07/20/2023]
Abstract
The field of computational chemistry has seen a significant increase in the integration of machine learning concepts and algorithms. In this Perspective, we surveyed 179 open-source software projects, with corresponding peer-reviewed papers published within the last 5 years, to better understand the topics within the field being investigated by machine learning approaches. For each project, we provide a short description, the link to the code, the accompanying license type, and whether the training data and resulting models are made publicly available. Based on those deposited in GitHub repositories, the most popular employed Python libraries are identified. We hope that this survey will serve as a resource to learn about machine learning or specific architectures thereof by identifying accessible codes with accompanying papers on a topic basis. To this end, we also include computational chemistry open-source software for generating training data and fundamental Python libraries for machine learning. Based on our observations and considering the three pillars of collaborative machine learning work, open data, open source (code), and open models, we provide some suggestions to the community.
Affiliation(s)
- Alexander Hagg
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Electrical Engineering, Mechanical Engineering and Technical Journalism, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| | - Karl N. Kirschner
- Institute of Technology, Resource and Energy-Efficient Engineering (TREE), University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
- Department of Computer Science, University of Applied Sciences Bonn-Rhein-Sieg, 53757 Sankt Augustin, Germany
| |
|
42
|
Papastergiou T, Azé J, Bringay S, Louet M, Poncelet P, Rosales-Hurtado M, Vo-Hoang Y, Licznar-Fajardo P, Docquier JD, Gavara L. Discovering NDM-1 inhibitors using molecular substructure embeddings representations. J Integr Bioinform 2023; 0:jib-2022-0050. [PMID: 37498676 PMCID: PMC10389050 DOI: 10.1515/jib-2022-0050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2022] [Accepted: 06/12/2023] [Indexed: 07/29/2023] Open
Abstract
NDM-1 (New-Delhi-Metallo-β-lactamase-1) is an enzyme developed by bacteria that is implicated in bacterial resistance to almost all known antibiotics. In this study, we deliver a new, curated NDM-1 bioactivities database, along with a set of unifying rules for managing different activity properties and inconsistencies. We define the activity classification problem in terms of Multiple Instance Learning (MIL), employing embeddings corresponding to molecular substructures, and present an ensemble ranking and classification framework, relying on a k-fold cross-validation method with a per-fold hyper-parameter optimization procedure, which shows promising generalization ability. The MIL paradigm displayed an improvement of up to 45.7% in terms of Balanced Accuracy in comparison to the classical Machine Learning paradigm. Moreover, we investigate different compact molecular representations, based on atomic or bi-atomic substructures. Finally, we scanned DrugBank for strongly active compounds and we present the top-15 ranked compounds.
Affiliation(s)
- Thomas Papastergiou
- LIRMM, University of Montpellier, CNRS, 34095 Montpellier, France
- IBMM, CNRS, University of Montpellier, ENSCM, 34293 Montpellier, France
| | - Jérôme Azé
- LIRMM, University of Montpellier, CNRS, 34095 Montpellier, France
| | - Sandra Bringay
- LIRMM, University of Montpellier, CNRS, 34095 Montpellier, France
- AMIS, Paul Valery University, 34199 Montpellier, France
| | - Maxime Louet
- IBMM, CNRS, University of Montpellier, ENSCM, 34293 Montpellier, France
| | - Pascal Poncelet
- LIRMM, University of Montpellier, CNRS, 34095 Montpellier, France
| | | | - Yen Vo-Hoang
- IBMM, CNRS, University of Montpellier, ENSCM, 34293 Montpellier, France
| | | | - Jean-Denis Docquier
- Department of Medical Biotechnologies, University of Siena, I-53100 Siena, Italy
| | - Laurent Gavara
- IBMM, CNRS, University of Montpellier, ENSCM, 34293 Montpellier, France
| |
|
43
|
Szulc NA, Mackiewicz Z, Bujnicki JM, Stefaniak F. Structural interaction fingerprints and machine learning for predicting and explaining binding of small molecule ligands to RNA. Brief Bioinform 2023; 24:bbad187. [PMID: 37204195 DOI: 10.1093/bib/bbad187] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 12/27/2022] [Revised: 04/07/2023] [Accepted: 04/25/2023] [Indexed: 05/20/2023] Open
Abstract
Ribonucleic acids (RNAs) play crucial roles in living organisms and some of them, such as bacterial ribosomes and precursor messenger RNA, are targets of small molecule drugs, whereas others, e.g. bacterial riboswitches or viral RNA motifs, are considered potential therapeutic targets. Thus, the continuous discovery of new functional RNA increases the demand for developing compounds targeting them and for methods for analyzing RNA-small molecule interactions. We recently developed fingeRNAt, a software tool for detecting non-covalent bonds formed within complexes of nucleic acids with different types of ligands. The program detects several non-covalent interactions and encodes them as a structural interaction fingerprint (SIFt). Here, we present the application of SIFts accompanied by machine learning methods for binding prediction of small molecules to RNA. We show that SIFt-based models outperform the classic, general-purpose scoring functions in virtual screening. We also employed Explainable Artificial Intelligence (XAI) methods, namely SHapley Additive exPlanations, Local Interpretable Model-agnostic Explanations and others, to help understand the decision-making process behind the predictive models. We conducted a case study in which we applied XAI on a predictive model of ligand binding to human immunodeficiency virus type 1 trans-activation response element RNA to distinguish between residues and interaction types important for binding. We also used XAI to indicate whether an interaction has a positive or negative effect on binding prediction and to quantify its impact. Our results obtained using all XAI methods were consistent with the literature data, demonstrating the utility and importance of XAI in medicinal chemistry and bioinformatics.
Affiliation(s)
- Natalia A Szulc
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
- Laboratory of Protein Metabolism, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Zuzanna Mackiewicz
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
- Laboratory of RNA Biology - ERA Chairs Group, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Janusz M Bujnicki
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| | - Filip Stefaniak
- Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, 4 Ks. Trojdena Str, 02-109 Warsaw, Poland
| |
|
44
|
Dou B, Zhu Z, Merkurjev E, Ke L, Chen L, Jiang J, Zhu Y, Liu J, Zhang B, Wei GW. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem Rev 2023; 123:8736-8780. [PMID: 37384816 PMCID: PMC10999174 DOI: 10.1021/acs.chemrev.3c00189] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Indexed: 07/01/2023]
Abstract
Small data are often used in scientific and engineering research due to the presence of various constraints, such as time, cost, ethics, privacy, security, and technical limitations in data acquisition. However, while big data have been the focus for the past decade, small data and their challenges have received little attention, even though they are technically more severe in machine learning (ML) and deep learning (DL) studies. Overall, the small data challenge is often compounded by issues such as data diversity, imputation, noise, imbalance, and high dimensionality. Fortunately, the current big data era is characterized by technological breakthroughs in ML, DL, and artificial intelligence (AI), which enable data-driven scientific discovery, and many advanced ML and DL technologies developed for big data have inadvertently provided solutions for small data problems. As a result, significant progress has been made in ML and DL for small data challenges in the past decade. In this review, we summarize and analyze several emerging potential solutions to small data challenges in molecular science, including chemical and biological sciences. We review both basic machine learning algorithms, such as linear regression, logistic regression (LR), k-nearest neighbor (KNN), support vector machine (SVM), kernel learning (KL), random forest (RF), and gradient boosting trees (GBT), and more advanced techniques, including artificial neural network (ANN), convolutional neural network (CNN), U-Net, graph neural network (GNN), Generative Adversarial Network (GAN), long short-term memory (LSTM), autoencoder, transformer, transfer learning, active learning, graph-based semi-supervised learning, combining deep learning with traditional machine learning, and physical model-based data augmentation. We also briefly discuss the latest advances in these methods. Finally, we conclude the survey with a discussion of promising trends in small data challenges in molecular science.
Affiliation(s)
- Bozheng Dou
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Zailiang Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Ekaterina Merkurjev
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Lu Ke
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Long Chen
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jian Jiang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
- Yueying Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Jie Liu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Bengong Zhang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University, Wuhan 430200, P. R. China
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan 48824, United States
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, United States
45
Taylor CJ, Felton KC, Wigh D, Jeraal MI, Grainger R, Chessari G, Johnson CN, Lapkin AA. Accelerated Chemical Reaction Optimization Using Multi-Task Learning. ACS Central Science 2023; 9:957-968. [PMID: 37252348 PMCID: PMC10214532 DOI: 10.1021/acscentsci.3c00050] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 01/10/2023] [Indexed: 05/31/2023]
Abstract
Functionalization of C-H bonds is a key challenge in medicinal chemistry, particularly for fragment-based drug discovery (FBDD) where such transformations require execution in the presence of polar functionality necessary for protein binding. Recent work has shown the effectiveness of Bayesian optimization (BO) for the self-optimization of chemical reactions; however, in all previous cases these algorithmic procedures have started with no prior information about the reaction of interest. In this work, we explore the use of multitask Bayesian optimization (MTBO) in several in silico case studies by leveraging reaction data collected from historical optimization campaigns to accelerate the optimization of new reactions. This methodology was then translated to real-world, medicinal chemistry applications in the yield optimization of several pharmaceutical intermediates using an autonomous flow-based reactor platform. The use of the MTBO algorithm was shown to be successful in determining optimal conditions of unseen experimental C-H activation reactions with differing substrates, demonstrating an efficient optimization strategy with large potential cost reductions when compared to industry-standard process optimization techniques. Our findings highlight the effectiveness of the methodology as an enabling tool in medicinal chemistry workflows, representing a step-change in the utilization of data and machine learning with the goal of accelerated reaction optimization.
Affiliation(s)
- Connor J. Taylor
  - Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge, CB4 0QA, United Kingdom
  - Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, United Kingdom
- Kobi C. Felton
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, United Kingdom
- Daniel Wigh
  - Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, United Kingdom
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, United Kingdom
- Mohammed I. Jeraal
  - Cambridge Centre for Advanced Research and Education in Singapore Ltd., 1 Create Way, CREATE Tower #05-05, 138602, Singapore
- Rachel Grainger
  - Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge, CB4 0QA, United Kingdom
- Gianni Chessari
  - Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge, CB4 0QA, United Kingdom
- Christopher N. Johnson
  - Astex Pharmaceuticals, 436 Cambridge Science Park, Milton Road, Cambridge, CB4 0QA, United Kingdom
- Alexei A. Lapkin
  - Innovation Centre in Digital Molecular Technologies, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, United Kingdom
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, United Kingdom
  - Cambridge Centre for Advanced Research and Education in Singapore Ltd., 1 Create Way, CREATE Tower #05-05, 138602, Singapore
46
Guha R, Velegol D. Harnessing Shannon entropy-based descriptors in machine learning models to enhance the prediction accuracy of molecular properties. J Cheminform 2023; 15:54. [PMID: 37211605 DOI: 10.1186/s13321-023-00712-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/07/2022] [Accepted: 03/18/2023] [Indexed: 05/23/2023] Open
Abstract
Accurate prediction of molecular properties is essential in the screening and development of drug molecules and other functional materials. Traditionally, property-specific molecular descriptors are used in machine learning models. This in turn requires the identification and development of target or problem-specific descriptors. Additionally, an increase in the prediction accuracy of the model is not always feasible from the standpoint of targeted descriptor usage. We explored the accuracy and generalizability issues using a framework of Shannon entropies, based on SMILES, SMARTS and/or InChIKey strings of respective molecules. Using various public databases of molecules, we showed that the accuracy of the prediction of machine learning models could be significantly enhanced simply by using Shannon entropy-based descriptors evaluated directly from SMILES. Analogous to partial pressures and total pressure of gases in a mixture, we used atom-wise fractional Shannon entropy in combination with total Shannon entropy from respective tokens of the string representation to model the molecule efficiently. The proposed descriptor was competitive in performance with standard descriptors such as Morgan fingerprints and SHED in regression models. Additionally, we found that either a hybrid descriptor set containing the Shannon entropy-based descriptors or an optimized, ensemble architecture of multilayer perceptrons and graph neural networks using the Shannon entropies was synergistic to improve the prediction accuracy. This simple approach of coupling the Shannon entropy framework to other standard descriptors and/or using it in ensemble models could find applications in boosting the performance of molecular property predictions in chemistry and materials science.
Affiliation(s)
- Rajarshi Guha
  - Intel Corporation, 2501 NE Century Blvd, Hillsboro, OR, 97124, USA
- Darrell Velegol
  - Department of Chemical Engineering, Pennsylvania State University, University Park, PA, 16802, USA
47
Chhaganlal MN, Underhaug J, Mjøs SA. Evaluation of NMR predictors for accuracy and ability to reveal trends in 1H NMR spectra of fatty acids. Magnetic Resonance in Chemistry: MRC 2023; 61:318-332. [PMID: 36759332 DOI: 10.1002/mrc.5336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2023] [Revised: 02/04/2023] [Accepted: 02/07/2023] [Indexed: 06/18/2023]
Abstract
Four different nuclear magnetic resonance (NMR) predictors have been evaluated for their ability to predict 600-MHz 1H spectra of free fatty acids and fatty acid methyl esters of 20 common fatty acids. The predictors were evaluated on two main criteria: (1) their accuracy in direct prediction of the spectra (absolute accuracy) and (2) the ability to reveal trends or predict the change that occurs in the spectra as a result of a change in the fatty acid carbon chain, or by esterification of the free fatty acids to methyl esters (relative accuracy). The absolute accuracy in chemical shift prediction for fatty acids was good, compared with previous reports on a broader range of compounds. All four predictors had median prediction errors for chemical shifts of the signals in fatty acid methyl esters well below 0.1 ppm and as low as 0.015 ppm for one of the predictors. However, all predictors also had outliers with errors far above the upper interquartile range. In general, they also failed to reproduce trends of diagnostic value that were observed in the experimental data or to properly predict the result of a minor change in molecular structure. All four predictors depend on experimental data from different origins. This may be a limiting factor for the relative accuracy of the predictors.
Affiliation(s)
- Jarl Underhaug
  - Department of Chemistry, University of Bergen, Bergen, Norway
- Svein A Mjøs
  - Department of Chemistry, University of Bergen, Bergen, Norway
48
Shirokii N, Din Y, Petrov I, Seregin Y, Sirotenko S, Razlivina J, Serov N, Vinogradov V. Quantitative Prediction of Inorganic Nanomaterial Cellular Toxicity via Machine Learning. Small (Weinheim an der Bergstrasse, Germany) 2023; 19:e2207106. [PMID: 36772908 DOI: 10.1002/smll.202207106] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 11/15/2022] [Revised: 01/09/2023] [Indexed: 05/11/2023]
Abstract
Organic chemistry has seen colossal progress due to machine learning (ML). However, the translation of artificial intelligence (AI) into materials science is challenging, and biological behavior prediction is even more complicated. Nanotoxicity is a critical parameter that describes the interaction of nanomaterials with living organisms and is screened in every bio-related study. To prevent excessive experiments, such properties have to be pre-evaluated. Several existing ML models partially fill the gap by predicting whether a nanomaterial is toxic or not. Yet, this binary categorization neglects the concentration dependencies crucial for experimental scientists. Here, an ML-based approach is proposed for the quantitative prediction of inorganic nanomaterial cytotoxicity, achieving a precision expressed by 10-fold cross-validation (CV) Q² = 0.86 with a root mean squared error (RMSE) of 12.2%, obtained by correlation-based feature selection and grid search-based optimization of model hyperparameters. To provide further model flexibility, quantitative atom property-based nanomaterial descriptors are introduced, allowing the model to extrapolate to unseen samples. Feature importance is calculated to find an interpretable model with optimal decision-making. These findings allow experimental scientists to perform primary in silico candidate screening and minimize the number of excessive, labor-intensive experiments, enabling the rapid development of nanomaterials for medicinal purposes.
Affiliation(s)
- Nikolai Shirokii
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Yevgeniya Din
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Ilya Petrov
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Yurii Seregin
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Sofia Sirotenko
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Julia Razlivina
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
- Nikita Serov
  - Advanced Engineering School, Almetyevsk State Oil Institute, Almetyevsk, Russia
- Vladimir Vinogradov
  - International Institute "Solution Chemistry of Advanced Materials and Technologies", ITMO University, 191002, Saint-Petersburg, Russian Federation
49
Sarmiento Varón L, González-Puelma J, Medina-Ortiz D, Aldridge J, Alvarez-Saravia D, Uribe-Paredes R, Navarrete MA. The role of machine learning in health policies during the COVID-19 pandemic and in long COVID management. Front Public Health 2023; 11:1140353. [PMID: 37113165 PMCID: PMC10126380 DOI: 10.3389/fpubh.2023.1140353] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/08/2023] [Accepted: 03/20/2023] [Indexed: 04/29/2023] Open
Abstract
The ongoing COVID-19 pandemic is arguably one of the most challenging health crises in modern times. The development of effective strategies to control the spread of SARS-CoV-2 was a major goal for governments and policy makers. Mathematical modeling and machine learning emerged as potent tools to guide and optimize the different control measures. This review briefly summarizes the evolution of the SARS-CoV-2 pandemic during the first 3 years. It details the main public health challenges, focusing on the contribution of mathematical modeling to the design and guidance of government action plans and interventions to mitigate the spread of SARS-CoV-2. Next, it describes the application of machine learning methods in a series of case studies, including COVID-19 clinical diagnosis, the analysis of epidemiological variables, and drug discovery by protein engineering techniques. Lastly, it explores the use of machine learning tools for investigating long COVID, by identifying patterns and relationships of symptoms, predicting risk indicators, and enabling early evaluation of COVID-19 sequelae.
Affiliation(s)
- Jorge González-Puelma
  - Centro Asistencial Docente y de Investigación, Universidad de Magallanes, Punta Arenas, Chile
  - Escuela de Medicina, Universidad de Magallanes, Punta Arenas, Chile
- David Medina-Ortiz
  - Departamento de Ingeniería en Computación, Facultad de Ingeniería, Universidad de Magallanes, Punta Arenas, Chile
- Jacqueline Aldridge
  - Departamento de Ingeniería en Computación, Facultad de Ingeniería, Universidad de Magallanes, Punta Arenas, Chile
- Diego Alvarez-Saravia
  - Centro Asistencial Docente y de Investigación, Universidad de Magallanes, Punta Arenas, Chile
  - Escuela de Medicina, Universidad de Magallanes, Punta Arenas, Chile
- Roberto Uribe-Paredes
  - Departamento de Ingeniería en Computación, Facultad de Ingeniería, Universidad de Magallanes, Punta Arenas, Chile
- Marcelo A. Navarrete
  - Centro Asistencial Docente y de Investigación, Universidad de Magallanes, Punta Arenas, Chile
  - Escuela de Medicina, Universidad de Magallanes, Punta Arenas, Chile
50
Wigh DS, Tissot M, Pasau P, Goodman JM, Lapkin AA. Quantitative In Silico Prediction of the Rate of Protodeboronation by a Mechanistic Density Functional Theory-Aided Algorithm. J Phys Chem A 2023; 127:2628-2636. [PMID: 36916916 PMCID: PMC10041635 DOI: 10.1021/acs.jpca.2c08250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/15/2023]
Abstract
Computational reaction prediction has become a ubiquitous task in chemistry due to the potential value accurate predictions can bring to chemists. Boronic acids are widely used in industry; however, understanding how to avoid the protodeboronation side reaction remains a challenge. We have developed an algorithm for in silico prediction of the rate of protodeboronation of boronic acids. A general mechanistic model devised through kinetic studies of protodeboronation was found in the literature and forms the foundation on which the algorithm presented in this work is built. Protodeboronation proceeds through 7 distinct pathways, though for any particular boronic acid, only a subset of mechanistic pathways are active. The rate of each active mechanistic pathway is linearly correlated with its characteristic energy difference, which in turn can be determined using Density Functional Theory. We validated the algorithm using leave-one-out cross-validation on a data set of 50 boronic acids and made a further 50 rate predictions on academically and industrially important boronic acids out of sample. We believe this work will provide great assistance to chemists performing reactions that feature boronic acids, such as Suzuki-Miyaura and Chan-Evans-Lam couplings.
Affiliation(s)
- Daniel S Wigh
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, CB3 0AS Cambridge, U.K.
- Jonathan M Goodman
  - Yusuf Hamied Department of Chemistry, University of Cambridge, CB2 1EW Cambridge, U.K.
- Alexei A Lapkin
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, CB3 0AS Cambridge, U.K.