1
Le MHN, Nguyen PK, Nguyen TPT, Nguyen HQ, Tam DNH, Huynh HH, Huynh PK, Le NQK. An in-depth review of AI-powered advancements in cancer drug discovery. Biochim Biophys Acta Mol Basis Dis 2025; 1871:167680. [PMID: 39837431 DOI: 10.1016/j.bbadis.2025.167680] [Received: 10/18/2024] [Revised: 01/12/2025] [Accepted: 01/16/2025] [Indexed: 01/23/2025]
Abstract
The convergence of artificial intelligence (AI) and genomics is redefining cancer drug discovery by facilitating the development of personalized and effective therapies. This review examines the transformative role of AI technologies, including deep learning and advanced data analytics, in accelerating key stages of the drug discovery process: target identification, drug design, clinical trial optimization, and drug response prediction. Cutting-edge tools such as DrugnomeAI and PandaOmics have made substantial contributions to therapeutic target identification, while AI's predictive capabilities are driving personalized treatment strategies. Additionally, advancements like AlphaFold highlight AI's capacity to address intricate challenges in drug development. However, the field faces significant challenges, including the management of large-scale genomic datasets and ethical concerns surrounding AI deployment in healthcare. This review underscores the promise of data-centric AI approaches and emphasizes the necessity of continued innovation and interdisciplinary collaboration. Together, AI and genomics are charting a path toward more precise, efficient, and transformative cancer therapeutics.
Affiliation(s)
- Minh Huu Nhat Le
  - International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan; AIBioMed Research Group, Taipei Medical University, Taipei 110, Taiwan
- Phat Ky Nguyen
  - International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei, Taiwan; AIBioMed Research Group, Taipei Medical University, Taipei 110, Taiwan
- Hien Quang Nguyen
  - Cardiovascular Research Department, Methodist Hospital, Merrillville, IN 46410, USA
- Dao Ngoc Hien Tam
  - Regulatory Affairs Department, Asia Shine Trading & Service Co. LTD, Viet Nam
- Han Hong Huynh
  - International Master Program for Translational Science, College of Medical Science and Technology, Taipei Medical University, Taipei 110, Taiwan
- Phat Kim Huynh
  - Department of Industrial and Systems Engineering, North Carolina A&T State University, Greensboro, NC 27411, USA
- Nguyen Quoc Khanh Le
  - AIBioMed Research Group, Taipei Medical University, Taipei 110, Taiwan; In-Service Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan; Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan

2
Park J, Han M, Lee K, Park S. Hierarchical Graph Attention Network with Positive and Negative Attentions for Improved Interpretability: ISA-PN. J Chem Inf Model 2025; 65:1115-1127. [PMID: 39654089 DOI: 10.1021/acs.jcim.4c01035] [Indexed: 02/11/2025]
Abstract
With the advancement of deep learning (DL) methods in chemistry and materials science, the interpretability of DL models has become a critical issue in elucidating quantitative (molecular) structure-property relationships. Although attention mechanisms have been generally employed to explain the importance of molecular substructures that contribute to molecular properties, their interpretability remains limited. In this work, we introduce a versatile segmentation method and develop an interpretable subgraph attention (ISA) network with positive and negative streams (ISA-PN) to enhance the understanding of molecular structure-property relationships. The predictive performance of the ISA models was validated using data sets for aqueous solubility, lipophilicity, and melting temperature, with a particular focus on evaluating interpretability for the aqueous solubility data set. The ISA-PN model enables the quantification of the contributions of molecular substructures through positive and negative attention scores. Comparative analyses of the ISA, ISA-PN, and GC-Net (group contribution network) models demonstrate that the ISA-PN model significantly improves interpretability while maintaining similar accuracy levels. This study highlights the efficacy of the ISA-PN model in providing meaningful insights into the contributions of molecular substructures to molecular properties, thereby enhancing the interpretability of DL models in chemical applications.
Affiliation(s)
- Jinyong Park
  - Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Minhi Han
  - Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Kiwoong Lee
  - Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea
- Sungnam Park
  - Department of Chemistry and Research Institute for Natural Science, Korea University, Seoul 02841, Korea

3
Zhou P, Zhou Q, Xiao X, Fan X, Zou Y, Sun L, Jiang J, Song D, Chen L. Machine Learning in Solid-State Hydrogen Storage Materials: Challenges and Perspectives. Adv Mater 2025; 37:e2413430. [PMID: 39703108 DOI: 10.1002/adma.202413430] [Received: 09/07/2024] [Revised: 11/10/2024] [Indexed: 12/21/2024]
Abstract
Machine learning (ML) has emerged as a pioneering tool in advancing the research application of high-performance solid-state hydrogen storage materials (HSMs). This review summarizes the state-of-the-art research of ML in resolving crucial issues such as low hydrogen storage capacity and unfavorable de-/hydrogenation cycling conditions. First, the datasets, feature descriptors, and prevalent ML models tailored for HSMs are described. Specific examples include the successful application of ML in titanium-based, rare-earth-based, solid solution, magnesium-based, and complex HSMs, showcasing its role in exploiting composition-structure-property relationships and designing novel HSMs for specific applications. One of the representative ML works is the single-phase Ti-based HSM with superior cost-effective and comprehensive properties, tailored to fuel cell hydrogen feeding system at ambient temperature and pressure through high-throughput composition-performance scanning. More importantly, this review also identifies and critically analyzes the key challenges faced by ML in this domain, including poor data quality and availability, and the balance between model interpretability and accuracy, together with feasible countermeasures suggested to ameliorate these problems. In summary, this work outlines a roadmap for enhancing ML's utilization in solid-state hydrogen storage research, promoting more efficient and sustainable energy storage solutions.
Affiliation(s)
- Panpan Zhou
  - College of Materials Science and Engineering, Hohai University, Changzhou, Jiangsu, 213200, China
  - State Key Laboratory of Silicon and Advanced Semiconductor Materials, School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Qianwen Zhou
  - State Key Laboratory of Silicon and Advanced Semiconductor Materials, School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Xuezhang Xiao
  - State Key Laboratory of Silicon and Advanced Semiconductor Materials, School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
  - School of Advanced Energy, Sun Yat-Sen University, Shenzhen, 518107, China
- Xiulin Fan
  - State Key Laboratory of Silicon and Advanced Semiconductor Materials, School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
- Yongjin Zou
  - Guangxi Key Laboratory of Information Materials, Guilin University of Electronic Technology, Guilin, 541004, China
- Lixian Sun
  - Guangxi Key Laboratory of Information Materials, Guilin University of Electronic Technology, Guilin, 541004, China
- Jinghua Jiang
  - College of Materials Science and Engineering, Hohai University, Changzhou, Jiangsu, 213200, China
- Dan Song
  - College of Materials Science and Engineering, Hohai University, Changzhou, Jiangsu, 213200, China
- Lixin Chen
  - State Key Laboratory of Silicon and Advanced Semiconductor Materials, School of Materials Science and Engineering, Zhejiang University, Hangzhou, Zhejiang, 310058, China
  - Key Laboratory of Hydrogen Storage and Transportation Technology of Zhejiang Province, Hangzhou, Zhejiang, 310027, China

4
Moon H, Rho M. MultiChem: predicting chemical properties using multi-view graph attention network. BioData Min 2025; 18:4. [PMID: 39815309 PMCID: PMC11737097 DOI: 10.1186/s13040-024-00419-4] [Received: 09/20/2024] [Accepted: 12/26/2024] [Indexed: 01/18/2025] [Open Access]
Abstract
BACKGROUND: Understanding the molecular properties of chemical compounds is essential for identifying potential candidates or ensuring safety in drug discovery. However, exploring the vast chemical space is time-consuming and costly, necessitating the development of time-efficient and cost-effective computational methods. Recent advances in deep learning approaches have offered deeper insights into molecular structures. Leveraging this progress, we developed a novel multi-view learning model.
RESULTS: We introduce a graph-integrated model that captures both local and global structural features of chemical compounds. In our model, graph attention layers are employed to effectively capture essential local structures by jointly considering atom and bond features, while multi-head attention layers extract important global features. We evaluated our model on nine MoleculeNet datasets, encompassing both classification and regression tasks, and compared its performance with state-of-the-art methods. Our model achieved an average area under the receiver operating characteristic (AUROC) of 0.822 and a root mean squared error (RMSE) of 1.133, representing a 3% improvement in AUROC and a 7% improvement in RMSE over state-of-the-art models in extensive seed testing.
CONCLUSION: MultiChem highlights the importance of integrating both local and global structural information in predicting molecular properties, while also assessing the stability of the models across multiple datasets using various random seed values.
IMPLEMENTATION: The codes are available at https://github.com/DMnBI/MultiChem.
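The two metrics quoted in this abstract can be illustrated with a minimal Python sketch. It is not taken from the MultiChem repository: the function names and the toy inputs are assumptions made for illustration, and AUROC is computed here by direct pairwise comparison rather than by any particular library routine.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error for a regression task."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_over_seeds(metric_values):
    """Average a metric over runs with different random seeds (hypothetical helper)."""
    return sum(metric_values) / len(metric_values)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a small illustration; a rank-based formulation would be the usual choice for large datasets.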
Affiliation(s)
- Heesang Moon
  - Department of Computer Science, Hanyang University, Seoul, Republic of Korea
- Mina Rho
  - Department of Computer Science, Hanyang University, Seoul, Republic of Korea
  - Department of Artificial Intelligence, Seoul, Republic of Korea
  - Department of Biomedical Informatics, Hanyang University, Seoul, Republic of Korea

5
Feldmann CW, Sieg J, Mathea M. Analysis of uncertainty of neural fingerprint-based models. Faraday Discuss 2025; 256:551-567. [PMID: 39320108 DOI: 10.1039/d4fd00095a] [Indexed: 09/26/2024]
Abstract
Machine learning has gained popularity for predicting molecular properties based on molecular structure. This study explores the uncertainty estimates of neural fingerprint-based models by comparing pure graph neural networks (GNN) to classical machine learning algorithms combined with neural fingerprints. We investigate the advantage of extracting the neural fingerprint from the GNN and integrating it into a method known for producing better-calibrated probability estimates. Comparisons are made using three classical machine learning methods and the Chemprop model, considering different molecular representations and calibration techniques. We utilize 19 datasets from Toxcast, reflecting real-world scenarios with balanced accuracies ranging from 0.6 to 0.8. Results demonstrate that neural fingerprints combined with classical machine learning methods exhibit a slight decrease in prediction performance compared to the native Chemprop model. However, these models provide significantly improved uncertainty estimates. Notably, uncertainty estimates of neural fingerprint-based methods remain relatively robust for molecules dissimilar to the training set. This suggests that methods like random forest with neural fingerprints can deliver strong prediction performance and reliable uncertainty estimates. When considering both performance and uncertainty, the calibrated Chemprop model and the combination of neural fingerprints with random forest or support vector classifier (SVC) yield comparable results. Surprisingly, the SVC method shows promising performance when combined with neural or count fingerprints. These findings are particularly relevant in real-world industrial projects where accurate predictions and reliable uncertainty estimates are crucial.

6
Morán-González L, Betten JE, Kneiding H, Balcells D. AABBA Graph Kernel: Atom-Atom, Bond-Bond, and Bond-Atom Autocorrelations for Machine Learning. J Chem Inf Model 2024; 64:8756-8769. [PMID: 39580812 PMCID: PMC11632777 DOI: 10.1021/acs.jcim.4c01583] [Received: 09/02/2024] [Revised: 11/03/2024] [Accepted: 11/15/2024] [Indexed: 11/26/2024]
Abstract
Graphs are one of the most natural and powerful representations available for molecules; natural because they have an intuitive correspondence to skeletal formulas, the language used by chemists worldwide, and powerful, because they are highly expressive both globally (molecular topology) and locally (atom and bond properties). Graph kernels are used to transform molecular graphs into fixed-length vectors, which, based on their capacity of measuring similarity, can be used as fingerprints for machine learning (ML). To date, graph kernels have mostly focused on the atomic nodes of the graph. In this work, we developed a graph kernel based on atom-atom, bond-bond, and bond-atom (AABBA) autocorrelations. The resulting vector representations were tested on regression ML tasks on a data set of transition metal complexes; a benchmark motivated by the higher complexity of these compounds relative to organic molecules. In particular, we tested different flavors of the AABBA kernel in the prediction of the energy barriers and bond distances of the Vaska's complex data set (Friederich et al., Chem. Sci., 2020, 11, 4584). For a variety of ML models, including neural networks, gradient boosting machines, and Gaussian processes, we showed that AABBA outperforms the baseline including only atom-atom autocorrelations. Dimensionality reduction studies also showed that the bond-bond and bond-atom autocorrelations yield many of the most relevant features. We believe that the AABBA graph kernel can accelerate the exploration of large chemical spaces and inspire novel molecular representations in which both atomic and bond properties play an important role.
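As a rough illustration of the atom-atom baseline mentioned in this abstract, the sketch below computes a topological autocorrelation vector: for each graph distance d, it sums the products of an atomic property over all atom pairs separated by d bonds. This is one common convention (unordered pairs, self-pairs at d = 0) and is an assumption here, not the AABBA implementation; the function names and the toy molecule are hypothetical. The bond-bond and bond-atom terms described in the paper would need bond properties and bond-based distances, which are not shown.

```python
from collections import deque


def topological_distances(n_atoms, bonds):
    """All-pairs shortest path lengths (counted in bonds) via BFS from each atom."""
    adjacency = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    distances = []
    for source in range(n_atoms):
        dist = [None] * n_atoms  # None marks atoms not reachable from source
        dist[source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[v] is None:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        distances.append(dist)
    return distances


def atom_atom_autocorrelation(props, bonds, max_depth):
    """Fixed-length vector: entry d sums props[i] * props[j] over pairs at distance d."""
    n_atoms = len(props)
    distances = topological_distances(n_atoms, bonds)
    vector = [0.0] * (max_depth + 1)
    for i in range(n_atoms):
        for j in range(i, n_atoms):  # unordered pairs, including i == j at depth 0
            d = distances[i][j]
            if d is not None and d <= max_depth:
                vector[d] += props[i] * props[j]
    return vector


# Toy example: heavy atoms of ethanol (C-C-O) with atomic number as the property.
ETHANOL_PROPS = [6, 6, 8]
ETHANOL_BONDS = [(0, 1), (1, 2)]
```

Because the vector length depends only on `max_depth`, molecules of different sizes map to vectors of the same length, which is what makes such descriptors usable as fingerprints.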
Affiliation(s)
- Lucía Morán-González
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033 0315 Oslo, Norway
  - Centre for Materials Science and Nanotechnology, Department of Chemistry, University of Oslo, P.O. Box 1033 0315 Oslo, Norway
- Jørn Eirik Betten
  - Simula Research Laboratory, Kristian Augusts Gate 23, 0164 Oslo, Norway
- Hannes Kneiding
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033 0315 Oslo, Norway
- David Balcells
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, University of Oslo, P.O. Box 1033 0315 Oslo, Norway

7
Etezadi F, Ito S, Yasui K, Kado Abdalkader R, Minami I, Uesugi M, Ganesh Pandian N, Nakano H, Nakano A, Packwood DM. Molecular Design for Cardiac Cell Differentiation Using a Small Data Set and Decorated Shape Features. J Chem Inf Model 2024; 64:8824-8837. [PMID: 39586080 DOI: 10.1021/acs.jcim.4c01353] [Indexed: 11/27/2024]
Abstract
The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, streamline the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we present the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.
Affiliation(s)
- Fatemeh Etezadi
  - Institute for Integrated Cell-Material Sciences (iCeMS), Kyoto University, Kyoto 606-8501, Japan
- Shunichi Ito
  - Institute for Integrated Cell-Material Sciences (iCeMS), Kyoto University, Kyoto 606-8501, Japan
  - Faculty of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
- Kosuke Yasui
  - Department of Applied Chemistry, Graduate School of Engineering, Osaka University, Osaka 565-0871, Japan
- Rodi Kado Abdalkader
  - Ritsumeikan Global Innovation Research Organization (R-GIRO), Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Motonari Uesugi
  - Institute for Integrated Cell-Material Sciences (iCeMS), Kyoto University, Kyoto 606-8501, Japan
  - Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
- Haruko Nakano
  - Department of Molecular Cell and Developmental Biology, University of California Los Angeles, Los Angeles, California 90095, United States
- Atsushi Nakano
  - Department of Molecular Cell and Developmental Biology, University of California Los Angeles, Los Angeles, California 90095, United States
  - Division of Cardiology, Department of Medicine, University of California Los Angeles, Los Angeles, California 90095, United States
  - Eli and Edyth Broad Center for Stem Cell and Regenerative Medicine, University of California Los Angeles, Los Angeles, California 90095, United States
  - Department of Cell Physiology, School of Medicine, Jikei University, Tokyo 105-8461, Japan
- Daniel M Packwood
  - Institute for Integrated Cell-Material Sciences (iCeMS), Kyoto University, Kyoto 606-8501, Japan

8
He Z, Yang W, Yang F, Zhang J, Ma L. Innovative medicinal chemistry strategies for enhancing drug solubility. Eur J Med Chem 2024; 279:116842. [PMID: 39260319 DOI: 10.1016/j.ejmech.2024.116842] [Received: 06/24/2024] [Revised: 08/25/2024] [Accepted: 08/26/2024] [Indexed: 09/13/2024]
Abstract
Drug candidates with poor solubility have been recognized as the cause of many drug development failures, owing to the fact that low solubility is unfavorable for physicochemical, pharmacokinetic (PK) and pharmacodynamic (PD) properties. Given the imperative role of solubility during drug development, we herein summarize various strategies for solubility optimizations from a medicinal chemistry perspective, including introduction of polar group, salt formation, structural simplification, disruption of molecular planarity and symmetry, optimizations on the solvent exposed region as well as prodrug design. In addition, methods for solubility assessment and prediction are reviewed. Besides, we have deeply discussed the strategies for solubility improvement. This paper is expected to be beneficial for the development of drug-like molecules with good solubility.
Affiliation(s)
- Zhangxu He
  - Pharmacy College, Henan University of Chinese Medicine, 450046, Zhengzhou, China
- Weiguang Yang
  - Children's Hospital Affiliated of Zhengzhou University, Henan Children's Hospital, Zhengzhou Children's Hospital, Henan, Zhengzhou, 450000, China
- Feifei Yang
  - Pharmacy College, Henan University of Chinese Medicine, 450046, Zhengzhou, China
- Jingyu Zhang
  - Pharmacy College, Henan University of Chinese Medicine, 450046, Zhengzhou, China
- Liying Ma
  - State Key Laboratory of Esophageal Cancer Prevention and Treatment, Key Laboratory of Advanced Drug Preparation Technologies, Ministry of Education, School of Pharmaceutical Sciences, Zhengzhou University, Zhengzhou, 450001, China; China Meheco Topfond Pharmaceutical Co., Zhumadian, 463000, China

9
Li C, Luo Y, Xie Y, Zhang Z, Liu Y, Zou L, Xiao F. Structural and functional prediction, evaluation, and validation in the post-sequencing era. Comput Struct Biotechnol J 2024; 23:446-451. [PMID: 38223342 PMCID: PMC10787220 DOI: 10.1016/j.csbj.2023.12.031] [Received: 10/20/2023] [Revised: 12/20/2023] [Accepted: 12/22/2023] [Indexed: 01/16/2024] [Open Access]
Abstract
The surge of genome sequencing data has underlined substantial genetic variants of uncertain significance (VUS). The decryption of VUS discovered by sequencing poses a major challenge in the post-sequencing era. Although experimental assays have progressed in classifying VUS, only a tiny fraction of the human genes have been explored experimentally. Thus, it is urgently needed to generate state-of-the-art functional predictors of VUS in silico. Artificial intelligence (AI) is an invaluable tool to assist in the identification of VUS with high efficiency and accuracy. An increasing number of studies indicate that AI has brought an exciting acceleration in the interpretation of VUS, and our group has already used AI to develop protein structure-based prediction models. In this review, we provide an overview of the previous research on AI-based prediction of missense variants, and elucidate the challenges and opportunities for protein structure-based variant prediction in the post-sequencing era.
Affiliation(s)
- Chang Li
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Yixuan Luo
  - Beijing Normal University, Beijing, China
- Yibo Xie
  - Information Center, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Zaifeng Zhang
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Ye Liu
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Lihui Zou
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
- Fei Xiao
  - Clinical Biobank, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - The Key Laboratory of Geriatrics, Beijing Institute of Geriatrics, Beijing Hospital, National Center of Gerontology, National Health Commission, Institute of Geriatric Medicine, Chinese Academy of Medical Sciences, Beijing, China
  - Beijing Normal University, Beijing, China

10
González Lastre M, Pou P, Wiche M, Ebeling D, Schirmeisen A, Pérez R. Molecular identification via molecular fingerprint extraction from atomic force microscopy images. J Cheminform 2024; 16:130. [PMID: 39587659 PMCID: PMC11587762 DOI: 10.1186/s13321-024-00921-1] [Received: 05/14/2024] [Accepted: 10/26/2024] [Indexed: 11/27/2024] [Open Access]
Abstract
Non-Contact Atomic Force Microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with totally unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome their limitations by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024-bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), that were developed for substructure and similarity searching. ECFPs provide local structural information of the molecule, each bit correlating with a particular substructure within the molecule. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy on theoretical images of 95.4%. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification. By construction, the number of times a certain substructure is present in the molecule is lost during the hashing process, necessary to make them useful for machine learning applications. We show that it is possible to complement the fingerprint-based virtual screening with global information provided by another DL model that predicts from the same HR-AFM stacks the chemical formula, boosting the identification accuracy up to 97.6%. Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.
Scientific contribution: Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub-optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via a virtual screening, beating previous state-of-the-art models.
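The screening step described in this abstract, ranking candidate molecules by the Tanimoto similarity between a predicted fingerprint and library fingerprints, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: fingerprints are represented as sets of on-bit indices, the library names and bit patterns are invented toy data, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(a & b) / union


def screen(query_bits, library, top_k=3):
    """Rank library entries by similarity to the query fingerprint.

    Returns (name, similarity) pairs, best first; ties are broken by name.
    The similarity doubles as a confidence score for each candidate.
    """
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical toy library: names and on-bit indices are illustrative only.
TOY_LIBRARY = {
    "mol_a": {1, 5, 9, 20},
    "mol_b": {1, 5, 9, 33},
    "mol_c": {2, 7},
}
```

In practice the query bits would come from a model prediction and the library fingerprints from a cheminformatics toolkit; both are outside the scope of this sketch.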
Affiliation(s)
- Manuel González Lastre
  - Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049, Madrid, Spain
- Pablo Pou
  - Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049, Madrid, Spain
  - Condensed Matter Physics Center (IFIMAC), Universidad Autónoma de Madrid, E-28049, Madrid, Spain
- Miguel Wiche
  - Institute of Applied Physics, Justus Liebig University Giessen, Giessen, Germany
  - Center for Materials Research, Justus Liebig University Giessen, Giessen, Germany
- Daniel Ebeling
  - Institute of Applied Physics, Justus Liebig University Giessen, Giessen, Germany
  - Center for Materials Research, Justus Liebig University Giessen, Giessen, Germany
- Andre Schirmeisen
  - Institute of Applied Physics, Justus Liebig University Giessen, Giessen, Germany
  - Center for Materials Research, Justus Liebig University Giessen, Giessen, Germany
- Rubén Pérez
  - Departamento de Física Teórica de la Materia Condensada, Universidad Autónoma de Madrid, E-28049, Madrid, Spain
  - Condensed Matter Physics Center (IFIMAC), Universidad Autónoma de Madrid, E-28049, Madrid, Spain

11
Quesado P, Torres LHM, Ribeiro B, Arrais JP. A Hybrid GNN Approach for Improved Molecular Property Prediction. J Comput Biol 2024; 31:1146-1157. [PMID: 39082155 DOI: 10.1089/cmb.2023.0452] [Indexed: 11/07/2024] [Open Access]
Abstract
The development of new drugs is a vital effort that has the potential to improve human health, well-being and life expectancy. Molecular property prediction is a crucial step in drug discovery, as it helps to identify potential therapeutic compounds. However, experimental methods for drug development can often be time-consuming and resource-intensive, with a low probability of success. To address such limitations, deep learning (DL) methods have emerged as a viable alternative due to their ability to identify high-discriminating patterns in molecular data. In particular, graph neural networks (GNNs) operate on graph-structured data to identify promising drug candidates with desirable molecular properties. These methods represent molecules as a set of node (atoms) and edge (chemical bonds) features to aggregate local information for molecular graph representation learning. Despite the availability of several GNN frameworks, each approach has its own shortcomings. Although, some GNNs may excel in certain tasks, they may not perform as well in others. In this work, we propose a hybrid approach that incorporates different graph-based methods to combine their strengths and mitigate their limitations to accurately predict molecular properties. The proposed approach consists in a multi-layered hybrid GNN architecture that integrates multiple GNN frameworks to compute graph embeddings for molecular property prediction. Furthermore, we conduct extensive experiments on multiple benchmark datasets to demonstrate that our hybrid approach significantly outperforms the state-of-the-art graph-based models. The data and code scripts to reproduce the results are available in the repository, https://github.com/pedro-quesado/HybridGNN.
Affiliation(s)
- Pedro Quesado
- Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, Univ Coimbra, Coimbra, Portugal
| | - Luis H M Torres
- Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, Univ Coimbra, Coimbra, Portugal
| | - Bernardete Ribeiro
- Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, Univ Coimbra, Coimbra, Portugal
| | - Joel P Arrais
- Department of Informatics Engineering, Centre for Informatics and Systems of the University of Coimbra, Univ Coimbra, Coimbra, Portugal
| |
|
12
|
Zhao D, Zhou J, Tu S, Xu L. De Novo Drug Design by Multi-Objective Path Consistency Learning With Beam A* Search. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2459-2470. [PMID: 39383073 DOI: 10.1109/tcbb.2024.3477592] [Indexed: 10/11/2024]
Abstract
Generating high-quality and drug-like molecules from scratch within the expansive chemical space presents a significant challenge in the field of drug discovery. In prior research, value-based reinforcement learning algorithms have been employed to generate molecules with multiple desired properties iteratively. The immediate reward was defined as the evaluation of intermediate-state molecules at each step, and the learning objective was to maximize the expected cumulative evaluation scores for all molecules along the generative path. However, this definition of the reward was misleading, as in reality, the optimization target should be the evaluation score of only the final generated molecule. Furthermore, in previous works, randomness was introduced into the decision-making process, enabling the generation of diverse molecules but no longer pursuing the maximum future rewards. In this paper, immediate reward is defined as the improvement achieved through the modification of the molecule, so that only the evaluation score of the final generated molecule is maximized. Originating from the A* search, path consistency (PC), i.e., values on one optimal path should be identical, is employed as the objective function in the update of the value estimator to train a multi-objective de novo drug designer. By incorporating the value into the decision-making process of beam search, the DrugBA algorithm is proposed to enable the large-scale generation of molecules that exhibit both high quality and diversity. Experimental results demonstrate a substantial enhancement over the state-of-the-art algorithm QADD in multiple molecular properties of the generated molecules.
|
13
|
Ahmad W, Chong KT, Tayara H. GGAS2SN: Gated Graph and SmilesToSeq Network for Solubility Prediction. J Chem Inf Model 2024; 64:7833-7843. [PMID: 39387596 DOI: 10.1021/acs.jcim.4c00792] [Indexed: 10/15/2024]
Abstract
Aqueous solubility is a critical physicochemical property in drug discovery. Solubility is a key issue in pharmaceutical development because it can limit a drug's absorption capacity. Accurate solubility prediction is crucial for pharmacological, environmental, and drug development studies. This research introduces a novel method for solubility prediction by combining gated graph neural networks (GGNNs) and graph attention neural networks (GATs) with Smiles2Seq encoding. Our methodology involves converting chemical compounds into graph structures with nodes representing atoms and edges indicating chemical bonds. These graphs are then processed by using a specialized graph neural network (GNN) architecture. Incorporating attention mechanisms into the GNN allows for capturing subtle structural dependencies, fostering improved solubility predictions. Furthermore, we utilized the Smiles2Seq encoding technique to bridge the semantic gap between molecular structures and their textual representations. Smiles2Seq converts chemical notations into numeric sequences, facilitating the efficient transfer of information into our model. We demonstrate the efficacy of our approach through comprehensive experiments on benchmark solubility data sets, showcasing superior predictive performance compared to traditional methods. Our model outperforms existing solubility prediction models and provides interpretable insights into the molecular features driving solubility behavior. This research signifies an important advancement in solubility prediction, offering potent tools for drug discovery, formulation development, and environmental assessments. The fusion of GGNN and Smiles2Seq encoding establishes a robust framework for accurately forecasting solubility across various chemical compounds, fostering innovation in various domains reliant on solubility data.
Affiliation(s)
- Waqar Ahmad
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Korea
| |
|
14
|
Liu H, Chen P, Zhang C, Huang X. Interpretable and Physicochemical-Intuitive Deep Learning Approach for the Design of Thermal Resistance of Energetic Compounds. J Phys Chem A 2024; 128:9045-9054. [PMID: 39380131 DOI: 10.1021/acs.jpca.4c04849] [Indexed: 10/10/2024]
Abstract
Thermal resistance of energetic materials is critical due to its impact on safety and sustainability. However, developing predictive models remains challenging because of data scarcity and limited insights into quantitative structure-property relationships. In this work, a deep learning framework, named EM-thermo, was proposed to address these challenges. A data set comprising 5029 CHNO compounds, including 976 energetic compounds, was constructed to facilitate this study. EM-thermo employs molecular graphs and direct message-passing neural networks to capture structural features and predict thermal resistance. Using transfer learning, the model achieves an accuracy of approximately 97% for predicting the thermal-resistance property (decomposition temperatures above 573.15 K) in energetic compounds. The involvement of molecular descriptors improved model prediction. These findings suggest that EM-thermo is effective for correlating thermal resistance from the atom and covalent bond level, offering a promising tool for advancing molecular design and discovery in the field of energetic compounds.
Affiliation(s)
- Haitao Liu
- Institute of Chemical Materials, China Academy of Engineering Physics (CAEP), Mianyang 621900, PR China
- School of National Defense & Nuclear Science and Technology, Southwest University of Science and Technology, Mianyang 621010, PR China
| | - Peng Chen
- Institute of Chemical Materials, China Academy of Engineering Physics (CAEP), Mianyang 621900, PR China
- School of National Defense & Nuclear Science and Technology, Southwest University of Science and Technology, Mianyang 621010, PR China
| | - Chaoyang Zhang
- Institute of Chemical Materials, China Academy of Engineering Physics (CAEP), Mianyang 621900, PR China
- Beijing Computational Science Research Center, Beijing 100193, PR China
| | - Xin Huang
- Institute of Chemical Materials, China Academy of Engineering Physics (CAEP), Mianyang 621900, PR China
| |
|
15
|
Rittig JG, Mitsos A. Thermodynamics-consistent graph neural networks. Chem Sci 2024:d4sc04554h. [PMID: 39430937 PMCID: PMC11485056 DOI: 10.1039/d4sc04554h] [Received: 07/09/2024] [Accepted: 10/07/2024] [Indexed: 10/22/2024] Open
Abstract
We propose excess Gibbs free energy graph neural networks (GE-GNNs) for predicting composition-dependent activity coefficients of binary mixtures. The GE-GNN architecture ensures thermodynamic consistency by predicting the molar excess Gibbs free energy and using thermodynamic relations to obtain activity coefficients. As these relations are differential, automatic differentiation is applied to learn the activity coefficients in an end-to-end manner. Since the architecture is based on fundamental thermodynamics, we do not require additional loss terms to learn thermodynamic consistency. As the output is a fundamental property, we also do not impose thermodynamic modeling limitations or assumptions. We demonstrate high accuracy and thermodynamic consistency of the activity coefficient predictions.
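The thermodynamic relations mentioned in this abstract can be written out for a binary mixture. What follows is the standard textbook derivation, not an excerpt from the paper; the shorthand $g = G^E/(RT)$ and the symbols $x_1$, $x_2 = 1 - x_1$, $n_i$ and $\gamma_i$ are notation assumed here for illustration.

$$
\ln \gamma_i = \left( \frac{\partial \left( n\, G^E / RT \right)}{\partial n_i} \right)_{T,\,P,\,n_{j \neq i}}
$$

$$
\ln \gamma_1 = g + x_2 \frac{\mathrm{d} g}{\mathrm{d} x_1}, \qquad
\ln \gamma_2 = g - x_1 \frac{\mathrm{d} g}{\mathrm{d} x_1}
$$

$$
x_1 \frac{\mathrm{d} \ln \gamma_1}{\mathrm{d} x_1} + x_2 \frac{\mathrm{d} \ln \gamma_2}{\mathrm{d} x_1}
= x_1 x_2 \frac{\mathrm{d}^2 g}{\mathrm{d} x_1^2} - x_2 x_1 \frac{\mathrm{d}^2 g}{\mathrm{d} x_1^2} = 0
$$

The last line is the Gibbs-Duhem relation at constant temperature and pressure. It holds for any differentiable $g(x_1)$, which is why a model that predicts $g$ and differentiates it yields consistent activity coefficients without an extra penalty term in the loss.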
Affiliation(s)
- Jan G Rittig
- Process Systems Engineering (AVT.SVT), RWTH Aachen University, Forckenbeckstraße 51, 52074 Aachen, Germany
| | - Alexander Mitsos
- Process Systems Engineering (AVT.SVT), RWTH Aachen University, Forckenbeckstraße 51, 52074 Aachen, Germany
- JARA-ENERGY, Templergraben 55, 52056 Aachen, Germany
- Institute of Climate and Energy Systems ICE-1: Energy Systems Engineering, Forschungszentrum Jülich GmbH, Wilhelm-Johnen-Straße, 52425 Jülich, Germany
| |
|
16
|
Spiekermann KA, Dong X, Menon A, Green WH, Pfeifle M, Sandfort F, Welz O, Bergeler M. Accurately Predicting Barrier Heights for Radical Reactions in Solution Using Deep Graph Networks. J Phys Chem A 2024; 128:8384-8403. [PMID: 39298746 DOI: 10.1021/acs.jpca.4c04121] [Indexed: 09/22/2024]
Abstract
Quantitative estimates of reaction barriers and solvent effects are essential for developing kinetic mechanisms and predicting reaction outcomes. Here, we create a new data set of 5,600 unique elementary radical reactions calculated using the M06-2X/def2-QZVP//B3LYP-D3(BJ)/def2-TZVP level of theory. A conformer search is done for each species using TPSS/def2-TZVP. Gibbs free energies of activation and of reaction for these radical reactions in 40 common solvents are obtained using COSMO-RS for solvation effects. These balanced reactions involve the elements H, C, N, O, and S, contain up to 19 heavy atoms, and have atom-mapped SMILES. All transition states are verified by an intrinsic reaction coordinate calculation. We next train a deep graph network to directly estimate the Gibbs free energy of activation and of reaction in both gas and solution phases using only the atom-mapped SMILES of the reactant and product and the SMILES of the solvent. This simple input representation avoids computationally expensive optimizations for the reactant, transition state, and product structures during inference, making our model well-suited for high-throughput predictive chemistry and quickly providing information for (retro-)synthesis planning tools. To properly measure model performance, we report results on both interpolative and extrapolative data splits and also compare to several baseline models. During training and testing, the data set is augmented by including the reverse direction of each reaction and variants with different resonance structures. After data augmentation, we have around 2 million entries to train the model, which achieves a testing set mean absolute error of 1.16 kcal mol$^{-1}$ for the Gibbs free energy of activation in solution. We anticipate this model will accelerate predictions for high-throughput screening to quickly identify relevant reactions in solution, and our data set will serve as a benchmark for future studies.
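The reverse-direction augmentation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the `Reaction` record, its field names, and the example values are hypothetical. The only chemistry it relies on is that the reverse reaction has a free energy of reaction of opposite sign and a barrier equal to the forward barrier minus the forward free energy of reaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical record for one data set entry (energies in kcal/mol)."""
    reactant: str         # atom-mapped SMILES of the reactant side
    product: str          # atom-mapped SMILES of the product side
    solvent: str          # SMILES of the solvent
    dg_activation: float  # Gibbs free energy of activation
    dg_reaction: float    # Gibbs free energy of reaction


def reverse(rxn: Reaction) -> Reaction:
    """Return the same elementary step read from product to reactant."""
    return Reaction(
        reactant=rxn.product,
        product=rxn.reactant,
        solvent=rxn.solvent,
        # Both directions share one transition state, so the reverse
        # barrier is the forward barrier minus the reaction free energy.
        dg_activation=rxn.dg_activation - rxn.dg_reaction,
        dg_reaction=-rxn.dg_reaction,
    )


def augment(reactions):
    """Return each reaction followed by its reverse direction."""
    augmented = []
    for rxn in reactions:
        augmented.append(rxn)
        augmented.append(reverse(rxn))
    return augmented


# Illustrative values only; they are not taken from the data set.
example = Reaction("[CH4:1].[OH:2]", "[CH3:1].[OH2:2]", "O", 10.0, -4.0)
for entry in augment([example]):
    print(entry)
```

Reversing twice returns the original entry, which is a convenient sanity check on the sign conventions.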
Affiliation(s)
- Kevin A Spiekermann
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Xiaorui Dong
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Angiras Menon
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - William H Green
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Mark Pfeifle
- BASF Digital Solutions GmbH, Ludwigshafen am Rhein 67061, Germany
| | - Frederik Sandfort
- BASF SE, Scientific Modeling, Group Research, Ludwigshafen am Rhein 67056, Germany
| | - Oliver Welz
- BASF SE, Scientific Modeling, Group Research, Ludwigshafen am Rhein 67056, Germany
| | - Maike Bergeler
- BASF SE, Scientific Modeling, Group Research, Ludwigshafen am Rhein 67056, Germany
| |
|
17
|
Bharadwaj S, Deepika K, Kumar A, Jaiswal S, Miglani S, Singh D, Fartyal P, Kumar R, Singh S, Singh MP, Gaidhane AM, Kumar B, Jha V. Exploring the Artificial Intelligence and Its Impact in Pharmaceutical Sciences: Insights Toward the Horizons Where Technology Meets Tradition. Chem Biol Drug Des 2024; 104:e14639. [PMID: 39396920 DOI: 10.1111/cbdd.14639] [Received: 07/27/2024] [Revised: 09/03/2024] [Accepted: 09/24/2024] [Indexed: 10/15/2024]
Abstract
The technological revolutions in computing and the advancement of high-throughput screening technologies have driven the application of artificial intelligence (AI) to the faster, more efficient, and more cost-effective discovery of hit or lead molecules. The ability of software and network frameworks to interpret representations of molecular structures and establish relationships/correlations has enabled various research teams to develop numerous AI platforms for identifying new lead molecules or discovering new targets for already established drug molecules. The prediction of biological activity, ADME properties, and toxicity parameters in early stages has reduced the chances of failure and the associated costs in later clinical stages, where failure has been frequent in the tedious, expensive, and laborious drug discovery process. This review focuses on the different AI and machine learning (ML) techniques and their applications, mainly in the pharmaceutical industry. The applications of AI frameworks in the identification of molecular targets, hit identification/hit-to-lead optimization, analysis of drug-receptor interactions, drug repurposing, polypharmacology, synthetic accessibility, clinical trial design, and pharmaceutical development are discussed in detail. We have also compiled the details of various AI startups in this field. This review provides readers with a comprehensive analysis and outline of various state-of-the-art AI/ML techniques and their framework applications. This review also highlights the challenges in this field, which need to be addressed for further success in pharmaceutical applications.
Affiliation(s)
- Shruti Bharadwaj
- Center for SeNSE, Indian Institute of Technology Delhi (IIT), New Delhi, India
| | - Kumari Deepika
- Department of Computer Engineering, Pune Institute of Computer Technology, Pune, India
| | - Asim Kumar
- Amity Institute of Pharmacy (AIP), Amity University Haryana, Manesar, India
| | - Shivani Jaiswal
- Institute of Pharmaceutical Research, GLA University, Mathura, India
| | - Shaweta Miglani
- Department of Education, Central University of Punjab, Bathinda, India
| | - Damini Singh
- IES Institute of Pharmacy, IES University, Bhopal, Madhya Pradesh, India
| | - Prachi Fartyal
- Department of Mathematics, Govt PG College Bajpur (US Nagar), Bazpur, Uttarakhand, India
| | - Roshan Kumar
- Department of Microbiology, Graphic Era (Deemed to be University), Dehradun, India
- Department of Microbiology, Central University of Punjab, VPO-Ghudda, Punjab, India
| | - Shareen Singh
- Centre for Research Impact & Outcome, Chitkara College of Pharmacy, Chitkara University, Rajpura, Punjab, India
| | - Mahendra Pratap Singh
- Center for Global Health Research, Saveetha Medical College and Hospital, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, India
| | - Abhay M Gaidhane
- Jawaharlal Nehru Medical College, and Global Health Academy, School of Epidemiology and Public Health, Datta Meghe Institute of Higher Education, Wardha, India
| | - Bhupinder Kumar
- Department of Pharmaceutical Science, Hemvati Nandan Bahuguna Garhwal (A Central) University, Srinagar, Uttarakhand, India
| | - Vibhu Jha
- Institute of Cancer Therapeutics, School of Pharmacy and Medical Sciences, Faculty of Life Sciences, University of Bradford, Bradford, UK
| |
|
18
|
Beccaria R, Lazzeri A, Tiana G. Predicting the Binding of Small Molecules to Proteins through Invariant Representation of the Molecular Structure. J Chem Inf Model 2024; 64:6758-6767. [PMID: 39197011 DOI: 10.1021/acs.jcim.4c00752] [Indexed: 08/30/2024]
Abstract
We present a computational scheme for predicting the ligands that bind to a pocket of a known structure. It is based on the generation of a general abstract representation of the molecules, which is invariant to rotations, translations, and permutations of atoms, and has some degree of isometry with the space of conformations. We use these representations to train a nondeep machine learning algorithm to classify the binding between pockets and molecule pairs and show that this approach has a better generalization capability than existing methods.
Affiliation(s)
- R Beccaria
- Department of Physics, University of Milano, via Celoria 16, 20133 Milano, Italy
- Heidelberg Institute for Theoretical Studies, Schloss-Wolfsbrunnenweg 35, 69118 Heidelberg, Germany
- Faculty of Physics, Heidelberg University, Im Neuenheimer Feld 227, 69120 Heidelberg, Germany
| | - A Lazzeri
- Department of Physics, University of Milano, via Celoria 16, 20133 Milano, Italy
| | - G Tiana
- Department of Physics, University of Milano, via Celoria 16, 20133 Milano, Italy
- INFN, via Celoria 16, 20133 Milano, Italy
| |
|
19
|
Risheh A, Rebel A, Nerenberg PS, Forouzesh N. Calculation of protein-ligand binding entropies using a rule-based molecular fingerprint. Biophys J 2024; 123:2839-2848. [PMID: 38481102 PMCID: PMC11393669 DOI: 10.1016/j.bpj.2024.03.017] [Received: 09/12/2023] [Revised: 12/21/2023] [Accepted: 03/08/2024] [Indexed: 03/28/2024] Open
Abstract
The use of fast in silico prediction methods for protein-ligand binding free energies holds significant promise for the initial phases of drug development. Numerous traditional physics-based models (e.g., implicit solvent models), however, tend to either neglect or heavily approximate entropic contributions to binding due to their computational complexity. Consequently, such methods often yield imprecise assessments of binding strength. Machine learning models provide accurate predictions and can often outperform physics-based models. They, however, are often prone to overfitting, and the interpretation of their results can be difficult. Physics-guided machine learning models combine the consistency of physics-based models with the accuracy of modern data-driven algorithms. This work integrates physics-based model conformational entropies into a graph convolutional network. We introduce a new neural network architecture (a rule-based graph convolutional network) that generates molecular fingerprints according to predefined rules specifically optimized for binding free energy calculations. Our results on 100 small host-guest systems demonstrate significant improvements in convergence and preventing overfitting. We additionally demonstrate the transferability of our proposed hybrid model by training it on the aforementioned host-guest systems and then testing it on six unrelated protein-ligand systems. Our new model shows little difference in training set accuracy compared to a previous model but an order-of-magnitude improvement in test set accuracy. Finally, we show how the results of our hybrid model can be interpreted in a straightforward fashion.
Affiliation(s)
- Ali Risheh
- Department of Computer Science, California State University, Los Angeles, California
| | - Alles Rebel
- Department of Computer Science, California State University, Los Angeles, California
| | - Paul S Nerenberg
- Kravis Department of Integrated Sciences, Claremont McKenna College, Claremont, California
| | - Negin Forouzesh
- Department of Computer Science, California State University, Los Angeles, California.
| |
|
20
|
Meza-González B, Ramírez-Palma DI, Carpio-Martínez P, Vázquez-Cuevas D, Martínez-Mayorga K, Cortés-Guzmán F. Quantum Topological Atomic Properties of 44K molecules. Sci Data 2024; 11:945. [PMID: 39209874 PMCID: PMC11362522 DOI: 10.1038/s41597-024-03723-0] [Received: 02/13/2024] [Accepted: 07/29/2024] [Indexed: 09/04/2024] Open
Abstract
We present a data set of quantum topological properties of atoms of 44K randomly selected molecules from the GDB-9 data set. These atomic properties were obtained as defined by the quantum theory of atoms in molecules (QTAIM) within an atomic basin, a region of real space bounded by zero-flux surfaces in the electron density gradient vector field. The wave function files were generated through DFT static calculations (B3LYP/6-31G), and the atomic properties were calculated using QTAIM. The calculated atomic properties include the energy of the atomic basin, the electronic population, the magnitude of the total dipole moment, and the magnitude of the total quadrupole moment. The atomic properties allow one to understand the chemical structure, reactivity, and molecular recognition. They can be incorporated into force fields for molecular dynamics or for predicting reactive sites. We believe that this data set could facilitate new studies in chemical informatics, machine learning applied to chemistry, and computational molecular design.
Affiliation(s)
- Brandon Meza-González
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de México, Mexico City, Mexico
| | - David I Ramírez-Palma
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Pablo Carpio-Martínez
- Centro Conjunto de Investigación en Química Sustentable UAEM-UNAM, Carretera Toluca-Atlacomulco, km. 14.5, Toluca, Estado de México, C.P. 50200, Mexico
| | - David Vázquez-Cuevas
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Karina Martínez-Mayorga
- Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Mérida, Yucatán, Mexico
| | - Fernando Cortés-Guzmán
- Facultad de Química, Universidad Nacional Autónoma de México, Ciudad de México, Mexico City, Mexico.
| |
|
21
|
Zhang WY, Zheng XL, Coghi PS, Chen JH, Dong BJ, Fan XX. Revolutionizing adjuvant development: harnessing AI for next-generation cancer vaccines. Front Immunol 2024; 15:1438030. [PMID: 39206192 PMCID: PMC11349682 DOI: 10.3389/fimmu.2024.1438030] [Received: 05/24/2024] [Accepted: 07/23/2024] [Indexed: 09/04/2024] Open
Abstract
With the COVID-19 pandemic, the importance of vaccines has been widely recognized and has led to increased research and development efforts. Vaccines also play a crucial role in cancer treatment by activating the immune system to target and destroy cancer cells. However, enhancing the efficacy of cancer vaccines remains a challenge. Adjuvants, which enhance the immune response to antigens and improve vaccine effectiveness, have faced limitations in recent years, resulting in few novel adjuvants being identified. The advancement of artificial intelligence (AI) technology in drug development has provided a foundation for adjuvant screening and application, leading to a diversification of adjuvants. This article reviews the significant role of tumor vaccines in basic research and clinical treatment and explores the use of AI technology to screen novel adjuvants from databases. The findings of this review offer valuable insights for the development of new adjuvants for next-generation vaccines.
Affiliation(s)
- Wan-Ying Zhang
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, Macao SAR, China
| | - Xiao-Li Zheng
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, Macao SAR, China
| | - Paolo Saul Coghi
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, Macao SAR, China
| | - Jun-Hui Chen
- Intervention and Cell Therapy Center, Peking University Shenzhen Hospital, Shenzhen, China
| | - Bing-Jun Dong
- Gynecology Department, Zhuhai Hospital of Integrated Traditional Chinese and Western Medicine, Zhuhai, China
| | - Xing-Xing Fan
- Dr. Neher’s Biophysics Laboratory for Innovative Drug Discovery, State Key Laboratory of Quality Research in Chinese Medicine, Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macao, Macao SAR, China
| |
|
22
|
Aksamit N, Tchagang A, Li Y, Ombuki-Berman B. Hybrid fragment-SMILES tokenization for ADMET prediction in drug discovery. BMC Bioinformatics 2024; 25:255. [PMID: 39090573 PMCID: PMC11295479 DOI: 10.1186/s12859-024-05861-z] [Received: 04/16/2024] [Accepted: 07/10/2024] [Indexed: 08/04/2024] Open
Abstract
BACKGROUND: Drug discovery and development is the extremely costly and time-consuming process of identifying new molecules that can interact with a biomarker target to interrupt the disease pathway of interest. In addition to binding the target, a drug candidate needs to satisfy multiple properties affecting absorption, distribution, metabolism, excretion, and toxicity (ADMET). Artificial intelligence approaches provide an opportunity to improve each step of the drug discovery and development process, in which the first question we face is how a molecule can be informatively represented such that the in-silico solutions are optimized.
RESULTS: This study introduces a novel hybrid SMILES-fragment tokenization method, coupled with two pre-training strategies, utilizing a Transformer-based model. We investigate the efficacy of hybrid tokenization in improving the performance of ADMET prediction tasks. Our approach leverages MTL-BERT, an encoder-only Transformer model that achieves state-of-the-art ADMET predictions, and contrasts the standard SMILES tokenization with our hybrid method across a spectrum of fragment library cutoffs.
CONCLUSION: The findings reveal that while an excess of fragments can impede performance, using hybrid tokenization with high-frequency fragments enhances results beyond the base SMILES tokenization. This advancement underscores the potential of integrating fragment- and character-level molecular features within the training of Transformer models for ADMET property prediction.
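To make the idea of mixing fragment-level and character-level tokens concrete, here is a rough Python sketch. It is not the paper's method: the abstract does not describe how fragments are derived or matched, so the greedy string matching, the function names `select_fragments` and `hybrid_tokenize`, the fragment counts, and the cutoff value are all assumptions made purely for illustration.

```python
from collections import Counter


def select_fragments(fragment_counts, min_count):
    """Keep fragments seen at least `min_count` times (hypothetical cutoff)."""
    return {frag for frag, n in fragment_counts.items() if n >= min_count}


def hybrid_tokenize(smiles, fragments):
    """Emit a library fragment when one starts at the current position
    (longest first); otherwise fall back to a single SMILES token."""
    ordered = sorted(fragments, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(smiles):
        match = next((f for f in ordered if f and smiles.startswith(f, i)), None)
        if match is not None:
            tokens.append(match)
            i += len(match)
        elif smiles[i] == "[" and "]" in smiles[i:]:
            end = smiles.index("]", i) + 1  # keep a bracket atom such as [NH4+] whole
            tokens.append(smiles[i:end])
            i = end
        elif smiles[i:i + 2] in ("Cl", "Br"):
            tokens.append(smiles[i:i + 2])  # two-letter organic-subset atoms
            i += 2
        else:
            tokens.append(smiles[i])
            i += 1
    return tokens


# Made-up fragment frequencies; only fragments above the cutoff are kept.
counts = Counter({"c1ccccc1": 120, "C(=O)O": 85, "C1CC1": 3})
library = select_fragments(counts, min_count=50)
print(hybrid_tokenize("CC(=O)Oc1ccccc1C(=O)O", library))
```

Raising `min_count` shrinks the library and pushes the output toward plain SMILES tokens, which loosely mirrors the trade-off the abstract reports between too many fragments and high-frequency fragments only.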
Affiliation(s)
- Nicholas Aksamit
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada
| | - Alain Tchagang
- Digital Technologies Research Centre, National Research Council Canada, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
| | - Yifeng Li
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
- Department of Biological Sciences, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
| | - Beatrice Ombuki-Berman
- Department of Computer Science, Brock University, 1812 Sir Isaac Brock Way, St. Catharines, ON, L2S 3A1, Canada.
| |
|
23
|
Li X, Zhao X, Yu X, Zhao J, Fang X. Construction of a multi-tissue compound-target interaction network of Qingfei Paidu decoction in COVID-19 treatment based on deep learning and transcriptomic analysis. J Bioinform Comput Biol 2024; 22:2450016. [PMID: 39036847 DOI: 10.1142/s0219720024500161] [Indexed: 07/23/2024]
Abstract
The Qingfei Paidu decoction (QFPDD) is a widely acclaimed therapeutic formula employed nationwide for the clinical management of coronavirus disease 2019 (COVID-19). QFPDD exerts a synergistic therapeutic effect, characterized by its multi-component, multi-target, and multi-pathway action. However, the intricate interactions among the ingredients and targets within QFPDD and their systematic effects in multiple tissues remain undetermined. To address this, we qualitatively characterized the chemical components of QFPDD. We integrated multi-tissue transcriptomic analysis with GraphDTA, a deep learning model, to screen for potential compound-target interactions of QFPDD in multiple tissues. We predicted 13 key active compounds, 127 potential targets and 27 pathways associated with QFPDD across six different tissues. Notably, oleanolic acid-AXL exhibited leading affinity in the heart, blood, and liver. Molecular docking and molecular dynamics simulation confirmed their strong binding affinity. The robust interaction between oleanolic acid and the AXL receptor suggests that AXL is a promising target for developing clinical intervention strategies. Through the construction of a multi-tissue compound-target interaction network, our study further elucidated the mechanisms through which QFPDD effectively combats COVID-19 in multiple tissues. Our work also establishes a framework for future investigations into the systemic effects of other Traditional Chinese Medicine (TCM) formulas in disease treatment.
Affiliation(s)
- Xia Li
- Third Clinical College, Shanxi Provincial Integrated TCM and WM Hospital, Shanxi University of Chinese Medicine, Jinzhong, Shanxi, P. R. China
| | - Xuetong Zhao
- National Genomics Data Center, China National Center for Bioinformation, Beijing 100101, P. R. China
- Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing 100101, P. R. China
- University of Chinese Academy of Sciences, Beijing 100049, P. R. China
- Xinjian Yu
- Quantitative and Computational Biosciences Graduate Program, Baylor College of Medicine, Houston, TX 77030, USA
- Jianping Zhao
- Third Clinical College, Shanxi Provincial Integrated TCM and WM Hospital, Shanxi University of Chinese Medicine, Jinzhong, Shanxi, P. R. China
- Xiangdong Fang
- National Genomics Data Center, China National Center for Bioinformation, Beijing 100101, P. R. China
- University of Chinese Academy of Sciences, Beijing 100049, P. R. China
24
Li J, Yanagisawa K, Akiyama Y. CycPeptMP: enhancing membrane permeability prediction of cyclic peptides with multi-level molecular features and data augmentation. Brief Bioinform 2024; 25:bbae417. [PMID: 39210505 PMCID: PMC11361855 DOI: 10.1093/bib/bbae417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Revised: 07/23/2024] [Accepted: 08/22/2024] [Indexed: 09/04/2024]
Abstract
Cyclic peptides are versatile therapeutic agents that boast high binding affinity, minimal toxicity, and the potential to engage challenging protein targets. However, the pharmaceutical utility of cyclic peptides is limited by their low membrane permeability, an essential indicator of oral bioavailability and intracellular targeting. Current machine learning-based models of cyclic peptide permeability show variable performance owing to the limitations of experimental data. Furthermore, these methods use features derived from the whole molecule that have traditionally been used to predict small molecules and ignore the unique structural properties of cyclic peptides. This study presents CycPeptMP: an accurate and efficient method to predict cyclic peptide membrane permeability. We designed features for cyclic peptides at the atom, monomer, and peptide levels and seamlessly integrated these into a fusion model using deep learning technology. Additionally, we applied various data augmentation techniques to enhance model training efficiency using the latest data. The fusion model exhibited excellent prediction performance for the logarithm of permeability, with a mean absolute error of $0.355$ and a correlation coefficient of $0.883$. Ablation studies demonstrated that all feature levels contributed and were relatively essential to predicting membrane permeability, confirming the effectiveness of augmentation to improve prediction accuracy. A comparison with a molecular dynamics-based method showed that CycPeptMP accurately predicted peptide permeability, which is otherwise difficult to predict using simulations.
Affiliation(s)
- Jianan Li
- Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo 1528550, Japan
- Keisuke Yanagisawa
- Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo 1528550, Japan
- Middle-Molecule IT-based Drug Discovery Laboratory (MIDL), Tokyo Institute of Technology, Tokyo 1528550, Japan
- Yutaka Akiyama
- Department of Computer Science, School of Computing, Tokyo Institute of Technology, Tokyo 1528550, Japan
- Middle-Molecule IT-based Drug Discovery Laboratory (MIDL), Tokyo Institute of Technology, Tokyo 1528550, Japan
25
Wang T, Xiang G, He S, Su L, Wang Y, Yan X, Lu H. DeepEnzyme: a robust deep learning model for improved enzyme turnover number prediction by utilizing features of protein 3D-structures. Brief Bioinform 2024; 25:bbae409. [PMID: 39162313 PMCID: PMC11880767 DOI: 10.1093/bib/bbae409] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2024] [Revised: 07/13/2024] [Accepted: 08/04/2024] [Indexed: 08/21/2024]
Abstract
Turnover numbers (kcat), which indicate an enzyme's catalytic efficiency, have a wide range of applications in fields including protein engineering and synthetic biology. Experimentally measuring the enzymes' kcat is always time-consuming. Recently, the prediction of kcat using deep learning models has mitigated this problem. However, the accuracy and robustness of kcat prediction still need to be improved significantly, particularly when dealing with enzymes with low sequence similarity compared to those within the training dataset. Herein, we present DeepEnzyme, a cutting-edge deep learning model that combines the most recent Transformer and Graph Convolutional Network (GCN) to capture the information of both the sequence and 3D-structure of a protein. To improve the prediction accuracy, DeepEnzyme was trained by leveraging the integrated features from both sequences and 3D-structures. Consequently, DeepEnzyme exhibits remarkable robustness when processing enzymes with low sequence similarity compared to those in the training dataset by utilizing additional features from high-quality protein 3D-structures. DeepEnzyme also makes it possible to evaluate how point mutations affect the catalytic activity of the enzyme, which helps identify residue sites that are crucial for the catalytic function. In summary, DeepEnzyme represents a pioneering effort in predicting enzymes' kcat values with improved accuracy and robustness compared to previous algorithms. This advancement will significantly contribute to our comprehension of enzyme function and its evolutionary patterns across species.
Affiliation(s)
- Tong Wang
- State Key Laboratory of Microbial Metabolism, School of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan RD. Minhang District, Shanghai 200240, China
- College of Science, Chongqing University of Technology, 69 Hongguang Avenue, Banan District, Chongqing 400054, China
- Guangming Xiang
- State Key Laboratory of Microbial Metabolism, School of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan RD. Minhang District, Shanghai 200240, China
- Siwei He
- State Key Laboratory of Microbial Metabolism, School of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan RD. Minhang District, Shanghai 200240, China
- Liyun Su
- College of Science, Chongqing University of Technology, 69 Hongguang Avenue, Banan District, Chongqing 400054, China
- Yuguang Wang
- Institute of Natural Sciences, School of Mathematical Sciences, Zhangjiang Institute of Advanced Study, Shanghai Jiao Tong University, 800 Dongchuan RD. Minhang District, Shanghai 200240, China
- Shanghai Artificial Intelligence Laboratory, 701 Yunjin Road, Xuhui District, Shanghai 200237, China
- Xuefeng Yan
- Key Laboratory of Smart Manufacturing in Energy Chemical Process, Ministry of Education, East China University of Science and Technology, 130 Meilong Road, Xuhui District, Shanghai 200237, China
- State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, 130 Meilong Road, Xuhui District, Shanghai 200237, China
- Hongzhong Lu
- State Key Laboratory of Microbial Metabolism, School of Life Science and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan RD. Minhang District, Shanghai 200240, China
26
Ramani V, Karmakar T. Graph Neural Networks for Predicting Solubility in Diverse Solvents Using MolMerger Incorporating Solute-Solvent Interactions. J Chem Theory Comput 2024. [PMID: 39041858 DOI: 10.1021/acs.jctc.4c00382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
The prediction of solubility is a complex and challenging physicochemical problem that has tremendous implications for the chemical and pharmaceutical industry. Recent advancements in machine learning methods have greatly expanded the scope for reliably predicting the solubility of a large number of molecular systems. However, most of these methods rely on using physical properties obtained from experiments and expensive quantum chemical calculations. Here, we developed a method that utilizes a graphical representation of solute-solvent interactions using "MolMerger," which captures the strongest polar interactions between molecules using Gasteiger charges and creates a graph incorporating the true nature of the system. Using these graphs as input, a neural network learns the correlation between the structural properties of a molecule in the form of node embedding and its physicochemical properties as the output. This approach has been used to calculate molecular solubility by predicting the Log solubility values of various organic molecules and pharmaceuticals in diverse sets of solvents.
Affiliation(s)
- Vansh Ramani
- Department of Chemical Engineering, Indian Institute of Technology, Delhi, Hauz Khas, New Delhi 110016, India
- Tarak Karmakar
- Department of Chemistry, Indian Institute of Technology, Delhi, Hauz Khas, New Delhi 110016, India
27
Catacutan DB, Alexander J, Arnold A, Stokes JM. Machine learning in preclinical drug discovery. Nat Chem Biol 2024:10.1038/s41589-024-01679-1. [PMID: 39030362 DOI: 10.1038/s41589-024-01679-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2023] [Accepted: 06/13/2024] [Indexed: 07/21/2024]
Abstract
Drug-discovery and drug-development endeavors are laborious, costly and time consuming. These programs can take upward of 12 years and cost US $2.5 billion, with a failure rate of more than 90%. Machine learning (ML) presents an opportunity to improve the drug-discovery process. Indeed, with the growing abundance of public and private large-scale biological and chemical datasets, ML techniques are becoming well positioned as useful tools that can augment the traditional drug-development process. In this Perspective, we discuss the integration of algorithmic methods throughout the preclinical phases of drug discovery. Specifically, we highlight an array of ML-based efforts, across diverse disease areas, to accelerate initial hit discovery, mechanism-of-action (MOA) elucidation and chemical property optimization. With advances in the application of ML across diverse therapeutic areas, we posit that fully ML-integrated drug-discovery pipelines will define the future of drug-development programs.
Affiliation(s)
- Denise B Catacutan
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
- Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
- David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jeremie Alexander
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
- Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
- David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Autumn Arnold
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
- Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
- David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
- Jonathan M Stokes
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, Canada
- Michael G. DeGroote Institute for Infectious Disease Research, McMaster University, Hamilton, Ontario, Canada
- David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, Canada
28
Liu Y, Moretti R, Wang Y, Dong H, Yan B, Bodenheimer B, Derr T, Meiler J. Advancements in Ligand-Based Virtual Screening through the Synergistic Integration of Graph Neural Networks and Expert-Crafted Descriptors. bioRxiv 2024:2023.04.17.537185. [PMID: 37131837 PMCID: PMC10153143 DOI: 10.1101/2023.04.17.537185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The fusion of traditional chemical descriptors with Graph Neural Networks (GNNs) offers a compelling strategy for enhancing ligand-based virtual screening methodologies. A comprehensive evaluation revealed that the benefits derived from this integrative strategy vary significantly among different GNNs. Specifically, while GCN and SchNet demonstrate pronounced improvements by incorporating descriptors, SphereNet exhibits only marginal enhancement. Intriguingly, despite SphereNet's modest gain, all three models (GCN, SchNet, and SphereNet) achieve comparable performance levels when leveraging this combination strategy. This observation underscores a pivotal insight: sophisticated GNN architectures may be substituted with simpler counterparts without sacrificing efficacy, provided that they are augmented with descriptors. Furthermore, our analysis reveals the robustness of a set of expert-crafted descriptors in scaffold-split scenarios, where they frequently outperform the combined GNN-descriptor models. Given the critical importance of scaffold splitting in accurately mimicking real-world drug discovery contexts, this finding accentuates an imperative for GNN researchers to innovate models that can adeptly navigate and predict within such frameworks. Our work not only validates the potential of integrating descriptors with GNNs in advancing ligand-based virtual screening but also illuminates pathways for future enhancements in model development and application. Our implementation can be found at https://github.com/meilerlab/gnn-descriptor.
Affiliation(s)
- Yunchao (Lance) Liu
- Department of Computer Science, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Rocco Moretti
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Yu Wang
- Department of Computer Science, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Ha Dong
- Department of Neural Science, Amherst College, 220 South Pleasant Street Amherst, Massachusetts 01002, USA
- Bailu Yan
- Department of Biostatistics, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Bobby Bodenheimer
- Department of Computer Science, Electrical Engineering and Computer Engineering, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Tyler Derr
- Department of Computer Science, Data Science Institute, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA
- Jens Meiler
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, 2201 West End Ave Nashville, Tennessee 37235, USA, Institute of Drug Discovery, Leipzig University Medical School, Härtelstraße 16-18, Leipzig, 04103, Germany, Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Humboldtstraße 25, Leipzig, 04105, Germany
29
Carnino JM, Pellegrini WR, Willis M, Cohen MB, Paz-Lansberg M, Davis EM, Grillone GA, Levi JR. Assessing ChatGPT's Responses to Otolaryngology Patient Questions. Ann Otol Rhinol Laryngol 2024; 133:658-664. [PMID: 38676440 DOI: 10.1177/00034894241249621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/28/2024]
Abstract
OBJECTIVE: This study aims to evaluate ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare. METHODS: A cross-sectional study was conducted using patient questions from the public online forum Reddit's r/AskDocs, where medical advice is sought from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and responses were reviewed by 5 board-certified otolaryngologists. The evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified. RESULTS: Patient questions averaged 224.93 words, while ChatGPT responses were longer at 414.93 words. The accuracy scores for ChatGPT responses were 3.76/5, comprehensiveness scores were 3.59/5, and bedside manner/empathy scores were 4.28/5. Longer patient questions did not correlate with higher response ratings. However, longer ChatGPT responses scored higher in bedside manner/empathy. Higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous. CONCLUSION: While ChatGPT exhibits promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI in medical advice. Responsible integration of AI into healthcare necessitates thorough assessments of model performance and ethical considerations for patient safety.
Affiliation(s)
- Jonathan M Carnino
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- William R Pellegrini
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Megan Willis
- Department of Biostatistics, Boston University, Boston, MA, USA
- Michael B Cohen
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Marianella Paz-Lansberg
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Elizabeth M Davis
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Gregory A Grillone
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
- Jessica R Levi
- Boston University Chobanian & Avedisian School of Medicine, Boston, MA, USA
- Department of Otolaryngology-Head and Neck Surgery, Boston Medical Center, Boston, MA, USA
30
Tang X, Tran A, Tan J, Gerstein MB. MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations. Bioinformatics 2024; 40:i357-i368. [PMID: 38940177 PMCID: PMC11256921 DOI: 10.1093/bioinformatics/btae260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024]
Abstract
MOTIVATION: The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS: We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION: Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
Affiliation(s)
- Xiangru Tang
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Andrew Tran
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Jeffrey Tan
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Mark B Gerstein
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
- Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
- Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
31
Xiang W, Zhong F, Ni L, Zheng M, Li X, Shi Q, Wang D. Gram matrix: an efficient representation of molecular conformation and learning objective for molecular pretraining. Brief Bioinform 2024; 25:bbae340. [PMID: 38990515 PMCID: PMC11238115 DOI: 10.1093/bib/bbae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 06/05/2024] [Accepted: 06/28/2024] [Indexed: 07/12/2024]
Abstract
Accurate prediction of molecular properties is fundamental in drug discovery and development, providing crucial guidance for effective drug design. A critical factor in achieving accurate molecular property prediction lies in the appropriate representation of molecular structures. Presently, prevalent deep learning-based molecular representations rely on 2D structure information as the primary molecular representation, often overlooking essential three-dimensional (3D) conformational information due to the inherent limitations of 2D structures in conveying atomic spatial relationships. In this study, we propose employing the Gram matrix both as a condensed representation of 3D molecular structures and as an efficient pretraining objective. Subsequently, we leverage this matrix to construct a novel molecular representation model, Pre-GTM, which inherently encapsulates 3D information. The model accurately predicts the 3D structure of a molecule by estimating the Gram matrix. Our findings demonstrate that the Pre-GTM model outperforms the baseline Graphormer model and other pretrained models in the QM9 and MoleculeNet quantitative property prediction tasks. The integration of the Gram matrix as a condensed representation of 3D molecular structure, incorporated into the Pre-GTM model, opens up promising avenues for its potential application across various domains of molecular research, including drug design, materials science, and chemical engineering.
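To make the idea of a Gram matrix as a condensed 3D representation concrete, here is a minimal sketch. It is not the Pre-GTM implementation: it assumes NumPy, uses a hypothetical 4-atom conformation, and the helper names (`gram_matrix`, `coords_from_gram`, `pairwise_distances`) are illustrative. It shows that the Gram matrix of centered coordinates is unchanged by rotation and translation, and that coordinates consistent with the original geometry can be rebuilt from it by eigendecomposition.

```python
import numpy as np

def gram_matrix(coords):
    """Gram matrix G = Xc @ Xc.T of centered coordinates (n_atoms x 3 -> n_atoms x n_atoms)."""
    centered = coords - coords.mean(axis=0)  # centering removes the translation
    return centered @ centered.T

def coords_from_gram(gram, dim=3):
    """Rebuild coordinates from a Gram matrix, up to a rotation or reflection."""
    eigvals, eigvecs = np.linalg.eigh(gram)            # symmetric eigendecomposition
    top = np.argsort(eigvals)[::-1][:dim]              # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # clip tiny negative round-off
    return eigvecs[:, top] * scale

def pairwise_distances(coords):
    """All interatomic distances, used here only to compare geometries."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

# Hypothetical 4-atom conformation (arbitrary units), for illustration only.
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.1, 0.0, 0.0],
    [1.5, 1.2, 0.0],
    [0.3, 1.0, 0.9],
])
gram = gram_matrix(coords)

# Rotate about the z axis and translate: the Gram matrix should not change.
theta = 0.7
rot = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta),  np.cos(theta), 0.0],
    [0.0,            0.0,           1.0],
])
moved = coords @ rot.T + np.array([2.0, -1.0, 0.5])

# Coordinates rebuilt from the Gram matrix reproduce the interatomic distances.
rebuilt = coords_from_gram(gram)

print(np.allclose(gram, gram_matrix(moved)))
print(np.allclose(pairwise_distances(rebuilt), pairwise_distances(coords)))
```

Because the matrix has rank at most 3 for 3D coordinates, keeping the three largest eigenvalues is enough to recover the geometry in this sketch.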
Affiliation(s)
- Feisheng Zhong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Fujian Key Laboratory of Drug Target Discovery and Structural and Functional Research, School of Pharmacy, Fujian Medical University, Fuzhou 350122, China
- Lin Ni
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Nanjing University of Chinese Medicine, 138 Xianlin Road, Nanjing 210023, China
- Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Qian Shi
- Lingang Laboratory, Shanghai 200031, China
32
Ju W, Fang Z, Gu Y, Liu Z, Long Q, Qiao Z, Qin Y, Shen J, Sun F, Xiao Z, Yang J, Yuan J, Zhao Y, Wang Y, Luo X, Zhang M. A Comprehensive Survey on Deep Graph Representation Learning. Neural Netw 2024; 173:106207. [PMID: 38442651 DOI: 10.1016/j.neunet.2024.106207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Revised: 01/23/2024] [Accepted: 02/21/2024] [Indexed: 03/07/2024]
Abstract
Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph can still maintain a relatively close distance, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal because: (i) traditional methods have limited model capacity, which limits the learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks depend on each other and should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, and a large number of deep graph representation learning techniques, especially graph neural networks, have been proposed in the past decade. In this survey, we comprehensively review current deep graph representation learning algorithms by proposing a new taxonomy of existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by graph neural network architecture and by the most recent advanced learning paradigms. Moreover, this survey also presents practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions that deserve further investigation in the future.
Affiliation(s)
- Wei Ju
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Zheng Fang
- School of Intelligence Science and Technology, Peking University, Beijing, 100871, China
- Yiyang Gu
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Zequn Liu
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Qingqing Long
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, 100086, China
- Ziyue Qiao
- Artificial Intelligence Thrust, The Hong Kong University of Science and Technology, Guangzhou, 511453, China
- Yifang Qin
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Jianhao Shen
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Fang Sun
- Department of Computer Science, University of California, Los Angeles, 90095, USA
- Zhiping Xiao
- Department of Computer Science, University of California, Los Angeles, 90095, USA
- Junwei Yang
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Jingyang Yuan
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Yusheng Zhao
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
- Yifan Wang
- School of Information Technology & Management, University of International Business and Economics, Beijing, 100029, China
- Xiao Luo
- Department of Computer Science, University of California, Los Angeles, 90095, USA
- Ming Zhang
- School of Computer Science, National Key Laboratory for Multimedia Information Processing, Peking University, Beijing, 100871, China
33
Zheng EJ, Valeri JA, Andrews IW, Krishnan A, Bandyopadhyay P, Anahtar MN, Herneisen A, Schulte F, Linnehan B, Wong F, Stokes JM, Renner LD, Lourido S, Collins JJ. Discovery of antibiotics that selectively kill metabolically dormant bacteria. Cell Chem Biol 2024; 31:712-728.e9. [PMID: 38029756 PMCID: PMC11031330 DOI: 10.1016/j.chembiol.2023.10.026] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/29/2022] [Revised: 08/13/2023] [Accepted: 10/30/2023] [Indexed: 12/01/2023]
Abstract
There is a need to discover and develop non-toxic antibiotics that are effective against metabolically dormant bacteria, which underlie chronic infections and promote antibiotic resistance. Traditional antibiotic discovery has historically favored compounds effective against actively metabolizing cells, a property that is not predictive of efficacy in metabolically inactive contexts. Here, we combine a stationary-phase screening method with deep learning-powered virtual screens and toxicity filtering to discover compounds with lethality against metabolically dormant bacteria and favorable toxicity profiles. The most potent and structurally distinct compound without any obvious mechanistic liability was semapimod, an anti-inflammatory drug effective against stationary-phase E. coli and A. baumannii. Integrating microbiological assays, biochemical measurements, and single-cell microscopy, we show that semapimod selectively disrupts and permeabilizes the bacterial outer membrane by binding lipopolysaccharide. This work illustrates the value of harnessing non-traditional screening methods and deep learning models to identify non-toxic antibacterial compounds that are effective in infection-relevant contexts.
Affiliation(s)
- Erica J Zheng
- Program in Chemical Biology, Harvard University, Cambridge, MA 02138, USA; Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Jacqueline A Valeri
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Ian W Andrews
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Aarti Krishnan
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
| | - Parijat Bandyopadhyay
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Melis N Anahtar
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA
| | - Alice Herneisen
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA; Department of Biology, MIT, Cambridge, MA 02139, USA
| | - Fabian Schulte
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
| | - Brooke Linnehan
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
| | - Felix Wong
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Jonathan M Stokes
- Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario L8N 3Z5, Canada
| | - Lars D Renner
- Leibniz Institute of Polymer Research and the Max Bergmann Center of Biomaterials, 01062 Dresden, Germany
| | - Sebastian Lourido
- Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA; Department of Biology, MIT, Cambridge, MA 02139, USA
| | - James J Collins
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Institute for Medical Engineering & Science, Department of Biological Engineering, and Synthetic Biology Center, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA 02115, USA; Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA 02139, USA.
| |
|
34
|
Zhang Z, Bian Y, Xie A, Han P, Zhou S. Can Pretrained Models Really Learn Better Molecular Representations for AI-Aided Drug Discovery? J Chem Inf Model 2024; 64:2921-2930. [PMID: 38145387 PMCID: PMC11005046 DOI: 10.1021/acs.jcim.3c01707] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 10/23/2023] [Revised: 11/29/2023] [Accepted: 11/29/2023] [Indexed: 12/26/2023]
Abstract
Self-supervised pretrained models are increasingly popular in AI-aided drug discovery, leading to more and more pretrained models with the promise that they can extract better feature representations for molecules. Yet, the quality of learned representations has not been fully explored. In this work, inspired by the two phenomena of Activity Cliffs (ACs) and Scaffold Hopping (SH) in traditional Quantitative Structure-Activity Relationship analysis, we propose a method named Representation-Property Relationship Analysis (RePRA) to evaluate the quality of the representations extracted by the pretrained model and visualize the relationship between the representations and properties. The concepts of ACs and SH are generalized from the structure-activity context to the representation-property context, and the underlying principles of RePRA are analyzed theoretically. Two scores are designed to measure the generalized ACs and SH detected by RePRA, and therefore, the quality of representations can be evaluated. In experiments, representations of molecules from 10 target tasks generated by 7 pretrained models are analyzed. The results indicate that the state-of-the-art pretrained models can overcome some shortcomings of canonical Extended-Connectivity FingerPrints, while the correlation between the basis of the representation space and specific molecular substructures is not explicit. Thus, some representations could be even worse than the canonical fingerprints. Our method enables researchers to evaluate the quality of molecular representations generated by their proposed self-supervised pretrained models. Our findings can also guide the community to develop better pretraining techniques to regularize the occurrence of ACs and SH.
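As a rough illustration of how ACs and SH carry over from the structure-activity setting to the representation-property setting, the sketch below flags molecule pairs whose representations are close but whose properties differ strongly (a generalized AC) and pairs whose representations are far apart but whose properties are close (a generalized SH). The cosine similarity, the thresholds, the toy vectors, and the function names are all assumptions made for illustration; they are not the two scores defined by RePRA.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def flag_pairs(reps, props, sim_hi=0.9, sim_lo=0.3, prop_gap=1.0):
    """Return (cliff_pairs, hop_pairs) as lists of index pairs.

    Thresholds are hypothetical and would need tuning per task.
    """
    cliffs, hops = [], []
    for i, j in combinations(range(len(reps)), 2):
        sim = cosine(reps[i], reps[j])
        gap = abs(props[i] - props[j])
        if sim >= sim_hi and gap >= prop_gap:
            cliffs.append((i, j))  # similar representation, different property
        elif sim <= sim_lo and gap < prop_gap:
            hops.append((i, j))    # dissimilar representation, similar property
    return cliffs, hops


# Toy data: three molecules with 2-D representations and one property each.
reps = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
props = [5.0, 7.5, 5.2]
cliffs, hops = flag_pairs(reps, props)
print("generalized AC pairs:", cliffs)
print("generalized SH pairs:", hops)
```

In a well-behaved representation space one would hope to see few pairs of the first kind, since they mean that nearby points carry very different property values.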
Affiliation(s)
- Ziqiao Zhang
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
| | | | - Ailin Xie
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
| | - Pengju Han
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
| | - Shuigeng Zhou
- Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, Shanghai 200438, China
| |
|
35
|
Mehta MJ, Kim HJ, Lim SB, Naito M, Miyata K. Recent Progress in the Endosomal Escape Mechanism and Chemical Structures of Polycations for Nucleic Acid Delivery. Macromol Biosci 2024; 24:e2300366. [PMID: 38226723 DOI: 10.1002/mabi.202300366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Revised: 12/22/2023] [Indexed: 01/17/2024]
Abstract
Nucleic acid-based therapies are seeing a rapid surge. Stimuli-responsive polymers, especially pH-responsive ones, are gaining widespread attention because of their ability to efficiently deliver nucleic acids. These polymers can be synthesized and modified according to target requirements, such as delivery sites and the nature of nucleic acids. In this regard, the endosomal escape mechanism of polymer-nucleic acid complexes (polyplexes) remains a topic of considerable interest owing to various plausible escape mechanisms. This review describes current progress in the endosomal escape mechanism of polyplexes and state-of-the-art chemical designs for pH-responsive polymers. The importance of the acid dissociation constant (i.e., pKa) in designing the new generation of pH-responsive polymers is also discussed, along with assays to monitor and quantify the endosomal escape behavior. Further, the use of machine learning in pKa prediction and polymer design to find novel chemical structures for pH responsiveness is addressed. This review will facilitate the design of new pH-responsive polymers for advanced and efficient nucleic acid delivery.
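To make the role of pKa concrete, the following sketch applies the Henderson-Hasselbalch relation to a single ionizable amine group and reports the fraction that is protonated at a given pH. The pKa of 6.5 and the two pH values (7.4 for a neutral extracellular environment, 5.5 for an acidified endosome) are illustrative assumptions, not values taken from the review, and a real polycation with many interacting groups will deviate from this single-site picture.

```python
def fraction_protonated(ph, pka):
    """Fraction of a basic group in its protonated (cationic) form.

    Single-site Henderson-Hasselbalch relation: 1 / (1 + 10**(pH - pKa)).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Hypothetical amine with pKa 6.5, compared at two assumed pH values.
pka = 6.5
for label, ph in [("extracellular, pH 7.4", 7.4), ("endosomal, pH 5.5", 5.5)]:
    print(f"{label}: {fraction_protonated(ph, pka):.2f} protonated")
```

Under these assumptions the group is mostly neutral at pH 7.4 and mostly charged at pH 5.5, which is the kind of pH-triggered switch that motivates tuning pKa in polymer design.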
Affiliation(s)
- Mohit J Mehta
- Department of Biological Sciences and Bioengineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea
| | - Hyun Jin Kim
- Department of Biological Sciences and Bioengineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea
- Department of Biological Engineering, College of Engineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea
| | - Sung Been Lim
- Department of Biological Sciences and Bioengineering, Inha University, 100 Inha-ro, Michuhol-gu, Incheon, 22212, Republic of Korea
| | - Mitsuru Naito
- Department of Materials Engineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
| | - Kanjiro Miyata
- Department of Materials Engineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
- Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
| |
|
36
|
Llompart P, Minoletti C, Baybekov S, Horvath D, Marcou G, Varnek A. Will we ever be able to accurately predict solubility? Sci Data 2024; 11:303. [PMID: 38499581 PMCID: PMC10948805 DOI: 10.1038/s41597-024-03105-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2023] [Accepted: 02/29/2024] [Indexed: 03/20/2024] Open
Abstract
Accurate prediction of thermodynamic solubility by machine learning remains a challenge. Recent models often display good performance, but their reliability may be deceiving when used prospectively. This study investigates the origins of these discrepancies, following three directions: a historical perspective, an analysis of the aqueous solubility dataverse, and data quality. We investigated over 20 years of published solubility datasets and models, highlighting overlooked datasets and the overlaps between popular sets. We benchmarked recently published models on a novel curated solubility dataset and report poor performance. We also propose a workflow to curate aqueous solubility data, aiming at producing useful models for bench chemists. Our results demonstrate that some state-of-the-art models are not ready for public usage because they lack a well-defined applicability domain and overlook historical data sources. We report the impact of factors influencing the utility of the models: interlaboratory standard deviation, ionic state of the solute, and data sources. The models obtained herein and the quality-assessed datasets are publicly available.
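One of the factors named above, interlaboratory standard deviation, can be illustrated with a small curation step: repeated measurements of the same compound are grouped, and compounds whose values disagree by more than a tolerance are set aside for review. This is only a sketch under stated assumptions; the record layout, the 0.5 log-unit tolerance, and the function name are hypothetical and do not reproduce the workflow proposed in the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def aggregate(records, max_sd=0.5):
    """Group (compound_id, logS) records and split them by measurement spread.

    Returns two dicts mapping compound_id -> (mean logS, sample SD, n).
    Compounds with a single measurement get SD 0.0 and are kept.
    """
    by_compound = defaultdict(list)
    for compound_id, log_s in records:
        by_compound[compound_id].append(log_s)

    kept, flagged = {}, {}
    for compound_id, values in by_compound.items():
        sd = stdev(values) if len(values) > 1 else 0.0
        target = kept if sd <= max_sd else flagged
        target[compound_id] = (mean(values), sd, len(values))
    return kept, flagged


# Toy records: compound B has two reports that disagree by 1.6 log units.
records = [("A", -2.1), ("A", -2.3), ("B", -4.0), ("B", -5.6), ("C", -1.0)]
kept, flagged = aggregate(records)
print("kept:", sorted(kept))
print("flagged for review:", sorted(flagged))
```

Keeping single-measurement compounds is itself a design choice; a stricter curation could require at least two concordant sources before a value enters the training set.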
Affiliation(s)
- P Llompart
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
- IDD/CADD, Sanofi, Vitry-Sur-Seine, France
| | | | - S Baybekov
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| | - D Horvath
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| | - G Marcou
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France.
| | - A Varnek
- Laboratory of Chemoinformatics, UMR7140, University of Strasbourg, Strasbourg, France
| |
|
37
|
Jung SG, Jung G, Cole JM. Automatic Prediction of Peak Optical Absorption Wavelengths in Molecules Using Convolutional Neural Networks. J Chem Inf Model 2024; 64:1486-1501. [PMID: 38422386 PMCID: PMC10934802 DOI: 10.1021/acs.jcim.3c01792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Revised: 02/15/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024]
Abstract
Molecular design depends heavily on optical properties for applications such as solar cells and polymer-based batteries. Accurate prediction of these properties is essential, and multiple predictive methods exist, from ab initio to data-driven techniques. Although theoretical methods, such as time-dependent density functional theory (TD-DFT) calculations, have well-established physical relevance and are among the most popular methods in computational physics and chemistry, they exhibit errors that are inherent in their approximate nature. These high-throughput electronic structure calculations also incur a substantial computational cost. With the emergence of big-data initiatives, cost-effective, data-driven methods have gained traction, although their usability is highly contingent on the degree of data quality and sparsity. In this study, we present a workflow that employs deep residual convolutional neural networks (DR-CNN) and gradient boosting feature selection to predict peak optical absorption wavelengths (λmax) exclusively from SMILES representations of dye molecules and solvents; one would normally measure λmax using UV-vis absorption spectroscopy. We use a multifidelity modeling approach, integrating 34,893 DFT calculations and 26,395 experimentally derived λmax data, to deliver more accurate predictions via a Bayesian-optimized gradient boosting machine. Our approach is benchmarked against the state of the art that is reported in the scientific literature; results demonstrate that learnt representations via a DR-CNN workflow that is integrated with other machine learning methods can accelerate the design of molecules for specific optical characteristics.
Affiliation(s)
- Son Gyo Jung
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
- Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0FA, U.K.
| | - Guwon Jung
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0FA, U.K.
- Scientific Computing Department, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
| | - Jacqueline M. Cole
- Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
- ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.
- Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0FA, U.K.
| |
|
38
|
Shi W, Lin K, Zhao Y, Li Z, Zhou T. Toward a comprehensive understanding of alicyclic compounds: Bio-effects perspective and deep learning approach. Sci Total Environ 2024; 912:168927. [PMID: 38042202 DOI: 10.1016/j.scitotenv.2023.168927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2023] [Revised: 11/17/2023] [Accepted: 11/25/2023] [Indexed: 12/04/2023]
Abstract
The escalating use of alicyclic compounds in modern industrial production has led to a rapid increase of these substances in the environment, posing significant health hazards. Addressing this challenge necessitates a comprehensive understanding of these compounds, which can be achieved through a deep learning approach. Graph neural networks (GNNs), known for their extraordinary ability to process graph data with rich relationships, have been employed in various molecular prediction tasks. In this study, alicyclic molecules screened from PCBA, ToxCast and Tox21 are assembled into two prediction datasets, one for general bioactivity and one for activity against biological targets. GNN-based models are trained on the two datasets, with Attentive FP and PAGTN achieving the best performance on them, respectively. In addition, alicyclic carbon atoms make the greatest contribution to biological activity, which indicates that the alicyclic structures have a significant impact on the carbon atoms' contribution. Moreover, there is a large number of active molecules in other public datasets, indicating that alicyclic compounds deserve more attention in POPs control. This study uncovered deeper structure-activity relationships within these compounds, offering new perspectives and methodologies for academic research in the field.
Affiliation(s)
- Wenjie Shi
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China.
| | - Kunsen Lin
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China.
| | - Youcai Zhao
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China; Shanghai Institute of Pollution Control and Ecological Security, 1515 North Zhongshan Rd. (No. 2), Shanghai 200092, PR China
| | - Zongsheng Li
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China
| | - Tao Zhou
- The State Key Laboratory of Pollution Control and Resource Reuse, School of Environmental Science and Engineering, Tongji University, 1239 Siping Road, Shanghai 200092, China; Shanghai Institute of Pollution Control and Ecological Security, 1515 North Zhongshan Rd. (No. 2), Shanghai 200092, PR China.
| |
|
39
|
Zhang R, Mahjour B, Outlaw A, McGrath A, Hopper T, Kelley B, Walters WP, Cernak T. Exploring the combinatorial explosion of amine-acid reaction space via graph editing. Commun Chem 2024; 7:22. [PMID: 38310120 PMCID: PMC10838272 DOI: 10.1038/s42004-024-01101-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2023] [Accepted: 01/08/2024] [Indexed: 02/05/2024] Open
Abstract
Amines and carboxylic acids are abundant chemical feedstocks that are nearly exclusively united via the amide coupling reaction. The disproportionate use of the amide coupling leaves a large section of unexplored reaction space between amines and acids: two of the most common chemical building blocks. Herein we conduct a thorough exploration of amine-acid reaction space via systematic enumeration of reactions involving a simple amine-carboxylic acid pair. This approach to chemical space exploration investigates the coarse and fine modulation of physicochemical properties and molecular shapes. With the invention of reaction methods becoming increasingly automated and bringing conceptual reactions into reality, our map provides an entirely new axis of chemical space exploration for rational property design.
Affiliation(s)
- Rui Zhang
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Babak Mahjour
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Andrew Outlaw
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | - Andrew McGrath
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA
| | | | | | | | - Tim Cernak
- Department of Chemistry, University of Michigan, Ann Arbor, MI, USA.
- Department of Medicinal Chemistry, University of Michigan, Ann Arbor, MI, USA.
| |
|
40
|
Hadiby S, Ben Ali YM. Integrating pharmacophore model and deep learning for activity prediction of molecules with BRCA1 gene. J Bioinform Comput Biol 2024; 22:2450003. [PMID: 38567386 DOI: 10.1142/s0219720024500033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
In this paper, we propose a novel approach for predicting the activity/inactivity of molecules with the BRCA1 gene by combining pharmacophore modeling and deep learning techniques. Initially, we generated 3D pharmacophore fingerprints using a pharmacophore model, which captures the essential features and spatial arrangements critical for biological activity. These fingerprints served as informative representations of the molecular structures. Next, we employed deep learning algorithms to train a predictive model using the generated pharmacophore fingerprints. The deep learning model was designed to learn complex patterns and relationships between the pharmacophore features and the corresponding activity/inactivity labels of the molecules. By utilizing this integrated approach, we aimed to enhance the accuracy and efficiency of activity prediction. To validate the effectiveness of our approach, we conducted experiments using a dataset of known molecules with BRCA1 gene activity/inactivity from diverse sources. Our results demonstrated promising predictive performance, indicating the successful integration of pharmacophore modeling and deep learning. Furthermore, we utilized the trained model to predict the activity/inactivity of unknown molecules extracted from the ChEMBL database. The predictions obtained from the ChEMBL database were assessed and compared against experimentally determined values to evaluate the reliability and generalizability of our model. Overall, our proposed approach showcased significant potential in accurately predicting the activity/inactivity of molecules with the BRCA1 gene, thus enabling the identification of potential candidates for further investigation in drug discovery and development processes.
Affiliation(s)
- Seloua Hadiby
- Department of Computer Science, Computer Research Laboratory, Badji Mokhtar University, Annaba, Algeria
| | - Yamina Mohamed Ben Ali
- Department of Computer Science, Computer Research Laboratory, Badji Mokhtar University, Annaba, Algeria
| |
|
41
|
Song Z, Chen J, Cheng J, Chen G, Qi Z. Computer-Aided Molecular Design of Ionic Liquids as Advanced Process Media: A Review from Fundamentals to Applications. Chem Rev 2024; 124:248-317. [PMID: 38108629 DOI: 10.1021/acs.chemrev.3c00223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
The unique physicochemical properties, flexible structural tunability, and giant chemical space of ionic liquids (ILs) provide them a great opportunity to match different target properties to work as advanced process media. The crux of the matter is how to efficiently and reliably tailor suitable ILs toward a specific application. In this regard, the computer-aided molecular design (CAMD) approach has been widely adapted to cover this family of high-profile chemicals, that is, to perform computer-aided IL design (CAILD). This review discusses the past developments that have contributed to the state-of-the-art of CAILD and provides a perspective about how future works could pursue the acceleration of the practical application of ILs. In a broad context of CAILD, key aspects related to the forward structure-property modeling and reverse molecular design of ILs are overviewed. For the former forward task, diverse IL molecular representations, modeling algorithms, as well as representative models on physical properties, thermodynamic properties, among others of ILs are introduced. For the latter reverse task, representative works formulating different molecular design scenarios are summarized. Beyond the substantial progress made, some future perspectives to move CAILD a step forward are finally provided.
Affiliation(s)
- Zhen Song
- State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Jiahui Chen
- State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Jie Cheng
- State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Guzhong Chen
- State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| | - Zhiwen Qi
- State Key laboratory of Chemical Engineering, School of Chemical Engineering, East China University of Science and Technology, 130 Meilong Road, Shanghai 200237, China
| |
|
42
|
Zhu J, Che C, Jiang H, Xu J, Yin J, Zhong Z. SSF-DDI: a deep learning method utilizing drug sequence and substructure features for drug-drug interaction prediction. BMC Bioinformatics 2024; 25:39. [PMID: 38262923 PMCID: PMC10810255 DOI: 10.1186/s12859-024-05654-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024] Open
Abstract
BACKGROUND Drug-drug interactions (DDI) are prevalent in combination therapy, making it important to identify and predict potential DDI. While various artificial intelligence methods can predict and identify potential DDI, they often overlook the sequence information of drug molecules and fail to comprehensively consider the contribution of molecular substructures to DDI. RESULTS In this paper, we propose a novel model for DDI prediction based on sequence and substructure features (SSF-DDI) to address these issues. Our model integrates drug sequence features and structural features from the drug molecule graph, providing enhanced information for DDI prediction and enabling a more comprehensive and accurate representation of drug molecules. CONCLUSION The results of experiments and case studies have demonstrated that SSF-DDI significantly outperforms state-of-the-art DDI prediction models across multiple real datasets and settings. SSF-DDI performs better in predicting DDI involving unknown drugs, resulting in a 5.67% improvement in accuracy compared to state-of-the-art methods.
Affiliation(s)
- Jing Zhu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Chao Che
- School of Software Engineering, Dalian University, Dalian, 116000, China
| | - Hao Jiang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Jian Xu
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Jiajun Yin
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Zhaoqian Zhong
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China.
| |
|
43
|
Maziarka Ł, Majchrowski D, Danel T, Gaiński P, Tabor J, Podolak I, Morkisz P, Jastrzębski S. Relative molecule self-attention transformer. J Cheminform 2024; 16:3. [PMID: 38173009 PMCID: PMC10765783 DOI: 10.1186/s13321-023-00789-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Accepted: 11/28/2023] [Indexed: 01/05/2024] Open
Abstract
The prediction of molecular properties is a crucial aspect in drug discovery that can save a lot of money and time during the drug design process. The use of machine learning methods to predict molecular properties has become increasingly popular in recent years. Despite advancements in the field, several challenges remain that need to be addressed, like finding an optimal pre-training procedure to improve performance on small datasets, which are common in drug discovery. In our paper, we tackle these problems by introducing Relative Molecule Self-Attention Transformer for molecular representation learning. It is a novel architecture that uses relative self-attention and 3D molecular representation to capture the interactions between atoms and bonds that enrich the backbone model with domain-specific inductive biases. Furthermore, our two-step pretraining procedure allows us to tune only a few hyperparameter values to achieve good performance comparable with state-of-the-art models on a wide selection of downstream tasks.
Affiliation(s)
- Łukasz Maziarka
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Cracow, Poland.
| | | | - Tomasz Danel
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Cracow, Poland
| | - Piotr Gaiński
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Cracow, Poland
- Ardigen, Podole 76, 30-394, Cracow, Poland
| | - Jacek Tabor
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Cracow, Poland
| | - Igor Podolak
- Faculty of Mathematics and Computer Science, Jagiellonian University, Łojasiewicza 6, 30-348, Cracow, Poland
| | - Paweł Morkisz
- NVIDIA, 2788 San Tomas Expy, Santa Clara, CA, 95051, USA
| | | |
|
44
|
Ghahremanpour MM, Saar A, Tirado-Rives J, Jorgensen WL. Ensemble Geometric Deep Learning of Aqueous Solubility. J Chem Inf Model 2023; 63:7338-7349. [PMID: 37990484 DOI: 10.1021/acs.jcim.3c01536] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/23/2023]
Abstract
Geometric deep learning is one of the main workhorses for harnessing the power of big data to predict molecular properties such as aqueous solubility, which is key to the pharmacokinetic improvement of drug candidates. Two ensembles of graph neural network architectures were built, one based on spectral convolution and the other on spatial convolution. The pretrained models, denoted respectively as SolNet-GCN and SolNet-GAT, significantly outperformed the existing neural networks benchmarked on a validation set of 207 molecules. The SolNet-GCN model demonstrated the best performance on both the training and validation sets, with RMSE values of 0.53 and 0.72 log molar units and Pearson r² values of 0.95 and 0.75, respectively. Further, the ranking power of the SolNet models agreed well with a QM-based thermodynamic cycle approach at the PBE-vdW level of theory on a series of benzophenylurea derivatives and a series of benzodiazepine derivatives. Nevertheless, testing the resultant models on a set of inhibitors of the macrophage migration inhibitory factor (MIF) illustrated that the inclusion of atomic attributes to discriminate atoms with a higher tendency to form intermolecular hydrogen bonds in the crystalline state and to identify planar or nonplanar substructures can be beneficial for the prediction of aqueous solubility.
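The two metrics quoted above measure different things, which a small worked example makes visible: RMSE captures the absolute size of the errors, while the squared Pearson correlation only captures how well predictions track the trend. The sketch below computes both with the standard formulas; the toy values are made up for illustration and have nothing to do with the SolNet results.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation coefficient."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


# Toy logS values: every prediction is off by a constant 0.5 log units.
y_true = [-1.0, -2.0, -3.0, -4.0]
y_pred = [-1.5, -2.5, -3.5, -4.5]
print(f"RMSE = {rmse(y_true, y_pred):.2f} log units")
print(f"r^2  = {pearson_r2(y_true, y_pred):.2f}")
```

A constant offset leaves the correlation perfect while the RMSE stays at 0.5, which is one reason solubility models are usually reported with both numbers.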
Affiliation(s)
| | - Anastasia Saar
- Department of Chemistry, Yale University New Haven, Connecticut 06520-8107, United States
| | - Julian Tirado-Rives
- Department of Chemistry, Yale University New Haven, Connecticut 06520-8107, United States
| | - William L Jorgensen
- Department of Chemistry, Yale University New Haven, Connecticut 06520-8107, United States
| |
|
45
|
McGibbon M, Shave S, Dong J, Gao Y, Houston DR, Xie J, Yang Y, Schwaller P, Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief Bioinform 2023; 25:bbad422. [PMID: 38033290 PMCID: PMC10689004 DOI: 10.1093/bib/bbad422] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 09/01/2023] [Revised: 10/13/2023] [Accepted: 11/01/2023] [Indexed: 12/02/2023] Open
Abstract
Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.
Affiliation(s)
- Miles McGibbon
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Steven Shave
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jie Dong
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha, 410013, China
| | - Yumiao Gao
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Douglas R Houston
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| | - Jiancong Xie
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Yuedong Yang
- Key Laboratory of Machine Intelligence and Advanced Computing, Sun Yat-Sen University, Guangzhou, 510000, China
| | - Philippe Schwaller
- Laboratory of Artificial Chemical Intelligence (LIAC), Institut des Sciences et Ingénierie Chimiques, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
| | - Vincent Blay
- Institute of Quantitative Biology, Biochemistry and Biotechnology, University of Edinburgh, Edinburgh, Scotland EH9 3BF, United Kingdom
| |
|
46
|
Wang Z, Feng Z, Li Y, Li B, Wang Y, Sha C, He M, Li X. BatmanNet: bi-branch masked graph transformer autoencoder for molecular representation. Brief Bioinform 2023; 25:bbad400. [PMID: 38033291 PMCID: PMC10783874 DOI: 10.1093/bib/bbad400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 10/02/2023] [Accepted: 10/17/2023] [Indexed: 12/02/2023] Open
Abstract
Although substantial efforts have been made using graph neural networks (GNNs) for artificial intelligence (AI)-driven drug discovery, effective molecular representation learning remains an open challenge, especially in the case of insufficient labeled molecules. Recent studies suggest that big GNN models pre-trained by self-supervised learning on unlabeled datasets enable better transfer performance in downstream molecular property prediction tasks. However, the approaches in these studies require multiple complex self-supervised tasks and large-scale datasets, which are time-consuming, computationally expensive and difficult to pre-train end-to-end. Here, we design a simple yet effective self-supervised strategy to simultaneously learn local and global information about molecules, and further propose a novel bi-branch masked graph transformer autoencoder (BatmanNet) to learn molecular representations. BatmanNet features two tailored complementary and asymmetric graph autoencoders to reconstruct the missing nodes and edges, respectively, from a masked molecular graph. With this design, BatmanNet can effectively capture the underlying structure and semantic information of molecules, thus improving the performance of molecular representation. BatmanNet achieves state-of-the-art results for multiple drug discovery tasks, including molecular property prediction, drug-drug interaction and drug-target interaction, on 13 benchmark datasets, demonstrating its great potential and superiority in molecular representation learning.
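The masking step that such a reconstruction objective starts from can be illustrated with a small sketch. This is not BatmanNet's implementation: the `mask_graph` function, the `[MASK]` token, the 40% ratio, and the dictionary-based graph format are all assumptions made for illustration. It only shows how node and edge labels might be hidden so that a model could be trained to recover them.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token

def mask_graph(atoms, bonds, ratio=0.4, seed=0):
    """Hide a fraction of atom and bond labels in a non-empty molecular graph.

    atoms: list of element symbols, indexed by atom id.
    bonds: dict mapping (i, j) atom-index pairs to a bond-type label.
    Returns the masked atoms, the masked bonds, and the hidden labels
    that a reconstruction model would be trained to recover.
    """
    rng = random.Random(seed)
    # Mask at least one item of each kind, but never more than exist.
    n_atoms = min(len(atoms), max(1, round(ratio * len(atoms))))
    n_bonds = min(len(bonds), max(1, round(ratio * len(bonds))))
    hidden_atoms = set(rng.sample(range(len(atoms)), n_atoms))
    hidden_bonds = set(rng.sample(sorted(bonds), n_bonds))
    masked_atoms = [MASK if i in hidden_atoms else a for i, a in enumerate(atoms)]
    masked_bonds = {k: (MASK if k in hidden_bonds else v) for k, v in bonds.items()}
    atom_targets = {i: atoms[i] for i in hidden_atoms}
    bond_targets = {k: bonds[k] for k in hidden_bonds}
    return masked_atoms, masked_bonds, atom_targets, bond_targets

# Heavy atoms of ethanol (C-C-O), used as a toy input.
atoms = ["C", "C", "O"]
bonds = {(0, 1): "single", (1, 2): "single"}
masked_atoms, masked_bonds, atom_targets, bond_targets = mask_graph(atoms, bonds)
print(masked_atoms, masked_bonds)
```

Keeping the hidden labels as separate target dictionaries mirrors the idea of reconstructing nodes and edges as two distinct tasks, one per branch.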
Affiliation(s)
- Zhen Wang
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082, Hunan, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Zheng Feng
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, 32611, FL, USA
| | - Yanjun Li
- Department of Medicinal Chemistry, College of Pharmacy, University of Florida, Gainesville, 32610, FL, USA
- Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, 32610, FL, USA
| | - Bowen Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Yongrui Wang
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Chulin Sha
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Min He
- College of Electrical and Information Engineering, Hunan University, Changsha, 410082, Hunan, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
| | - Xiaolin Li
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, Zhejiang, China
- ElasticMind Inc, Hangzhou, 310018, Zhejiang, China
| |
|
47
|
Shen C, Luo J, Xia K. Molecular geometric deep learning. Cell Rep Methods 2023; 3:100621. [PMID: 37875121 PMCID: PMC10694498 DOI: 10.1016/j.crmeth.2023.100621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Revised: 06/16/2023] [Accepted: 09/28/2023] [Indexed: 10/26/2023]
Abstract
Molecular representation learning plays an important role in molecular property prediction. Existing molecular property prediction models rely on the de facto standard of covalent-bond-based molecular graphs for representing molecular topology at the atomic level and totally ignore the non-covalent interactions within the molecule. In this study, we propose a molecular geometric deep learning model to predict the properties of molecules that aims to comprehensively consider the information of covalent and non-covalent interactions of molecules. The essential idea is to incorporate a more general molecular representation into geometric deep learning (GDL) models. We systematically test molecular GDL (Mol-GDL) on fourteen commonly used benchmark datasets. The results show that Mol-GDL can achieve a better performance than state-of-the-art (SOTA) methods. Extensive tests have demonstrated the important role of non-covalent interactions in molecular property prediction and the effectiveness of Mol-GDL models.
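One simple way to picture a representation that goes beyond covalent bonds is to derive edges from interatomic distances. The sketch below is an assumption-laden illustration, not the Mol-GDL code: the `distance_graphs` function, the two cutoffs (1.7 Å and 4.0 Å), and the toy coordinates are all hypothetical. It splits atom pairs into a short-range (bond-like) edge set and a longer-range (non-covalent-like) edge set.

```python
import math
from itertools import combinations

def distance_graphs(coords, covalent_max=1.7, noncovalent_max=4.0):
    """Split atom pairs into two edge sets by interatomic distance.

    coords: list of (x, y, z) positions in angstroms.
    Pairs within covalent_max are treated as bond-like edges; pairs
    beyond that but within noncovalent_max as non-covalent-like edges.
    Both cutoffs are illustrative assumptions.
    """
    covalent, noncovalent = [], []
    for (i, a), (j, b) in combinations(enumerate(coords), 2):
        d = math.dist(a, b)
        if d <= covalent_max:
            covalent.append((i, j))
        elif d <= noncovalent_max:
            noncovalent.append((i, j))
    return covalent, noncovalent

# Hypothetical four-atom linear chain with 1.5 angstrom spacing.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
covalent, noncovalent = distance_graphs(coords)
print("bond-like edges:", covalent)
print("non-covalent-like edges:", noncovalent)
```

A message-passing model could then be run over each edge set separately and the results combined, so that through-space contacts contribute alongside the bonded topology.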
Affiliation(s)
- Cong Shen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China; School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410000, China.
| | - Kelin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore.
| |
|
48
|
Xia S, Chen E, Zhang Y. Integrated Molecular Modeling and Machine Learning for Drug Design. J Chem Theory Comput 2023; 19:7478-7495. [PMID: 37883810 PMCID: PMC10653122 DOI: 10.1021/acs.jctc.3c00814] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/26/2023] [Revised: 10/10/2023] [Accepted: 10/11/2023] [Indexed: 10/28/2023]
Abstract
Modern therapeutic development often involves several stages that are interconnected, and multiple iterations are usually required to bring a new drug to the market. Computational approaches have increasingly become an indispensable part of helping reduce the time and cost of the research and development of new drugs. In this Perspective, we summarize our recent efforts on integrating molecular modeling and machine learning to develop computational tools for modulator design, including a pocket-guided rational design approach based on AlphaSpace to target protein-protein interactions, delta machine learning scoring functions for protein-ligand docking as well as virtual screening, and state-of-the-art deep learning models to predict calculated and experimental molecular properties based on molecular mechanics optimized geometries. Meanwhile, we discuss remaining challenges and promising directions for further development and use a retrospective example of the FDA-approved kinase inhibitor erlotinib to demonstrate the use of these newly developed computational tools.
Affiliation(s)
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Eric Chen
- Department of Chemistry, New York University, New York, New York 10003, United States
| | - Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
| |
|
49
|
Wilson AN, St John PC, Marin DH, Hoyt CB, Rognerud EG, Nimlos MR, Cywar RM, Rorrer NA, Shebek KM, Broadbelt LJ, Beckham GT, Crowley MF. PolyID: Artificial Intelligence for Discovering Performance-Advantaged and Sustainable Polymers. Macromolecules 2023; 56:8547-8557. [PMID: 38024155 PMCID: PMC10653284 DOI: 10.1021/acs.macromol.3c00994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2023] [Revised: 09/30/2023] [Indexed: 12/01/2023]
Abstract
A necessary transformation for a sustainable economy is the transition from fossil-derived plastics to polymers derived from biomass and waste resources. While renewable feedstocks can enhance material performance through unique chemical moieties, probing the vast material design space by experiment alone is not practically feasible. Here, we develop a machine-learning-based tool, PolyID, to reduce the design space of renewable feedstocks to enable efficient discovery of performance-advantaged, biobased polymers. PolyID is a multioutput, graph neural network specifically designed to increase accuracy and to enable quantitative structure-property relationship (QSPR) analysis for polymers. It includes a novel domain-of-validity method that was developed and applied to demonstrate how gaps in training data can be filled to improve accuracy. The model was benchmarked with both a 20% held-out subset of the original training data and 22 experimentally synthesized polymers. A mean absolute error for the glass transition temperatures of 19.8 and 26.4 °C was achieved for the test and experimental data sets, respectively. Predictions were made on polymers composed of monomers from four databases that contain biologically accessible small molecules: MetaCyc, MINEs, KEGG, and BiGG. From 1.4 × 10⁶ accessible biobased polymers, we identified five poly(ethylene terephthalate) (PET) analogues with predicted improvements to thermal and transport performance. Experimental validation for one of the PET analogues demonstrated a glass transition temperature between 85 and 112 °C, which is higher than PET and within the predicted range of the PolyID tool. In addition to accurate predictions, we show how the model's predictions are explainable through analysis of individual bond importance for a biobased nylon. Overall, PolyID can aid the biobased polymer practitioner to navigate the vast number of renewable polymers to discover sustainable materials with enhanced performance.
Affiliation(s)
- A. Nolan Wilson
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Peter C. St John
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Daniela H. Marin
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Caroline B. Hoyt
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Erik G. Rognerud
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Mark R. Nimlos
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Robin M. Cywar
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Nicholas A. Rorrer
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Kevin M. Shebek
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
- Department of Chemical and Biological Engineering and Center for Synthetic Biology, Northwestern University, Evanston, Illinois 60208, United States
- Chemistry of Life Processes Institute, Northwestern University, Evanston, Illinois 60208, United States
| | - Linda J. Broadbelt
- Department of Chemical and Biological Engineering and Center for Synthetic Biology, Northwestern University, Evanston, Illinois 60208, United States
| | - Gregg T. Beckham
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| | - Michael F. Crowley
- Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, 15013 Denver West Parkway, Golden, Colorado 80401, United States
| |
|
50
|
Rebello NJ, Lin TS, Nazeer H, Olsen BD. BigSMARTS: A Topologically Aware Query Language and Substructure Search Algorithm for Polymer Chemical Structures. J Chem Inf Model 2023; 63:6555-6568. [PMID: 37874026 DOI: 10.1021/acs.jcim.3c00978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Molecular search is important in chemistry, biology, and informatics for identifying molecular structures within large data sets, improving knowledge discovery and innovation, and making chemical data FAIR (findable, accessible, interoperable, reusable). Search algorithms for polymers are significantly less developed than those for small molecules because polymer search relies on searching by polymer name, which can be challenging because polymer naming is overly broad (e.g., polyethylene), complicated for complex chemical structures, and often does not correspond to official IUPAC conventions. Chemical structure search in polymers is limited to substructures, such as monomers, without awareness of connectivity or topology. This work introduces a novel query language and graph traversal search algorithm for polymers that provides the first search method able to fully capture all of the chemical structures present in polymers. The BigSMARTS query language, an extension of the small-molecule SMARTS language, allows users to write queries that localize monomer and functional group searches to different parts of the polymer, like the middle block of a triblock, the side chain of a graft, and the backbone of a repeat unit. The substructure search algorithm is based on the traversal of graph representations of the generating functions for the stochastic graphs of polymers. Operationally, the algorithm first identifies cycles representing the monomers and then the end groups and finally performs a depth-first search to match entire subgraphs. To validate the algorithm, hundreds of queries were searched against hundreds of target chemistries and topologies from the literature, with approximately 440,000 query-target pairs. This tool provides a detailed algorithm that can be implemented in search engines to provide search results with full matching of the monomer connectivity and polymer topology.
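The final step described above, a depth-first search that matches an entire subgraph, can be sketched generically as a backtracking search over labeled graphs. This is not the BigSMARTS algorithm (it omits the cycle and end-group identification and the stochastic-graph representation); the `find_match` function, the dictionary graph format, and the toy A-B-A triblock are assumptions made for illustration.

```python
def find_match(query, target):
    """Depth-first backtracking search for a query graph inside a target graph.

    Each graph is a dict: {"labels": {node: label}, "adj": {node: set_of_neighbors}}.
    Returns one mapping query_node -> target_node, or None if no match exists.
    """
    q_nodes = list(query["labels"])

    def consistent(q, t, mapping):
        if query["labels"][q] != target["labels"][t]:
            return False
        # Every already-mapped neighbor of q must map to a neighbor of t.
        return all(mapping[n] in target["adj"][t]
                   for n in query["adj"][q] if n in mapping)

    def dfs(index, mapping, used):
        if index == len(q_nodes):
            return dict(mapping)
        q = q_nodes[index]
        for t in target["labels"]:
            if t not in used and consistent(q, t, mapping):
                mapping[q] = t
                used.add(t)
                result = dfs(index + 1, mapping, used)
                if result is not None:
                    return result
                # Undo the tentative assignment and try the next candidate.
                del mapping[q]
                used.discard(t)
        return None

    return dfs(0, {}, set())

# Toy A-B-A triblock: nodes are blocks, edges are block junctions.
target = {"labels": {0: "A", 1: "B", 2: "A"},
          "adj": {0: {1}, 1: {0, 2}, 2: {1}}}
# Query: a B block directly connected to an A block.
query = {"labels": {"x": "B", "y": "A"},
         "adj": {"x": {"y"}, "y": {"x"}}}
# Query with no match: two adjacent B blocks.
bb_query = {"labels": {"x": "B", "y": "B"},
            "adj": {"x": {"y"}, "y": {"x"}}}

print(find_match(query, target))
print(find_match(bb_query, target))
```

Because the search backtracks on the first inconsistency, it rejects a candidate as soon as either a label or a required connection fails, which is the property that lets a query distinguish, say, a middle block from an end block.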
Affiliation(s)
- Nathan J Rebello
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Tzyy-Shyang Lin
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| | - Heeba Nazeer
- Department of Computer Science, Wellesley College, 106 Central Street, Wellesley, Massachusetts 02481, United States
| | - Bradley D Olsen
- Department of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
| |
|