1
Han Y, Xu X, Hsieh CY, Ding K, Xu H, Xu R, Hou T, Zhang Q, Chen H. Retrosynthesis prediction with an iterative string editing model. Nat Commun 2024; 15:6404. [PMID: 39080274 PMCID: PMC11289138 DOI: 10.1038/s41467-024-50617-1] [Received: 08/16/2023] [Accepted: 07/09/2024] [Indexed: 08/02/2024] Open
Abstract
Retrosynthesis is a crucial task in drug discovery and organic synthesis, where artificial intelligence (AI) is increasingly employed to expedite the process. However, existing approaches employ token-by-token decoding methods to translate target molecule strings into corresponding precursors, exhibiting unsatisfactory performance and limited diversity. As chemical reactions typically induce local molecular changes, reactants and products often overlap significantly. Inspired by this fact, we propose reframing single-step retrosynthesis prediction as a molecular string editing task, iteratively refining target molecule strings to generate precursor compounds. Our proposed approach involves a fragment-based generative editing model that uses explicit sequence editing operations. Additionally, we design an inference module with reposition sampling and sequence augmentation to enhance both prediction accuracy and diversity. Extensive experiments demonstrate that our model generates high-quality and diverse results, achieving superior performance with a promising top-1 accuracy of 60.8% on the standard benchmark dataset USPTO-50K.
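The overlap between product and reactant strings that motivates the editing formulation can be illustrated with an ordinary sequence diff. The following is a minimal sketch, not the authors' model: it uses Python's standard `difflib` on a hypothetical amide disconnection (the SMILES strings and the helper names `edit_ops` and `overlap_ratio` are illustrative assumptions) to list the character-level edits that turn a product string into a precursor string.

```python
from difflib import SequenceMatcher


def edit_ops(source: str, target: str):
    """Return the non-trivial edit operations that turn source into target."""
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the operation, the removed text, and the inserted text.
            ops.append((tag, source[i1:i2], target[j1:j2]))
    return ops


def overlap_ratio(source: str, target: str) -> float:
    """Similarity in [0, 1]; values near 1 mean the strings mostly overlap."""
    return SequenceMatcher(a=source, b=target, autojunk=False).ratio()


# Hypothetical example: N-methylacetamide and its two precursors.
product = "CC(=O)NC"
precursors = "CC(=O)O.NC"

print(edit_ops(product, precursors))
print(round(overlap_ratio(product, precursors), 2))
```

Here a single insertion converts the product string into the precursor string while the rest is left untouched. A learned editing model would have to predict such operations, whereas this diff computes them from a known answer.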
Affiliation(s)
- Yuqiang Han
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Xiaoyang Xu
  - Polytechnic Institute, Zhejiang University, Hangzhou, 310015, China
- Chang-Yu Hsieh
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310018, China
- Keyan Ding
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Hongxia Xu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310018, China
  - Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 311121, China
- Renjun Xu
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310018, China
- Qiang Zhang
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
- Huajun Chen
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, 311200, China
  - Zhejiang-University-Ant-Group Joint Center for Knowledge Graphs, Hangzhou, 310000, China
  - Hangzhou Institute of Medicine Chinese Academy of Science, Hangzhou, 310023, China
2
Yuan Y, Tang X, Li H, Lang X, Li C, Song Y, Sun S, Yang Y, Zhou Z. KLSD: a kinase database focused on ligand similarity and diversity. Front Pharmacol 2024; 15:1400136. [PMID: 38957398 PMCID: PMC11217335 DOI: 10.3389/fphar.2024.1400136] [Received: 03/14/2024] [Accepted: 05/28/2024] [Indexed: 07/04/2024] Open
Abstract
Due to the similarity and diversity among kinases, small molecule kinase inhibitors (SMKIs) often display multi-target effects or selectivity, which correlate strongly with the efficacy and safety of these inhibitors. However, because well-known databases are few and offer restricted data mining capabilities, and because databases focusing on the pharmacological similarity and diversity of SMKIs are scarce, researchers find it challenging to access relevant information quickly. The KLIFS database is representative of specialized application databases in the field, focusing on kinase structure and co-crystallised kinase-ligand interactions, whereas the KLSD database in this paper emphasizes the analysis of SMKIs among all reported kinase targets. To address the lack of professional application databases in kinase research and to provide centralized, standardized, reliable and efficient data resources for kinase researchers, this paper proposes a research program based on the ChEMBL database that focuses on comparing kinase ligand activities. The scheme extracts kinase data, standardizes and normalizes them, and then performs kinase target difference analysis to determine kinase activity thresholds. It then constructs a specialized and personalized kinase database platform as an extensible web application, built on a Spring Boot architecture with separate front end and back end, that handles the storage, retrieval and analysis of the data and ultimately provides data visualization and interaction. This study aims to develop a kinase database platform to collect, organize, and provide standardized data related to kinases. By offering essential resources and tools, it supports kinase research and drug development, thereby advancing scientific research and innovation in kinase-related fields. It is freely accessible at: http://ai.njucm.edu.cn:8080.
Affiliation(s)
- Yuqian Yuan
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
- Xiaozhu Tang
  - School of Medicine and Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, China
- Hongyan Li
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
- Xufeng Lang
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
- Can Li
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
- Yihua Song
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
- Shanliang Sun
  - National and Local Collaborative Engineering Center of Chinese Medicinal Resources Industrialization and Formulae Innovative Medicine, Jiangsu Collaborative Innovation Center of Chinese Medicinal Resources Industrialization, Jiangsu Key Laboratory for High Technology Research of TCM Formulae, Nanjing University of Chinese Medicine, Nanjing, China
- Ye Yang
  - School of Medicine and Holistic Integrative Medicine, Nanjing University of Chinese Medicine, Nanjing, China
- Zuojian Zhou
  - School of Artificial Intelligence and Information Technology, Nanjing University of Chinese Medicine, Nanjing, China
3
Das M, Ghosh A, Sunoj RB. Advances in machine learning with chemical language models in molecular property and reaction outcome predictions. J Comput Chem 2024; 45:1160-1176. [PMID: 38299229 DOI: 10.1002/jcc.27315] [Received: 11/22/2023] [Revised: 01/06/2024] [Accepted: 01/09/2024] [Indexed: 02/02/2024]
Abstract
Molecular properties and reactions form the foundation of chemical space. Over the years, innumerable molecules have been synthesized; a small fraction of them found immediate applications, while a larger proportion served as a testimony to the creative and empirical nature of chemical science. With increasing emphasis on sustainable practices, it is desirable that a target set of molecules is synthesized through fewer empirical attempts, rather than through a larger library, to realize an active candidate. On this front, predictive endeavors using machine learning (ML) models built on available data are particularly timely. Prediction of molecular properties and reaction outcomes remains one of the burgeoning applications of ML in chemical science. Among the several methods of encoding molecular samples for ML models, those that employ language-like representations are gaining steady popularity. Such representations additionally help adopt well-developed natural language processing (NLP) models for chemical applications. Given this advantageous background, herein we describe several successful chemical applications of NLP, focusing on molecular property and reaction outcome predictions. From relatively simple recurrent neural networks (RNNs) to complex models like transformers, different network architectures have been leveraged for tasks such as de novo drug design, catalyst generation, and forward and retro-synthesis predictions. The chemical language model (CLM) provides promising avenues toward a broad range of applications in a time- and cost-effective manner. While we present an optimistic outlook for CLMs, attention is also given to the persisting challenges in the reaction domain, which will hopefully be addressed by advanced algorithms tailored to chemical language and by the increased availability of high-quality datasets.
Affiliation(s)
- Manajit Das
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Ankit Ghosh
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
- Raghavan B Sunoj
  - Department of Chemistry, Indian Institute of Technology Bombay, Mumbai, India
  - Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai, India
4
Kotlyarov R, Papachristos K, Wood GPF, Goodman JM. Leveraging Language Model Multitasking To Predict C-H Borylation Selectivity. J Chem Inf Model 2024; 64:4286-4297. [PMID: 38708520 PMCID: PMC11134489 DOI: 10.1021/acs.jcim.4c00137] [Received: 02/01/2024] [Revised: 04/05/2024] [Accepted: 04/23/2024] [Indexed: 05/07/2024]
Abstract
C-H borylation is a high-value transformation in the synthesis of lead candidates for the pharmaceutical industry because a wide array of downstream coupling reactions is available. However, predicting its regioselectivity, especially in drug-like molecules that may contain multiple heterocycles, is not a trivial task. Using a data set of borylation reactions from Reaxys, we explored how a language model originally trained on USPTO_500_MT, a broad-scope set of patent data, can be used to predict the C-H borylation reaction product in different modes: product generation and site reactivity classification. Our fine-tuned T5Chem multitask language model can generate the correct product in 79% of cases. It can also classify the reactive aromatic C-H bonds with 95% accuracy and 88% positive predictive value, exceeding purpose-developed graph-based neural networks.
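Accuracy and positive predictive value (PPV) answer different questions about a site classifier, which is presumably why the abstract reports both. As a minimal sketch with made-up confusion-matrix counts (not the paper's data; the function names are illustrative), the two metrics can be computed as follows.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all candidate sites that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of sites predicted reactive that are truly reactive."""
    return tp / (tp + fp)


# Hypothetical counts, for illustration only.
counts = {"tp": 30, "fp": 10, "tn": 150, "fn": 10}

print(accuracy(**counts))
print(positive_predictive_value(counts["tp"], counts["fp"]))
```

When unreactive sites outnumber reactive ones, as in these invented counts, accuracy can stay high even if a sizeable share of the positive predictions is wrong, so PPV is the more demanding of the two figures.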
Affiliation(s)
- Ruslan Kotlyarov
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
- Geoffrey P. F. Wood
  - Exscientia Plc, The Schrödinger Building, Oxford Science Park, Oxford OX4 4GE, U.K.
- Jonathan M. Goodman
  - Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, U.K.
5
Qiu X, Wang H, Tan X, Fang Z. G-K BertDTA: A graph representation learning and semantic embedding-based framework for drug-target affinity prediction. Comput Biol Med 2024; 173:108376. [PMID: 38552281 DOI: 10.1016/j.compbiomed.2024.108376] [Received: 12/21/2023] [Revised: 03/21/2024] [Accepted: 03/24/2024] [Indexed: 04/17/2024]
Abstract
Developing new drugs is costly, time-consuming, and risky. Drug-target affinity (DTA), indicating the binding capability between drugs and target proteins, is a crucial indicator for drug development. Accurately predicting interaction strength between new drug-target pairs by analyzing previous experiments aids in screening potential drug molecules, repurposing them, and developing safe and effective medicines. Existing computational models for DTA prediction rely on strings or single-graph neural networks and do not consider protein structure and molecular semantic information, which limits their accuracy. Our experiments demonstrate that string-based methods may overlook protein conformations, causing a high root mean square error (RMSE) of 3.584 in affinity due to a lack of spatial context. Single graph networks also underperform on topology features, with a 6% lower confidence interval (CI) for activity classification. The absence of semantic information also limits generalization across diverse compounds, resulting in an 18% increase in RMSE and a 5% increase in misclassifications in the quantification study, which restricts potential drug discovery. To address these limitations, we propose G-K BertDTA, a novel framework for accurate DTA prediction incorporating protein features, molecular semantic features, and molecular structural information. In this proposed model, we represent drugs as graphs, with a GIN employed to learn the molecular topological information. For the extraction of protein structural features, we utilize a DenseNet architecture. A knowledge-based BERT semantic model is incorporated to obtain rich pre-trained semantic embeddings, thereby enhancing the feature information. We extensively evaluated our proposed approach on the publicly available benchmark datasets (i.e., KIBA and Davis), and experimental results demonstrate the promising performance of our method, which consistently outperforms previous state-of-the-art approaches.
Code is available at https://github.com/AmbitYuki/G-K-BertDTA.
Affiliation(s)
- Xihe Qiu
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Haoyu Wang
  - School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai, China
- Xiaoyu Tan
  - INF Technology (Shanghai) Co., Ltd., Shanghai, China
- Zhijun Fang
  - School of Computer Science and Technology, Donghua University, Shanghai, China
6
Meewan I, Panmanee J, Petchyam N, Lertvilai P. HBCVTr: an end-to-end transformer with a deep neural network hybrid model for anti-HBV and HCV activity predictor from SMILES. Sci Rep 2024; 14:9262. [PMID: 38649402 PMCID: PMC11035669 DOI: 10.1038/s41598-024-59933-4] [Received: 01/09/2024] [Accepted: 04/16/2024] [Indexed: 04/25/2024] Open
Abstract
Hepatitis B and C viruses (HBV and HCV) are significant causes of chronic liver diseases, with approximately 350 million infections globally. To accelerate the finding of effective treatment options, we introduce HBCVTr, a novel ligand-based drug design (LBDD) method for predicting the inhibitory activity of small molecules against HBV and HCV. HBCVTr employs a hybrid model consisting of double encoders of transformers and a deep neural network to learn the relationship between small molecules' simplified molecular-input line-entry system (SMILES) and their antiviral activity against HBV or HCV. The prediction accuracy of HBCVTr has surpassed baseline machine learning models and existing methods, with R-squared values of 0.641 and 0.721 for the HBV and HCV test sets, respectively. The trained models were successfully applied to virtual screening against 10 million compounds within 240 h, leading to the discovery of the top novel inhibitor candidates, including IJN04 for HBV and IJN12 and IJN19 for HCV. Molecular docking and dynamics simulations identified IJN04, IJN12, and IJN19 target proteins as the HBV core antigen, HCV NS5B RNA-dependent RNA polymerase, and HCV NS3/4A serine protease, respectively. Overall, HBCVTr offers a new and rapid drug discovery and development screening method targeting HBV and HCV.
Affiliation(s)
- Ittipat Meewan
  - Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand
- Jiraporn Panmanee
  - Research Center for Neuroscience, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand
- Nopphon Petchyam
  - Center for Advanced Therapeutics, Institute of Molecular Biosciences, Mahidol University, Nakhon Pathom, 73170, Thailand
- Pichaya Lertvilai
  - Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92037, USA
7
Lin J, He Y, Ru C, Long W, Li M, Wen Z. Advancing Adverse Drug Reaction Prediction with Deep Chemical Language Model for Drug Safety Evaluation. Int J Mol Sci 2024; 25:4516. [PMID: 38674100 PMCID: PMC11050562 DOI: 10.3390/ijms25084516] [Received: 03/26/2024] [Revised: 04/15/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
The accurate prediction of adverse drug reactions (ADRs) is essential for comprehensive drug safety evaluation. Pre-trained deep chemical language models have emerged as powerful tools capable of automatically learning molecular structural features from large-scale datasets, showing promising capabilities for the downstream prediction of molecular properties. However, the performance of pre-trained chemical language models in predicting ADRs, especially idiosyncratic ADRs induced by marketed drugs, remains largely unexplored. In this study, we propose MoLFormer-XL, a pre-trained model for encoding molecular features from canonical SMILES, in conjunction with a CNN-based model to predict drug-induced QT interval prolongation (DIQT), drug-induced teratogenicity (DIT), and drug-induced rhabdomyolysis (DIR). Our results demonstrate that the proposed model outperforms conventional models applied in previous studies for predicting DIQT, DIT, and DIR. Notably, an analysis of the learned linear attention maps highlights amines, alcohols, ethers, and aromatic halogen compounds as strongly associated with the three types of ADRs. These findings hold promise for enhancing drug discovery pipelines and reducing the drug attrition rate due to safety concerns.
Affiliation(s)
- Jinzhu Lin
  - College of Chemistry, Sichuan University, Chengdu 610064, China
- Yujie He
  - College of Chemistry, Sichuan University, Chengdu 610064, China
- Chengxiang Ru
  - College of Chemistry, Sichuan University, Chengdu 610064, China
- Wulin Long
  - College of Chemistry, Sichuan University, Chengdu 610064, China
- Menglong Li
  - College of Chemistry, Sichuan University, Chengdu 610064, China
- Zhining Wen
  - College of Chemistry, Sichuan University, Chengdu 610064, China
  - Medical Big Data Center, Sichuan University, Chengdu 610064, China
8
Chen C, Huang Z, Zou X, Li S, Zhang D, Wang SL. Prediction of molecular-specific mutagenic alerts and related mechanisms of chemicals by a convolutional neural network (CNN) model based on SMILES split. Sci Total Environ 2024; 917:170435. [PMID: 38286298 DOI: 10.1016/j.scitotenv.2024.170435] [Received: 11/11/2023] [Revised: 01/20/2024] [Accepted: 01/23/2024] [Indexed: 01/31/2024]
Abstract
Structural alerts (SAs) are essential to identify chemicals for toxicity evaluation and health risk assessment. We constructed a novel SMILES split-based deep learning model (SSDL) that was trained and verified with 5850 chemicals from the ISSSTY database and 384 external test chemicals from published papers. The training accuracy was above 0.90 and the evaluation metrics (precision, recall and F1-score) all reached 0.78 or above on both internal and external test chemicals. In this model, the molecular-specific fragment importance of chemicals was first quantified independently. Then, the SA identification method based on the importance of these fragments was statistically analyzed and verified with the ISSSTY test and external test chemicals containing one of 28 typical SAs, and in most cases its performance was better than that of expert rules. Furthermore, a mutagenicity mechanism prediction method was developed using 237 chemicals with four known mutagenic mechanisms, based on molecular similarity calibrated by the SSDL method and fragment importance. Compared to traditional methods, it significantly improved accuracy for three mechanisms and had comparable accuracy for the fourth. Overall, the SSDL model, which quantifies fragment toxicity within molecules, could be a novel and potentially powerful tool for the determination and visualization of molecular-specific SAs and the prediction of mutagenicity mechanisms for environmental or industrial compounds and drugs.
Affiliation(s)
- Chao Chen
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China
- Zhengliang Huang
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China; School of Public Health, Hubei University of Medicine, Shiyan 442000, PR China
- Xuyan Zou
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China
- Sheng Li
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China
- Di Zhang
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China
- Shou-Lin Wang
  - Key Laboratory of Modern Toxicology of Ministry of Education, Center for Global Health, School of Public Health, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China; State Key Lab of Reproductive Medicine and Offspring Health, Institute of Toxicology, Nanjing Medical University, 101 Longmian Avenue, Nanjing 211166, PR China
9
An L, Chen B, Zhang Y, Li H, Huang R, Li F, Tang Y. Compound Similarity Network as a Novel Data Mining Strategy for High-Throughput Investigation of Degradation Pathways of Organic Pollutants in Industrial Wastewater Treatment. Anal Chem 2024; 96:3951-3959. [PMID: 38377587 DOI: 10.1021/acs.analchem.3c05983] [Indexed: 02/22/2024]
Abstract
Identification of degradation products and pathways is crucial for investigating emerging pollutants and evaluating wastewater treatment methods. Nontargeted analysis is a powerful tool to comprehensively investigate the degradation pathways of organic pollutants in real-world wastewater samples but often generates large data sets, making it difficult to effectively locate the exact information of interest. Herein, to efficiently establish the linkages among compounds in the same degradation pathways, we introduce a compound similarity network (CSN) as a novel data mining strategy for LC-MS-based nontargeted analysis of complex wastewater samples. Different from molecular networks that cluster compounds based on MS/MS spectra similarity, our CSN strategy harnesses molecular fingerprints to establish linkages among compounds and thus is spectra-independent. The effectiveness of CSN was demonstrated by nontargeted identification of degradation pathways and products of organic pollutants in leather industrial wastewater that underwent laboratory-scale activated carbon adsorption (ACD) and ozonation treatments. Utilizing CSN in interpreting nontargeted data, we tentatively annotated 4324 compounds in the untreated leather industrial wastewater, 3246 after ACD, and 3777 after ACD/ozonation. We located 145 potential degradation pathways of organic pollutants in the ACD/ozonation process using CSN and validated 7 pathways with 15 chemical standards. CSN also revealed 5 clusters of emerging pollutants, from which 3 compounds were selected for in vitro cytotoxicity study to evaluate their potential biohazards as new pollutants. As CSN offers an efficient way to connect large numbers of compounds and to find multiple degradation pathways in a high-throughput manner, we anticipate that it will find wide applications in nontargeted analysis of diverse environmental samples.
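The core idea of linking compounds by fingerprint similarity rather than by spectra can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' pipeline: fingerprints are represented as hypothetical sets of on-bit indices, similarity is the Tanimoto coefficient (a common choice that the abstract does not specify), and the 0.5 threshold and compound names are invented.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_edges(fingerprints: dict, threshold: float):
    """Link every pair of compounds whose similarity meets the threshold."""
    edges = []
    for first, second in combinations(sorted(fingerprints), 2):
        score = tanimoto(fingerprints[first], fingerprints[second])
        if score >= threshold:
            edges.append((first, second, score))
    return edges


def clusters(names, edges):
    """Group compounds into connected components of the similarity network."""
    neighbours = {name: set() for name in names}
    for first, second, _ in edges:
        neighbours[first].add(second)
        neighbours[second].add(first)
    seen, groups = set(), []
    for name in sorted(names):
        if name in seen:
            continue
        stack, group = [name], set()
        while stack:  # depth-first walk over one component
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(neighbours[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical fingerprints (sets of on-bit indices), for illustration only.
fingerprints = {
    "parent": {1, 2, 3, 4},
    "product_a": {1, 2, 3, 5},
    "product_b": {1, 2, 5, 6},
    "unrelated": {9, 10},
}
edges = similarity_edges(fingerprints, threshold=0.5)
print(clusters(fingerprints, edges))
```

In this toy network `parent` and `product_b` fall below the threshold when compared directly, yet they end up in the same cluster through `product_a`, which is how chains of related compounds can hint at a candidate pathway.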
Affiliation(s)
- Lirong An
  - Analytical & Testing Center, Key Laboratory of Green Chemistry & Technology of Ministry of Education, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
- Bin Chen
  - Analytical & Testing Center, Key Laboratory of Green Chemistry & Technology of Ministry of Education, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
- Yuchen Zhang
  - Sichuan Provincial Key Laboratory of Universities on Environmental Science and Engineering, MOE Key Laboratory of Deep Earth Science and Engineering, College of Architecture and Environment, Sichuan University, Chengdu, Sichuan 610065, China
- Hailiang Li
  - Analytical & Testing Center, Key Laboratory of Green Chemistry & Technology of Ministry of Education, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
- Rongfu Huang
  - Sichuan Provincial Key Laboratory of Universities on Environmental Science and Engineering, MOE Key Laboratory of Deep Earth Science and Engineering, College of Architecture and Environment, Sichuan University, Chengdu, Sichuan 610065, China
- Feng Li
  - Analytical & Testing Center, Key Laboratory of Green Chemistry & Technology of Ministry of Education, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
- Yanan Tang
  - Analytical & Testing Center, Key Laboratory of Green Chemistry & Technology of Ministry of Education, College of Chemistry, Sichuan University, Chengdu, Sichuan 610064, China
10
Temizer AB, Uludoğan G, Özçelik R, Koulani T, Ozkirimli E, Ulgen KO, Karali N, Özgür A. Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties. Mol Inform 2024; 43:e202300249. [PMID: 38196065 DOI: 10.1002/minf.202300249] [Received: 09/19/2023] [Revised: 11/13/2023] [Accepted: 01/06/2024] [Indexed: 01/11/2024]
Abstract
Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
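The tf-idf step described above can be sketched with plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the chemical words are hypothetical hand-picked SMILES fragments rather than the output of a trained tokenizer, each target's ligands are flattened into one list of words, and the unsmoothed weighting tf × ln(N / df) is one common tf-idf variant that the abstract does not specify.

```python
import math
from collections import Counter


def tfidf(documents: dict) -> dict:
    """Score each chemical word per document with tf * ln(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))  # count each word once per document
    scores = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def top_words(word_scores: dict, k: int = 1):
    """Return the k highest-scoring words, ties broken alphabetically."""
    ranked = sorted(word_scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical documents: chemical words from the ligands of two targets.
documents = {
    "kinase_x": ["c1ccccc1", "C(=O)N", "c1ccncc1", "c1ccncc1"],
    "protease_y": ["c1ccccc1", "C(=O)N", "S(=O)(=O)N"],
}
scores = tfidf(documents)
for target in documents:
    print(target, top_words(scores[target]))
```

Words shared by every target, such as the benzene ring and the amide in this toy data, receive a weight of zero, so only target-specific words rise to the top.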
Affiliation(s)
- Asu Busra Temizer
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
  - Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
- Gökçe Uludoğan
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
- Rıza Özçelik
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
- Taha Koulani
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
  - Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
- Elif Ozkirimli
  - Science and Research Informatics, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Kutlu O Ulgen
  - Department of Chemical Engineering, Boğaziçi University, İstanbul, Turkey
- Nilgun Karali
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
- Arzucan Özgür
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
11
Jinsong S, Qifeng J, Xing C, Hao Y, Wang L. Molecular fragmentation as a crucial step in the AI-based drug development pathway. Commun Chem 2024; 7:20. [PMID: 38302655 PMCID: PMC10834946 DOI: 10.1038/s42004-024-01109-2] [Received: 06/15/2023] [Accepted: 01/19/2024] [Indexed: 02/03/2024] Open
Abstract
AI-based small-molecule drug discovery has become a significant trend at the intersection of computer science and life sciences. In the pursuit of novel compounds, fragment-based drug discovery has emerged as a novel approach. The Generative Pre-trained Transformer (GPT) model has showcased remarkable prowess across various domains, rooted in its pre-training and representation learning of fundamental linguistic units. Analogous to natural language, molecular encoding, as a form of chemical language, requires fragmentation aligned with specific chemical logic to be accurate. This review provides a comprehensive overview of the current state of the art in molecular fragmentation. We systematically summarize the approaches and applications of various molecular fragmentation techniques, with special emphasis on the characteristics and scope of applicability of each technique. We also provide an outlook on the current development trends of molecular fragmentation techniques, including some potential research directions and challenges.
Affiliation(s)
- Shao Jinsong
- Nantong University, School of Information Science and Technology, Nantong, China
| | - Jia Qifeng
- Nantong University, School of Information Science and Technology, Nantong, China
| | - Chen Xing
- Nantong University, School of Information Science and Technology, Nantong, China
| | - Yajie Hao
- Nantong University, School of Information Science and Technology, Nantong, China
| | - Li Wang
- Nantong University, Research Center for Intelligence Information Technology, Nantong, China.
|
12
|
Zhu J, Che C, Jiang H, Xu J, Yin J, Zhong Z. SSF-DDI: a deep learning method utilizing drug sequence and substructure features for drug-drug interaction prediction. BMC Bioinformatics 2024; 25:39. [PMID: 38262923 PMCID: PMC10810255 DOI: 10.1186/s12859-024-05654-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Accepted: 01/12/2024] [Indexed: 01/25/2024] Open
Abstract
BACKGROUND Drug-drug interactions (DDI) are prevalent in combination therapy, necessitating the importance of identifying and predicting potential DDI. While various artificial intelligence methods can predict and identify potential DDI, they often overlook the sequence information of drug molecules and fail to comprehensively consider the contribution of molecular substructures to DDI. RESULTS In this paper, we proposed a novel model for DDI prediction based on sequence and substructure features (SSF-DDI) to address these issues. Our model integrates drug sequence features and structural features from the drug molecule graph, providing enhanced information for DDI prediction and enabling a more comprehensive and accurate representation of drug molecules. CONCLUSION The results of experiments and case studies have demonstrated that SSF-DDI significantly outperforms state-of-the-art DDI prediction models across multiple real datasets and settings. SSF-DDI performs better in predicting DDI involving unknown drugs, resulting in a 5.67% improvement in accuracy compared to state-of-the-art methods.
Affiliation(s)
- Jing Zhu
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Chao Che
- School of Software Engineering, Dalian University, Dalian, 116000, China
| | - Hao Jiang
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China
| | - Jian Xu
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Jiajun Yin
- General Surgery, Affiliated Zhongshan Hospital of Dalian University, Dalian, 116000, China
| | - Zhaoqian Zhong
- Key Laboratory of Advanced Design and Intelligent Computing, Ministry of Education, Dalian University, Dalian, 116000, China.
|
13
|
Wei L, Fu N, Song Y, Wang Q, Hu J. Probabilistic generative transformer language models for generative design of molecules. J Cheminform 2023; 15:88. [PMID: 37749655 PMCID: PMC10518939 DOI: 10.1186/s13321-023-00759-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Accepted: 09/10/2023] [Indexed: 09/27/2023] Open
Abstract
Self-supervised neural language models have recently found wide applications in the generative design of organic molecules and protein sequences as well as representation learning for downstream structure classification and functional prediction. However, most of the existing deep learning models for molecule design usually require a big dataset and have a black-box architecture, which makes it difficult to interpret their design logic. Here we propose the Generative Molecular Transformer (GMTransformer), a probabilistic neural network model for generative design of molecules. Our model is built on the blank filling language model originally developed for text processing, which has demonstrated unique advantages in learning the "molecules grammars" with high-quality generation, interpretability, and data efficiency. Benchmarked on the MOSES datasets, our models achieve high novelty and Scaf compared to other baselines. The probabilistic generation steps have the potential in tinkering with molecule design due to their capability of recommending how to modify existing molecules with explanation, guided by the learned implicit molecule chemistry. The source code and datasets can be accessed freely at https://github.com/usccolumbia/GMTransformer.
Affiliation(s)
- Lai Wei
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201, USA
| | - Nihang Fu
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201, USA
| | - Yuqi Song
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201, USA
| | - Qian Wang
- Department of Chemistry and Biochemistry, University of South Carolina, Columbia, SC, 29201, USA
| | - Jianjun Hu
- Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, 29201, USA.
|
14
|
Ucak UV, Ashyrmamatov I, Lee J. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J Cheminform 2023; 15:55. [PMID: 37248531 PMCID: PMC10228139 DOI: 10.1186/s13321-023-00725-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 01/16/2023] [Accepted: 05/14/2023] [Indexed: 05/31/2023] Open
Abstract
Tokenization is an important preprocessing step in natural language processing that may have a significant influence on prediction quality. This research showed that the traditional SMILES tokenization has a certain limitation that results in tokens failing to reflect the true nature of molecules. To address this issue, we developed the atom-in-SMILES tokenization scheme that eliminates ambiguities in the generic nature of SMILES tokens. Our results in multiple chemical translation and molecular property prediction tasks demonstrate that proper tokenization has a significant impact on prediction quality. In terms of prediction accuracy and token degeneration, atom-in-SMILES is a more effective method in generating higher-quality SMILES sequences from AI-based chemical models compared to other tokenization and representation schemes. We investigated the degrees of token degeneration of various schemes and analyzed their adverse effects on prediction quality. Additionally, token-level repetitions were quantified, and generated examples were incorporated for qualitative examination. We believe that the atom-in-SMILES tokenization has a great potential to be adopted by broad related scientific communities, as it provides chemically accurate, tailor-made tokens for molecular property prediction, chemical translation, and molecular generative models.
Affiliation(s)
- Umit V Ucak
- Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Republic of Korea
| | | | - Juyong Lee
- Research Institute of Pharmaceutical Science, Seoul National University, Seoul, Republic of Korea.
|
15
|
Guha R, Velegol D. Harnessing Shannon entropy-based descriptors in machine learning models to enhance the prediction accuracy of molecular properties. J Cheminform 2023; 15:54. [PMID: 37211605 DOI: 10.1186/s13321-023-00712-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 12/07/2022] [Accepted: 03/18/2023] [Indexed: 05/23/2023] Open
Abstract
Accurate prediction of molecular properties is essential in the screening and development of drug molecules and other functional materials. Traditionally, property-specific molecular descriptors are used in machine learning models. This in turn requires the identification and development of target or problem-specific descriptors. Additionally, an increase in the prediction accuracy of the model is not always feasible from the standpoint of targeted descriptor usage. We explored the accuracy and generalizability issues using a framework of Shannon entropies, based on SMILES, SMARTS and/or InChiKey strings of respective molecules. Using various public databases of molecules, we showed that the accuracy of the prediction of machine learning models could be significantly enhanced simply by using Shannon entropy-based descriptors evaluated directly from SMILES. Analogous to partial pressures and total pressure of gases in a mixture, we used atom-wise fractional Shannon entropy in combination with total Shannon entropy from respective tokens of the string representation to model the molecule efficiently. The proposed descriptor was competitive in performance with standard descriptors such as Morgan fingerprints and SHED in regression models. Additionally, we found that either a hybrid descriptor set containing the Shannon entropy-based descriptors or an optimized, ensemble architecture of multilayer perceptrons and graph neural networks using the Shannon entropies was synergistic to improve the prediction accuracy. This simple approach of coupling the Shannon entropy framework to other standard descriptors and/or using it in ensemble models could find applications in boosting the performance of molecular property predictions in chemistry and material science.
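As a rough illustration of the idea in this abstract, the sketch below computes a total Shannon entropy and per-token contributions directly from a SMILES string. It is a minimal sketch, not the authors' implementation: the function names `shannon_entropy` and `fractional_entropies` are hypothetical, tokens are taken to be single characters (which mis-splits two-letter atoms such as Cl or Br), and the per-token contribution is only one plausible reading of "fractional" entropy.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Total Shannon entropy (in bits) of the token frequency distribution."""
    if not tokens:
        return 0.0
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def fractional_entropies(tokens):
    """Per-token contribution -p*log2(p); the contributions sum to the total.

    Assumption: this is an illustrative stand-in for the atom-wise
    fractional entropy mentioned in the abstract, not its exact definition.
    """
    n = len(tokens)
    counts = Counter(tokens)
    return {t: -(c / n) * math.log2(c / n) for t, c in counts.items()}


# Ethanol, tokenized character by character (a simplifying assumption).
tokens = list("CCO")
total = shannon_entropy(tokens)
parts = fractional_entropies(tokens)
print(round(total, 4), {t: round(v, 4) for t, v in parts.items()})
```

Descriptors of this kind are cheap to compute because they need only the string, which is presumably why they combine easily with fingerprints or graph-based features.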
Affiliation(s)
- Rajarshi Guha
- Intel Corporation, 2501 NE Century Blvd, Hillsboro, OR, 97124, USA.
| | - Darrell Velegol
- Department of Chemical Engineering, Pennsylvania State University, University Park, PA, 16802, USA
|
16
|
Kang JK, Lee D, Muambo KE, Choi JW, Oh JE. Development of an embedded molecular structure-based model for prediction of micropollutant treatability in a drinking water treatment plant by machine learning from three years monitoring data. Water Research 2023; 239:120037. [PMID: 37182312 DOI: 10.1016/j.watres.2023.120037] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2023] [Revised: 04/25/2023] [Accepted: 05/01/2023] [Indexed: 05/16/2023]
Abstract
In this study, an autoencoder-based molecular structure embedding model was developed to predict treatability of micropollutant in a drinking water treatment plant (DWTP) by machine learning using 69 micropollutants monitoring data at 18 DWTPs for three years. The molecular structure, which contains physicochemical characteristics, was embedded as a fixed-length vector that is advantageous for data-driven analysis and machine learning. First, the molecular structure of the micropollutants was converted to a sequence of tokens using the simplified molecular-input line-entry system (SMILES) pair encoding tokenizer, a frequency-based tokenization method. It was then compressed into fixed-length vectors using an autoencoder trained on various molecular structures within the Chemical Entities of Biological Interest. To validate the proposed models, a binary classification of micropollutant treatability was performed using the embedded molecular structure of micropollutants with various external features, such as concentration, season, and the presence of specific drinking water treatment processes by machine learning. The accuracy of the developed model for the 69 micropollutants in this study was 0.86, and the molecular structure was determined to be the most important feature. Furthermore, an accuracy of 0.71 was obtained in external validation for pharmaceuticals and personal care products that were not used for training. This shows that the proposed embedding vector can be generalized to unseen molecules during the training process, which means that it reflects the characteristics of the molecular structures.
Affiliation(s)
- Jin-Kyu Kang
- Institute for Environment and Energy, Pusan National University, Busan 46241, Republic of Korea
| | | | - Kimberly Etombi Muambo
- Department of Civil and Environmental Engineering, Pusan National University, Busan 46241, Republic of Korea
| | - Jae-Won Choi
- Department of Water Environmental Safety Management, K-water, Shintanjinro 200 Daeduck, Daejeon 34350, Republic of Korea
| | - Jeong-Eun Oh
- Institute for Environment and Energy, Pusan National University, Busan 46241, Republic of Korea; Department of Civil and Environmental Engineering, Pusan National University, Busan 46241, Republic of Korea.
|
17
|
Jaume-Santero F, Bornet A, Valery A, Naderi N, Vicente Alvarez D, Proios D, Yazdani A, Bournez C, Fessard T, Teodoro D. Transformer Performance for Chemical Reactions: Analysis of Different Predictive and Evaluation Scenarios. J Chem Inf Model 2023; 63:1914-1924. [PMID: 36952584 PMCID: PMC10091402 DOI: 10.1021/acs.jcim.2c01407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2023]
Abstract
The prediction of chemical reaction pathways has been accelerated by the development of novel machine learning architectures based on the deep learning paradigm. In this context, deep neural networks initially designed for language translation have been used to accurately predict a wide range of chemical reactions. Among models suited for the task of language translation, the recently introduced molecular transformer reached impressive performance in terms of forward-synthesis and retrosynthesis predictions. In this study, we first present an analysis of the performance of transformer models for product, reactant, and reagent prediction tasks under different scenarios of data availability and data augmentation. We find that the impact of data augmentation depends on the prediction task and on the metric used to evaluate the model performance. Second, we probe the contribution of different combinations of input formats, tokenization schemes, and embedding strategies to model performance. We find that less stable input settings generally lead to better performance. Lastly, we validate the superiority of round-trip accuracy over simpler evaluation metrics, such as top-k accuracy, using a committee of human experts and show a strong agreement for predictions that pass the round-trip test. This demonstrates the usefulness of more elaborate metrics in complex predictive scenarios and highlights the limitations of direct comparisons to a predefined database, which may include a limited number of chemical reaction pathways.
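The contrast this abstract draws between top-k accuracy and round-trip accuracy can be made concrete with a toy sketch. Everything here is illustrative: the function names, the placeholder strings standing in for SMILES, and the dictionary used as a "forward model" are hypothetical, and round-trip accuracy is taken in one common formulation (a target counts as solved if any of its top-k predicted precursor sets is mapped back to the target product by a forward-synthesis model), which may differ from the exact definition used in the paper.

```python
def top_k_accuracy(predictions, references, k):
    """Fraction of targets whose recorded precursors appear in the top-k list."""
    hits = sum(1 for preds, ref in zip(predictions, references) if ref in preds[:k])
    return hits / len(references)


def round_trip_accuracy(products, predictions, forward_model, k):
    """Fraction of targets for which some top-k prediction leads back to the product."""
    hits = sum(
        1
        for product, preds in zip(products, predictions)
        if any(forward_model(p) == product for p in preds[:k])
    )
    return hits / len(products)


# Toy data: placeholder labels instead of real SMILES strings.
products = ["P1", "P2"]
references = ["A.B", "C.D"]            # precursors recorded in the database
predictions = [["A.X", "A.B"], ["C.E"]]  # ranked model outputs per product

# Hypothetical forward model: a lookup table standing in for a trained network.
forward = {"A.B": "P1", "A.X": "P1", "C.D": "P2", "C.E": "P2"}.get

print(top_k_accuracy(predictions, references, k=1))                 # 0.0
print(round_trip_accuracy(products, predictions, forward, k=1))     # 1.0
```

In this toy case the top-1 predictions never match the database entry, yet both are judged valid routes by the forward model, which is the kind of discrepancy the abstract attributes to comparing against a predefined database with a limited number of recorded pathways.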
Affiliation(s)
- Fernando Jaume-Santero
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
| | - Alban Bornet
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
| | | | - Nona Naderi
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - David Vicente Alvarez
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
| | - Dimitrios Proios
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
| | - Anthony Yazdani
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
| | | | | | - Douglas Teodoro
- Department of Radiology and Medical Informatics, University of Geneva, 1205 Geneva, Switzerland
- Geneva School of Business Administration, HES-SO University of Applied Sciences and Arts of Western Switzerland, 1227 Geneva, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
|
18
|
Tysinger EP, Rai BK, Sinitskiy AV. Can We Quickly Learn to "Translate" Bioactive Molecules with Transformer Models? J Chem Inf Model 2023; 63:1734-1744. [PMID: 36914216 DOI: 10.1021/acs.jcim.2c01618] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 03/16/2023]
Abstract
Meaningful exploration of the chemical space of druglike molecules in drug design is a highly challenging task due to a combinatorial explosion of possible modifications of molecules. In this work, we address this problem with transformer models, a type of machine learning (ML) model originally developed for machine translation. By training transformer models on pairs of similar bioactive molecules from the public ChEMBL data set, we enable them to learn medicinal-chemistry-meaningful, context-dependent transformations of molecules, including those absent from the training set. By retrospective analysis on the performance of transformer models on ChEMBL subsets of ligands binding to COX2, DRD2, or HERG protein targets, we demonstrate that the models can generate structures identical or highly similar to most active ligands, despite the models having not seen any ligands active against the corresponding protein target during training. Our work demonstrates that human experts working on hit expansion in drug design can easily and quickly employ transformer models, originally developed to translate texts from one natural language to another, to "translate" from known molecules active against a given protein target to novel molecules active against the same target.
Affiliation(s)
- Emma P Tysinger
- Machine Learning and Computational Sciences, Pfizer Worldwide Research, Development, and Medical, 610 Main Street, Cambridge, Massachusetts 02139, United States
| | - Brajesh K Rai
- Machine Learning and Computational Sciences, Pfizer Worldwide Research, Development, and Medical, 610 Main Street, Cambridge, Massachusetts 02139, United States
| | - Anton V Sinitskiy
- Machine Learning and Computational Sciences, Pfizer Worldwide Research, Development, and Medical, 610 Main Street, Cambridge, Massachusetts 02139, United States
|
19
|
Jiang J, Zhang R, Ma J, Liu Y, Yang E, Du S, Zhao Z, Yuan Y. TranGRU: focusing on both the local and global information of molecules for molecular property prediction. Appl Intell 2022; 53:15246-15260. [PMID: 36405344 PMCID: PMC9662124 DOI: 10.1007/s10489-022-04280-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Accepted: 10/17/2022] [Indexed: 11/16/2022]
Abstract
Molecular property prediction is an essential but challenging task in drug discovery. The recurrent neural network (RNN) and Transformer are the mainstream methods for sequence modeling, and both have been successfully applied independently for molecular property prediction. As the local information and global information of molecules are very important for molecular properties, we aim to integrate the bi-directional gated recurrent unit (BiGRU) into the original Transformer encoder, together with self-attention to better capture local and global molecular information simultaneously. To this end, we propose the TranGRU approach, which encodes the local and global information of molecules by using the BiGRU and self-attention, respectively. Then, we use a gate mechanism to reasonably fuse the two molecular representations. In this way, we enhance the ability of the proposed model to encode both local and global molecular information. Compared to the baselines and state-of-the-art methods when treating each task as a single-task classification on Tox21, the proposed approach outperforms the baselines on 9 out of 12 tasks and state-of-the-art methods on 5 out of 12 tasks. TranGRU also obtains the best ROC-AUC scores on BBBP, FDA, LogP, and Tox21 (multitask classification) and has a comparable performance on ToxCast, BACE, and ecoli. On the whole, TranGRU achieves better performance for molecular property prediction. The source code is available in GitHub: https://github.com/Jiangjing0122/TranGRU.
Affiliation(s)
- Jing Jiang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
- Key Laboratory of China’s Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Baiyin Road, Lanzhou, 730030 Gansu China
| | - Ruisheng Zhang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Jun Ma
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Yunwu Liu
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Enjie Yang
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Shikang Du
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Zhili Zhao
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
| | - Yongna Yuan
- School of Information Science and Engineering, Lanzhou University, Tianshui Road, Lanzhou, 730000 Gansu China
|
20
|
Discovering design principles of collagen molecular stability using a genetic algorithm, deep learning, and experimental validation. Proc Natl Acad Sci U S A 2022; 119:e2209524119. [PMID: 36161946 DOI: 10.1073/pnas.2209524119] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 11/18/2022] Open
Abstract
Collagen is the most abundant structural protein in humans, providing crucial mechanical properties, including high strength and toughness, in tissues. Collagen-based biomaterials are, therefore, used for tissue repair and regeneration. Utilizing collagen effectively during materials processing ex vivo and subsequent function in vivo requires stability over wide temperature ranges to avoid denaturation and loss of structure, measured as melting temperature (Tm). Although significant research has been conducted on understanding how collagen primary amino acid sequences correspond to Tm values, a robust framework to facilitate the design of collagen sequences with specific Tm remains a challenge. Here, we develop a general model using a genetic algorithm within a deep learning framework to design collagen sequences with specific Tm values. We report 1,000 de novo collagen sequences, and we show that we can efficiently use this model to generate collagen sequences and verify their Tm values using both experimental and computational methods. We find that the model accurately predicts Tm values within a few degrees centigrade. Further, using this model, we conduct a high-throughput study to identify the most frequently occurring collagen triplets that can be directly incorporated into collagen. We further discovered that the number of hydrogen bonds within collagen calculated with molecular dynamics (MD) is directly correlated to the experimental measurement of triple-helical quality. Ultimately, we see this work as a critical step to helping researchers develop collagen sequences with specific Tm values for intended materials manufacturing methods and biomedical applications, realizing a mechanistic materials by design paradigm.
|
21
|
Jiang J, Zhang R, Zhao Z, Ma J, Liu Y, Yuan Y, Niu B. MultiGran-SMILES: multi-granularity SMILES learning for molecular property prediction. Bioinformatics 2022; 38:4573-4580. [PMID: 35961025 DOI: 10.1093/bioinformatics/btac550] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/24/2022] [Revised: 07/07/2022] [Accepted: 08/10/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Extracting useful molecular features is essential for molecular property prediction. Atom-level representation is a common representation of molecules, ignoring the sub-structure or branch information of molecules to some extent; however, it is vice versa for the substring-level representation. Both atom-level and substring-level representations may lose the neighborhood or spatial information of molecules. While molecular graph representation aggregating the neighborhood information of a molecule has a weak ability in expressing the chiral molecules or symmetrical structure. In this article, we aim to make use of the advantages of representations in different granularities simultaneously for molecular property prediction. To this end, we propose a fusion model named MultiGran-SMILES, which integrates the molecular features of atoms, sub-structures and graphs from the input. Compared with the single granularity representation of molecules, our method leverages the advantages of various granularity representations simultaneously and adjusts the contribution of each type of representation adaptively for molecular property prediction. RESULTS The experimental results show that our MultiGran-SMILES method achieves state-of-the-art performance on BBBP, LogP, HIV and ClinTox datasets. For the BACE, FDA and Tox21 datasets, the results are comparable with the state-of-the-art models. Moreover, the experimental results show that the gains of our proposed method are bigger for the molecules with obvious functional groups or branches. AVAILABILITY AND IMPLEMENTATION The code and data underlying this work are available on GitHub at https://github.com/Jiangjing0122/MultiGran. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jing Jiang
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
- Key Laboratory of China's Ethnic Languages and Information Technology of Ministry of Education, Northwest Minzu University, Lanzhou 730030, China
| | - Ruisheng Zhang
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
| | - Zhili Zhao
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
| | - Jun Ma
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
| | - Yunwu Liu
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
| | - Yongna Yuan
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
| | - Bojuan Niu
- School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
|
22
|
Deep learning methods for molecular representation and property prediction. Drug Discov Today 2022; 27:103373. [PMID: 36167282 DOI: 10.1016/j.drudis.2022.103373] [Citation(s) in RCA: 35] [Impact Index Per Article: 17.5] [Received: 07/07/2022] [Revised: 08/22/2022] [Accepted: 09/21/2022] [Indexed: 01/11/2023]
Abstract
With advances in artificial intelligence (AI) methods, computer-aided drug design (CADD) has developed rapidly in recent years. Effective molecular representation and accurate property prediction are crucial tasks in CADD workflows. In this review, we summarize contemporary applications of deep learning (DL) methods for molecular representation and property prediction. We categorize DL methods according to the format of molecular data (1D, 2D, and 3D). In addition, we discuss some common DL models, such as ensemble learning and transfer learning, and analyze the interpretability methods for these models. We also highlight the challenges and opportunities of DL methods for molecular representation and property prediction.
|
23
|
Gao Y, Chen S, Tong J, Fu X. Topology-enhanced molecular graph representation for anti-breast cancer drug selection. BMC Bioinformatics 2022; 23:382. [PMID: 36123643 PMCID: PMC9484163 DOI: 10.1186/s12859-022-04913-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/28/2022] [Accepted: 08/24/2022] [Indexed: 12/24/2022] Open
Abstract
Background Breast cancer is currently one of the cancers with a higher mortality rate in the world. The biological research on anti-breast cancer drugs focuses on the activity of estrogen receptors alpha (ERα), the pharmacokinetic properties and the safety of the compounds, which, however, is an expensive and time-consuming process. Developments of deep learning bring potential to efficiently facilitate the candidate drug selection against breast cancer. Methods In this paper, we propose an Anti-Breast Cancer Drug selection method utilizing Gated Graph Neural Networks (ABCD-GGNN) to topologically enhance the molecular representation of candidate drugs. By constructing atom-level graphs through atomic descriptors for each distinct compound, ABCD-GGNN can topologically learn both the implicit structure and substructure characteristics of a candidate drug and then integrate the representation with explicit discrete molecular descriptors to generate a molecule-level representation. As a result, the representation of ABCD-GGNN can inductively predict the ERα, the pharmacokinetic properties and the safety of each candidate drug. Finally, we design a ranking operator whose inputs are the predicted properties so as to statistically select the appropriate drugs against breast cancer. Results Extensive experiments conducted on our collected anti-breast cancer candidate drug dataset demonstrate that our proposed method outperforms all the other representative methods in the tasks of predicting ERα, and the pharmacokinetic properties and safety of the compounds. Extended result analysis demonstrates the efficiency and biological rationality of the operator we design to calculate the candidate drug ranking from the predicted properties. Conclusion In this paper, we propose the ABCD-GGNN representation method to efficiently integrate the topological structure and substructure features of the molecules with the discrete molecular descriptors. With a ranking operator applied, the predicted properties efficiently facilitate the candidate drug selection against breast cancer. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04913-6.
Affiliation(s)
- Yue Gao
- School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing, China
| | - Songling Chen
- School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing, China
| | - Junyi Tong
- School of Science, Beijing University of Posts and Telecommunications, Beijing, China
| | - Xiangling Fu
- School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing, China; Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing, China
|
24
|
Uludoğan G, Ozkirimli E, Ulgen KO, Karalı N, Özgür A. Exploiting pretrained biochemical language models for targeted drug design. Bioinformatics 2022; 38:ii155-ii161. [PMID: 36124801 DOI: 10.1093/bioinformatics/btac482] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION The development of novel compounds targeting proteins of interest is one of the most important tasks in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. Recently, target-specific molecule generation has been viewed as a translation between the protein language and the chemical language. However, such a model is limited by the availability of interacting protein-ligand pairs. On the other hand, large amounts of unlabelled protein sequences and chemical compounds are available and have been used to train language models that learn useful representations. In this study, we propose exploiting pretrained biochemical language models to initialize (i.e. warm start) targeted molecule generation models. We investigate two warm start strategies: (i) a one-stage strategy where the initialized model is trained on targeted molecule generation and (ii) a two-stage strategy containing a pre-finetuning on molecular generation followed by target-specific training. We also compare two decoding strategies to generate compounds: beam search and sampling. RESULTS The results show that the warm-started models perform better than a baseline model trained from scratch. The two proposed warm-start strategies achieve similar results to each other with respect to widely used metrics from benchmarks. However, docking evaluation of the generated compounds for a number of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling in both docking evaluation and benchmark metrics for assessing compound quality. AVAILABILITY AND IMPLEMENTATION The source code is available at https://github.com/boun-tabi/biochemical-lms-for-drug-design and the materials (i.e., data, models, and outputs) are archived in Zenodo at https://doi.org/10.5281/zenodo.6832145. 
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
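The two decoding strategies compared in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is at the repository cited above): the function `next_token_probs` and its probability table are hypothetical stand-ins for a trained generation model, and the three-token vocabulary is invented for illustration. Beam search keeps the `beam_width` most probable partial sequences at each step, whereas sampling draws each token at random from the model's distribution.

```python
import math
import random


def next_token_probs(prefix):
    """Hypothetical stand-in for a trained decoder: P(next token | prefix)."""
    if len(prefix) >= 3:
        return {"<eos>": 1.0}  # force termination after three tokens
    if not prefix:
        return {"C": 0.6, "N": 0.3, "O": 0.1}
    if prefix[-1] == "C":
        return {"C": 0.5, "O": 0.3, "<eos>": 0.2}
    return {"C": 0.7, "<eos>": 0.3}


def beam_search(beam_width=2, max_len=4):
    """Return up to beam_width finished sequences ranked by log-probability."""
    beams = [((), 0.0)]  # each beam is (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for tok, p in next_token_probs(tokens).items():
                cand = (tokens + (tok,), logp + math.log(p))
                # Sequences ending in <eos> leave the beam and are stored.
                (finished if tok == "<eos>" else candidates).append(cand)
        # Keep only the most probable unfinished prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if not beams:
            break
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_width]


def sample(rng):
    """Draw one sequence by sampling each token from the model distribution."""
    tokens = ()
    while not tokens or tokens[-1] != "<eos>":
        probs = next_token_probs(tokens)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens += (tok,)
    return tokens


if __name__ == "__main__":
    for tokens, logp in beam_search(beam_width=2):
        print("beam  ", "".join(tokens[:-1]), round(math.exp(logp), 3))
    rng = random.Random(0)
    for _ in range(3):
        print("sample", "".join(sample(rng)[:-1]))
```

Beam search is deterministic and favors high-probability outputs, while repeated sampling yields more varied sequences; which behaves better for a given model is an empirical question of the kind the abstract reports on.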
Affiliation(s)
- Gökçe Uludoğan
- Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey
| | - Elif Ozkirimli
- Data and Analytics Chapter, Pharma International Informatics, F. Hoffmann-La Roche AG 4303, Switzerland
| | - Kutlu O Ulgen
- Department of Chemical Engineering, Boğaziçi University, İstanbul 34342, Turkey
| | - Nilgün Karalı
- Faculty of Pharmacy, Department of Pharmaceutical Chemistry, İstanbul University, İstanbul 34116, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, İstanbul 34342, Turkey
|
25
|
Tang Q, Nie F, Zhao Q, Chen W. A merged molecular representation deep learning method for blood-brain barrier permeability prediction. Brief Bioinform 2022; 23:6674486. [PMID: 36002937 DOI: 10.1093/bib/bbac357] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 06/04/2022] [Revised: 07/27/2022] [Accepted: 07/30/2022] [Indexed: 12/30/2022] Open
Abstract
The ability of a compound to permeate across the blood-brain barrier (BBB) is a significant factor for central nervous system drug development. Thus, for speeding up the drug discovery process, it is crucial to perform high-throughput screenings to predict the BBB permeability of the candidate compounds. Although experimental methods are capable of determining BBB permeability, they are still cost-ineffective and time-consuming. To complement the shortcomings of existing methods, we present a deep learning-based multi-model framework model, called Deep-B3, to predict the BBB permeability of candidate compounds. In Deep-B3, the samples are encoded in three kinds of features, namely molecular descriptors and fingerprints, molecular graph and simplified molecular input line entry system (SMILES) text notation. The pre-trained models were built to extract latent features from the molecular graph and SMILES. These features depicted the compounds in terms of tabular data, image and text, respectively. The validation results yielded from the independent dataset demonstrated that the performance of Deep-B3 is superior to that of the state-of-the-art models. Hence, Deep-B3 holds the potential to become a useful tool for drug development. A freely available online web-server for Deep-B3 was established at http://cbcb.cdutcm.edu.cn/deepb3/, and the source code and dataset of Deep-B3 are available at https://github.com/GreatChenLab/Deep-B3.
Affiliation(s)
- Qiang Tang
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medical Science, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Qi Zhao
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, China
| | - Wei Chen
- State Key Laboratory of Southwestern Chinese Medicine Resources, School of Basic Medical Science, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China; School of Public Health, North China University of Science and Technology, Tangshan 063210, China
|
26
|
Zeng Y, Chen X, Peng D, Zhang L, Huang H. Multi-scaled self-attention for drug-target interaction prediction based on multi-granularity representation. BMC Bioinformatics 2022; 23:314. [PMID: 35922768 PMCID: PMC9347097 DOI: 10.1186/s12859-022-04857-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 07/22/2022] [Indexed: 11/21/2022] Open
Abstract
Background Drug–target interaction (DTI) prediction plays a crucial role in drug discovery. Although the advanced deep learning has shown promising results in predicting DTIs, it still needs improvements in two aspects: (1) encoding method, in which the existing encoding method, character encoding, overlooks chemical textual information of atoms with multiple characters and chemical functional groups; as well as (2) the architecture of deep model, which should focus on multiple chemical patterns in drug and target representations. Results In this paper, we propose a multi-granularity multi-scaled self-attention (SAN) model by alleviating the above problems. Specifically, in process of encoding, we investigate a segmentation method for drug and protein sequences and then label the segmented groups as the multi-granularity representations. Moreover, in order to enhance the various local patterns in these multi-granularity representations, a multi-scaled SAN is built and exploited to generate deep representations of drugs and targets. Finally, our proposed model predicts DTIs based on the fusion of these deep representations. Our proposed model is evaluated on two benchmark datasets, KIBA and Davis. The experimental results reveal that our proposed model yields better prediction accuracy than strong baseline models. Conclusion Our proposed multi-granularity encoding method and multi-scaled SAN model improve DTI prediction by encoding the chemical textual information of drugs and targets and extracting their various local patterns, respectively.
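The encoding problem named in this abstract, that character encoding splits atoms written with several characters, can be made concrete with a short Python sketch. The regular expression below is a commonly used style of SMILES tokenizer chosen here for illustration; it is an assumption and not the segmentation method proposed in the paper, and the function names are hypothetical.

```python
import re

# Illustrative pattern (an assumption, not the paper's method): bracket atoms,
# the two-letter halogens Br and Cl, and two-digit ring closures are kept
# whole; every other character becomes its own token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def char_tokens(smiles):
    """Character-level encoding: every character is a separate token."""
    return list(smiles)


def atom_tokens(smiles):
    """Atom-aware encoding: multi-character atoms stay intact."""
    return SMILES_TOKEN.findall(smiles)


if __name__ == "__main__":
    smiles = "Clc1ccccc1Br"  # 1-bromo-2-chlorobenzene
    print(char_tokens(smiles))  # 'Cl' and 'Br' are split into two characters
    print(atom_tokens(smiles))  # 'Cl' and 'Br' survive as single tokens
```

With character encoding, chlorine is presented to the model as a carbon-like `C` followed by `l`, which is the loss of chemical information the abstract refers to. Grouping at the level of functional groups, as the authors describe, would require a coarser segmentation than this sketch provides.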
Affiliation(s)
- Yuni Zeng
- School of Information Science and Technology, Zhejiang Sci-Tech University, Hangzhou, China
| | - Xiangru Chen
- College of Computer Science, Sichuan University, Chengdu, China
| | - Dezhong Peng
- College of Computer Science, Sichuan University, Chengdu, China; Shenzhen Peng Cheng Laboratory, Shenzhen, China; Chengdu Sobey Digital Technology Co., Ltd, Chengdu, China
| | - Lijun Zhang
- Sichuan Zhiqian Technology Co., Ltd, Chengdu, China; Chengdu Ruibei Yingte Information Technology Co., Ltd, Chengdu, China
| | - Haixiao Huang
- Sichuan Provincial Commission of Politics and Law, Chengdu, China
|
27
|
Sreenivasan AP, Harrison PJ, Schaal W, Matuszewski DJ, Kultima K, Spjuth O. Predicting protein network topology clusters from chemical structure using deep learning. J Cheminform 2022; 14:47. [PMID: 35841114 PMCID: PMC9284831 DOI: 10.1186/s13321-022-00622-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2021] [Accepted: 06/06/2022] [Indexed: 11/10/2022] Open
Abstract
Comparing chemical structures to infer protein targets and functions is a common approach, but basing comparisons on chemical similarity alone can be misleading. Here we present a methodology for predicting target protein clusters using deep neural networks. The model is trained on clusters of compounds based on similarities calculated from combined compound-protein and protein-protein interaction data using a network topology approach. We compare several deep learning architectures including both convolutional and recurrent neural networks. The best performing method, the recurrent neural network architecture MolPMoFiT, achieved an F1 score approaching 0.9 on a held-out test set of 8907 compounds. In addition, in-depth analysis on a set of eleven well-studied chemical compounds with known functions showed that predictions were justifiable for all but one of the chemicals. Four of the compounds, similar in their molecular structure but with dissimilarities in their function, revealed advantages of our method compared to using chemical similarity.
Affiliation(s)
- Akshai P Sreenivasan
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden; Department of Medical Sciences, Uppsala University, Uppsala, Sweden
| | - Philip J Harrison
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden
| | - Wesley Schaal
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden
| | - Damian J Matuszewski
- Centre for Image Analysis, Department of Information Technology, Uppsala University, Uppsala, Sweden
| | - Kim Kultima
- Department of Medical Sciences, Uppsala University, Uppsala, Sweden
| | - Ola Spjuth
- Department of Pharmaceutical Biosciences, Uppsala University, Box 591, 75124, Uppsala, Sweden
|
28
|
InflamNat: web-based database and predictor of anti-inflammatory natural products. J Cheminform 2022; 14:30. [PMID: 35659771 PMCID: PMC9167499 DOI: 10.1186/s13321-022-00608-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2021] [Accepted: 05/09/2022] [Indexed: 11/30/2022] Open
Abstract
Natural products (NPs) are a valuable source for anti-inflammatory drug discovery. However, they are limited by the unpredictability of the structures and functions. Therefore, computational and data-driven pre-evaluation could enable more efficient NP-inspired drug development. Since NPs possess structural features that differ from synthetic compounds, models trained with synthetic compounds may not perform well with NPs. There is also an urgent demand for well-curated databases and user-friendly predictive tools. We presented a comprehensive online web platform (InflamNat, http://www.inflamnat.com/ or http://39.104.56.4/) for anti-inflammatory natural product research. InflamNat is a database containing the physicochemical properties, cellular anti-inflammatory bioactivities, and molecular targets of 1351 NPs that were tested for anti-inflammatory activity. InflamNat provides two machine learning-based predictive tools specifically designed for NPs that (a) predict the anti-inflammatory activity of NPs, and (b) predict the compound-target relationship for compounds and targets collected in the database but lacking existing relationship data. A novel multi-tokenization transformer model (MTT) was proposed as the sequential encoder for both predictive tools to obtain a high-quality representation of sequential data. The experimental results showed that the proposed predictive tools achieved AUC values of 0.842 and 0.872 in the prediction of anti-inflammatory activity and compound-target interactions, respectively.
|
29
|
Godinez WJ, Ma EJ, Chao AT, Pei L, Skewes-Cox P, Canham SM, Jenkins JL, Young JM, Martin EJ, Guiguemde WA. Design of potent antimalarials with generative chemistry. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00448-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 01/08/2023]
|
30
|
ColGen: An end-to-end deep learning model to predict thermal stability of de novo collagen sequences. J Mech Behav Biomed Mater 2021; 125:104921. [PMID: 34758444 DOI: 10.1016/j.jmbbm.2021.104921] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/10/2021] [Accepted: 10/21/2021] [Indexed: 11/22/2022]
Abstract
Collagen is the most abundant structural protein in humans, with dozens of sequence variants accounting for over 30% of the protein in an animal body. The fibrillar and hierarchical arrangements of collagen are critical in providing mechanical properties with high strength and toughness. Due to this ubiquitous role in human tissues, collagen-based biomaterials are commonly used for tissue repairs and regeneration, requiring chemical and thermal stability over a range of temperatures during materials preparation ex vivo and subsequent utility in vivo. Collagen unfolds from a triple helix to a random coil structure during a temperature interval in which the midpoint or Tm is used as a measure to evaluate the thermal stability of the molecules. However, finding a robust framework to facilitate the design of a specific collagen sequence to yield a specific Tm remains a challenge, including using conventional molecular dynamics modeling. Here we propose a de novo framework to provide a model that outputs the Tm values of input collagen sequences by incorporating deep learning trained on a large data set of collagen sequences and corresponding Tm values. By using this framework, we are able to quickly evaluate how mutations and order in the primary sequence affect the stability of collagen triple helices. Specifically, we confirm that mutations to glycines, mutations in the middle of a sequence, and short sequence lengths cause the greatest drop in Tm values.
|
31
|
An X, Chen X, Yi D, Li H, Guan Y. Representation of molecules for drug response prediction. Brief Bioinform 2021; 23:6375515. [PMID: 34571534 DOI: 10.1093/bib/bbab393] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 04/08/2021] [Revised: 08/28/2021] [Accepted: 08/30/2021] [Indexed: 12/18/2022] Open
Abstract
The rapid development of machine learning and deep learning algorithms in the recent decade has spurred an outburst of their applications in many research fields. In the chemistry domain, machine learning has been widely used to aid in drug screening, drug toxicity prediction, quantitative structure-activity relationship prediction, anti-cancer synergy score prediction, etc. This review is dedicated to the application of machine learning in drug response prediction. Specifically, we focus on molecular representations, which is a crucial element to the success of drug response prediction and other chemistry-related prediction tasks. We introduce three types of commonly used molecular representation methods, together with their implementation and application examples. This review will serve as a brief introduction of the broad field of molecular representations.
Affiliation(s)
- Xin An
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Xi Chen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Daiyao Yi
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Hongyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
|