1
Michels J, Bandarupalli R, Akbari AA, Le T, Xiao H, Li J, Hom EFY. Natural Language Processing Methods for the Study of Protein-Ligand Interactions. ARXIV 2024:arXiv:2409.13057v2. [PMID: 39483353 PMCID: PMC11527106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases of existing datasets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
Affiliation(s)
- James Michels
  - Department of Computer Science, University of Mississippi, University, MS
- Ramya Bandarupalli
  - Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Amin Ahangar Akbari
  - Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Thai Le
  - Department of Computer Science, Indiana University, Bloomington, IN
- Hong Xiao
  - Department of Computer Science, University of Mississippi, University, MS
- Jing Li
  - Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
- Erik F Y Hom
  - Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, MS
2
Sun YY, Hsieh CY, Wen JH, Tseng TY, Huang JH, Oyang YJ, Huang HC, Juan HF. scDrug+: predicting drug-responses using single-cell transcriptomics and molecular structure. Biomed Pharmacother 2024; 177:117070. [PMID: 38964180 DOI: 10.1016/j.biopha.2024.117070] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/18/2024] [Accepted: 06/29/2024] [Indexed: 07/06/2024] Open
Abstract
Predicting drug responses based on individual transcriptomic profiles holds promise for refining prognosis and advancing precision medicine. Although many studies have endeavored to predict the responses of known drugs to novel transcriptomic profiles, research into predicting responses for newly discovered drugs remains sparse. In this study, we introduce scDrug+, a comprehensive pipeline that seamlessly integrates single-cell analysis with drug-response prediction. Importantly, scDrug+ is equipped to predict the response of new drugs by analyzing their molecular structures. The open-source tool is available as a Docker container, ensuring ease of deployment and reproducibility. It can be accessed at https://github.com/ailabstw/scDrugplus.
Affiliation(s)
- Yih-Yun Sun
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Taiwan AI Labs, Taipei 10351, Taiwan
- Jian-Hung Wen
  - Taiwan AI Labs, Taipei 10351, Taiwan; Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 11221, Taiwan
- Tzu-Yang Tseng
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Department of Life Science, National Taiwan University, Taipei 106, Taiwan
- Yen-Jen Oyang
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan
- Hsuan-Cheng Huang
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, Taipei 11221, Taiwan
- Hsueh-Fen Juan
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan; Taiwan AI Labs, Taipei 10351, Taiwan; Department of Life Science, National Taiwan University, Taipei 106, Taiwan; Center for Computational and Systems Biology, National Taiwan University, Taipei 106, Taiwan; Center for Advanced Computing and Imaging in Biomedicine, National Taiwan University, Taipei 106, Taiwan
3
Prabhu H, Bhosale H, Sane A, Dhadwal R, Ramakrishnan V, Valadi J. Protein feature engineering framework for AMPylation site prediction. Sci Rep 2024; 14:8695. [PMID: 38622194 PMCID: PMC11369087 DOI: 10.1038/s41598-024-58450-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Accepted: 03/29/2024] [Indexed: 04/17/2024] Open
Abstract
AMPylation is a biologically significant yet understudied post-translational modification in which an adenosine monophosphate (AMP) group is added, primarily to tyrosine and threonine residues. While recent work has illuminated the prevalence and functional impacts of AMPylation, experimental identification of AMPylation sites remains challenging. Computational prediction techniques provide a faster alternative approach. The predictive performance of machine learning models is highly dependent on the features used to represent the raw amino acid sequences. In this work, we introduce a novel feature extraction pipeline to encode the key properties relevant to AMPylation site prediction. We utilize a recently published dataset of curated AMPylation sites to develop our feature generation framework. We demonstrate the utility of our extracted features by training various machine learning classifiers on several numerical representations of the raw sequences extracted with the help of our framework. Tenfold cross-validation is used to evaluate each model's capability to distinguish between AMPylated and non-AMPylated sites. The top-performing set of extracted features achieved an MCC of 0.58, an accuracy of 0.8, an AUC-ROC of 0.85, and an F1 score of 0.73. Further, we use SHapley Additive exPlanations to elucidate the behaviour of the model on the set of features consisting of monogram and bigram counts for various representations.
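The monogram and bigram count features mentioned in this abstract can be illustrated with a minimal sketch. The window sequence and the helper name `ngram_counts` are assumptions made here for illustration; the abstract does not specify the window size or the exact representations used by the authors.

```python
from collections import Counter


def ngram_counts(sequence: str, n: int) -> Counter:
    """Count overlapping n-grams (n=1: monograms, n=2: bigrams) in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


# Hypothetical 11-residue window around a candidate threonine site
# (made up for illustration; not taken from the dataset in the paper).
window = "AKLGTTSVEKA"

monograms = ngram_counts(window, 1)  # single-residue counts
bigrams = ngram_counts(window, 2)    # adjacent-residue pair counts
```

Counts like these can be placed in a fixed-order vector (one slot per possible monogram or bigram) before being passed to a classifier.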
Affiliation(s)
- Hardik Prabhu
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
  - Robert Bosch Centre for Cyber Physical Systems, Indian Institute of Science, Bengaluru, 560012, India
- Aamod Sane
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
- Renu Dhadwal
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
- Vigneshwar Ramakrishnan
  - Bioinformatics Center, School of Chemical and Biotechnology, SASTRA Deemed to be University, Thanjavur, 613401, India
- Jayaraman Valadi
  - Computing and Data Sciences, FLAME University, Pune, 412115, India
4
Temizer AB, Uludoğan G, Özçelik R, Koulani T, Ozkirimli E, Ulgen KO, Karali N, Özgür A. Exploring data-driven chemical SMILES tokenization approaches to identify key protein-ligand binding moieties. Mol Inform 2024; 43:e202300249. [PMID: 38196065 DOI: 10.1002/minf.202300249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 11/13/2023] [Accepted: 01/06/2024] [Indexed: 01/11/2024]
Abstract
Machine learning models have found numerous successful applications in computational drug discovery. A large body of these models represents molecules as sequences since molecular sequences are easily available, simple, and informative. The sequence-based models often segment molecular sequences into pieces called chemical words, analogous to the words that make up sentences in human languages, and then apply advanced natural language processing techniques for tasks such as de novo drug design, property prediction, and binding affinity prediction. However, the chemical characteristics and significance of these building blocks, chemical words, remain unexplored. To address this gap, we employ data-driven SMILES tokenization techniques such as Byte Pair Encoding, WordPiece, and Unigram to identify chemical words and compare the resulting vocabularies. To understand the chemical significance of these words, we build a language-inspired pipeline that treats high affinity ligands of protein targets as documents and selects key chemical words making up those ligands based on tf-idf weighting. The experiments on multiple protein-ligand affinity datasets show that despite differences in words, lengths, and validity among the vocabularies generated by different subword tokenization algorithms, the identified key chemical words exhibit similarity. Further, we conduct case studies on a number of targets to analyze the impact of key chemical words on binding. We find that these key chemical words are specific to protein targets and correspond to known pharmacophores and functional groups. Our approach elucidates chemical properties of the words identified by machine learning models and can be used in drug discovery studies to determine significant chemical moieties.
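The tf-idf weighting step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the chemical words are taken as already tokenized, the target names and word lists are made up, and a plain `tf * log(N / df)` formula is used, which may differ from the exact variant in the paper.

```python
import math
from collections import Counter


def tfidf(documents: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Score each chemical word per target with tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of targets whose ligands contain the word.
    df = Counter()
    for words in documents.values():
        df.update(set(words))
    scores = {}
    for target, words in documents.items():
        tf = Counter(words)
        total = len(words)
        scores[target] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
    return scores


# Hypothetical "documents": chemical words pooled from the high-affinity
# ligands of each target (illustrative SMILES fragments only).
docs = {
    "target_A": ["c1ccccc1", "C(=O)N", "C(=O)N", "S(=O)(=O)N"],
    "target_B": ["c1ccccc1", "C(=O)O", "CCN"],
    "target_C": ["c1ccccc1", "CCN", "C(F)(F)F"],
}

scores = tfidf(docs)
top_word_A = max(scores["target_A"], key=scores["target_A"].get)
```

In this toy input the benzene ring appears for every target, so its score is zero, while the amide fragment that is frequent only for `target_A` ranks highest there. That is the intuition behind selecting target-specific key chemical words.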
Affiliation(s)
- Asu Busra Temizer
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
  - Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
- Gökçe Uludoğan
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
- Rıza Özçelik
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
- Taha Koulani
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
  - Department of Pharmaceutical Chemistry, Institute of Health Sciences, İstanbul University, İstanbul, Turkey
- Elif Ozkirimli
  - Science and Research Informatics, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Kutlu O Ulgen
  - Department of Chemical Engineering, Boğaziçi University, İstanbul, Turkey
- Nilgun Karali
  - Department of Pharmaceutical Chemistry, Faculty of Pharmacy, İstanbul University, İstanbul, Turkey
- Arzucan Özgür
  - Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
5
PSnpBind-ML: predicting the effect of binding site mutations on protein-ligand binding affinity. J Cheminform 2023; 15:31. [PMID: 36864534 PMCID: PMC9983232 DOI: 10.1186/s13321-023-00701-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2022] [Accepted: 02/17/2023] [Indexed: 03/04/2023] Open
Abstract
Protein mutations, especially those which occur in the binding site, play an important role in inter-individual drug response and may alter binding affinity and thus impact the drug's efficacy and side effects. Unfortunately, large-scale experimental screening of ligand binding against protein variants is still time-consuming and expensive. Alternatively, in silico approaches can play a role in guiding those experiments. Methods ranging from computationally cheaper machine learning (ML) to the more expensive molecular dynamics have been applied to accurately predict the mutation effects. However, these effects have been mostly studied on limited and small datasets, while ideally a large dataset of binding affinity changes due to binding site mutations is needed. In this work, we used the PSnpBind database with six hundred thousand docking experiments to train a machine learning model predicting protein-ligand binding affinity for both wild-type proteins and their variants with a single-point mutation in the binding site. A numerical representation of the protein, binding site, mutation, and ligand information was encoded using 256 features, half of which were manually selected based on domain knowledge. A machine learning approach composed of two regression models is proposed, the first predicting wild-type protein-ligand binding affinity and the second predicting the mutated protein-ligand binding affinity. The best performing models reported an RMSE value within 0.5 to 0.6 kcal/mol on an independent test set with an R2 value of 0.87 to 0.90. We report an improvement in the prediction performance compared to several reported models developed for protein-ligand binding affinity prediction. The obtained models can be used as a complementary method in early-stage drug discovery. They can be applied to rapidly obtain a better overview of the ligand binding affinity changes across protein variants carried by people in the population and narrow down the search space where more time-demanding methods can be used to identify potential leads that achieve a better affinity for all protein variants.
6
Lahorkar A, Bhosale H, Sane A, Ramakrishnan V, Jayaraman VK. Identification of Phase Separating Proteins With Distributed Reduced Alphabet Representations of Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:410-420. [PMID: 35139023 DOI: 10.1109/tcbb.2022.3149310] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Phase separation of proteins plays key roles in cellular physiology, including bacterial division and tumorigenesis. Consequently, understanding the molecular forces that drive phase separation has gained considerable attention, and several factors, including hydrophobicity and protein dynamics, have been implicated in phase separation. Data-driven identification of new phase separating proteins can enable in-depth understanding of cellular physiology and may pave the way towards developing novel methods of tackling disease progression. In this work, we exploit the existing wealth of data on phase separating proteins to develop a sequence-based machine learning method for prediction of phase separating proteins. We use reduced alphabet schemes based on hydrophobicity and conformational similarity along with distributed representation of protein sequences and biochemical properties as input features to Support Vector Machine (SVM) and Random Forest (RF) machine learning algorithms. We used both curated and balanced datasets for building the models. RF trained on the balanced dataset with hydropathy, conformational similarity embeddings and biochemical properties achieved an accuracy of 97%. Our work highlights the use of conformational similarity, a feature that reflects amino acid flexibility, and hydrophobicity for predicting phase separating proteins. Use of such "interpretable" features obtained from the ever-growing knowledgebase of phase separation is likely to improve prediction performances further.
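A reduced alphabet scheme maps the 20 amino acids onto a smaller set of group symbols before features are computed. The sketch below shows the idea only; the three-group partition, the group labels, and the example sequence are assumptions chosen for illustration and are not the hydropathy or conformational similarity schemes used in the paper.

```python
# Illustrative grouping only (assumption): H = hydrophobic, P = polar, C = charged.
GROUPS = {
    "H": "AVLIMFWC",
    "P": "STNQYGP",
    "C": "DEKRH",
}

# Invert the grouping into a residue -> group-symbol lookup table.
RESIDUE_TO_GROUP = {
    residue: group for group, members in GROUPS.items() for residue in members
}


def reduce_sequence(sequence: str, unknown: str = "X") -> str:
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(RESIDUE_TO_GROUP.get(aa, unknown) for aa in sequence.upper())


# Hypothetical short sequence, made up for this example.
reduced = reduce_sequence("MKTAYIAKQR")
```

The reduced string can then be split into overlapping k-mers and embedded, in the same way that words are embedded in text, to obtain a distributed representation.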
7
Yarish D, Garkot S, Grygorenko OO, Radchenko DS, Moroz YS, Gurbych O. Advancing molecular graphs with descriptors for the prediction of chemical reaction yields. J Comput Chem 2022; 44:76-92. [PMID: 36264601 DOI: 10.1002/jcc.27016] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/05/2022] [Revised: 08/31/2022] [Accepted: 09/05/2022] [Indexed: 11/08/2022]
Abstract
Chemical yield is the percentage of the reactants converted to the desired products. Chemists use predictive algorithms to select high-yielding reactions and score synthesis routes, saving time and reagents. This study suggests a novel graph neural network architecture for chemical yield prediction. The network combines structural information about participants of the transformation as well as molecular and reaction-level descriptors. It works with incomplete chemical reactions and generates reactants-product atom mapping. We show that the network benefits from advanced information by comparing it with several machine learning models and molecular representations. Models included logistic regression, support vector machine, CatBoost, and Bidirectional Encoder Representations from Transformers. Molecular representations included extended-connectivity fingerprints, Morgan fingerprints, SMILESVec embeddings, and textual representations. Classification and regression objectives were assessed for each model and feature set. The goal of each classification model was to separate zero- and non-zero-yielding reactions. The models were trained and evaluated on a proprietary dataset of 10 reaction types. Also, the models were benchmarked on two public single reaction type datasets. The study was supplemented with analysis of data, results, and errors, as well as the impact of steric factors, side reactions, isolation, and purification efficiency. The supplementary code is available at https://github.com/SoftServeInc/yield-paper.
Affiliation(s)
- Sofiya Garkot
  - SoftServe, Inc., Lviv, Ukraine; Ukrainian Catholic University, Lviv, Ukraine
- Oleksandr O Grygorenko
  - Enamine Ltd., Kyiv, Ukraine; Taras Shevchenko National University of Kyiv, Kyiv, Ukraine
- Dmytro S Radchenko
  - Enamine Ltd., Kyiv, Ukraine; Taras Shevchenko National University of Kyiv, Kyiv, Ukraine
- Yurii S Moroz
  - Taras Shevchenko National University of Kyiv, Kyiv, Ukraine; Chemspace LLC, Kyiv, Ukraine
- Oleksandr Gurbych
  - Lviv Polytechnic National University, Lviv, Ukraine; Blackthorn AI, Ltd., London, UK
8
Gene expression based inference of cancer drug sensitivity. Nat Commun 2022; 13:5680. [PMID: 36167836 PMCID: PMC9515171 DOI: 10.1038/s41467-022-33291-z] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 11/18/2021] [Accepted: 09/12/2022] [Indexed: 11/09/2022] Open
Abstract
Inter and intra-tumoral heterogeneity are major stumbling blocks in the treatment of cancer and are responsible for imparting differential drug responses in cancer patients. Recently, the availability of high-throughput screening datasets has paved the way for machine learning based personalized therapy recommendations using the molecular profiles of cancer specimens. In this study, we introduce Precily, a predictive modeling approach to infer treatment response in cancers using gene expression data. In this context, we demonstrate the benefits of considering pathway activity estimates in tandem with drug descriptors as features. We apply Precily on single-cell and bulk RNA sequencing data associated with hundreds of cancer cell lines. We then assess the predictability of treatment outcomes using our in-house prostate cancer cell line and xenografts datasets exposed to differential treatment conditions. Further, we demonstrate the applicability of our approach on patient drug response data from The Cancer Genome Atlas and an independent clinical study describing the treatment journey of three melanoma patients. Our findings highlight the importance of chemo-transcriptomics approaches in cancer treatment selection. Predicting treatment response in cancer remains a highly complex task. Here, the authors develop Precily, a deep neural network framework to predict treatment response in cancer by considering gene expression, pathway activity estimates and drug features, and test this method in multiple datasets and preclinical models.
9
Organizing the bacterial annotation space with amino acid sequence embeddings. BMC Bioinformatics 2022; 23:385. [PMID: 36151519 PMCID: PMC9502642 DOI: 10.1186/s12859-022-04930-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2022] [Accepted: 08/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Due to the ever-expanding gap between the number of proteins being discovered and their functional characterization, protein function inference remains a fundamental challenge in computational biology. Currently, known protein annotations are organized in human-curated ontologies; however, all possible protein functions may not be organized accurately. Meanwhile, recent advancements in natural language processing and machine learning have produced models which embed amino acid sequences as vectors in n-dimensional space. So far, these embeddings have primarily been used to classify protein sequences using manually constructed protein classification schemes. RESULTS: In this work, we describe the use of amino acid sequence embeddings as a systematic framework for studying protein ontologies. Using a sequence embedding, we show that the bacterial carbohydrate metabolism class within the SEED annotation system contains 48 clusters of embedded sequences despite this class containing 29 functional labels. Furthermore, by embedding Bacillus amino acid sequences with unknown functions, we show that these unknown sequences form clusters that are likely to have similar biological roles. CONCLUSIONS: This study demonstrates that amino acid sequence embeddings may be a powerful tool for developing more robust ontologies for annotating protein sequence data. In addition, embeddings may be beneficial for clustering protein sequences with unknown functions and selecting optimal candidate proteins to characterize experimentally.
10
Watanabe N, Yamamoto M, Murata M, Vavricka CJ, Ogino C, Kondo A, Araki M. Comprehensive Machine Learning Prediction of Extensive Enzymatic Reactions. J Phys Chem B 2022; 126:6762-6770. [PMID: 36053051 DOI: 10.1021/acs.jpcb.2c03287] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/04/2023]
Abstract
New enzyme functions exist within the increasing number of unannotated protein sequences. Novel enzyme discovery is necessary to expand the pathways that can be accessed by metabolic engineering for the biosynthesis of functional compounds. Accordingly, various machine learning models have been developed to predict enzymatic reactions. However, the ability to predict unknown reactions that are not included in the training data has not been clarified. In order to cover uncertain and unknown reactions, a wider range of reaction types must be demonstrated by the models. Here, we establish 16 expanded enzymatic reaction prediction models developed using various machine learning algorithms, including deep neural networks. Improvements in prediction performance over that of our previous study indicate that the updated methods are more effective for the prediction of enzymatic reactions. Overall, the deep neural network model trained with combined substrate-enzyme-product information exhibits the highest prediction accuracy, with Macro F1 scores up to 0.966 and with robust prediction of unknown enzymatic reactions that are not included in the training data. This model can predict more extensive enzymatic reactions in comparison to previously reported models. This study will facilitate the discovery of new enzymes for the production of useful substances.
Affiliation(s)
- Naoki Watanabe
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
- Masaki Yamamoto
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
- Masahiro Murata
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan
- Christopher J Vavricka
  - Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
- Chiaki Ogino
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan
- Akihiko Kondo
  - Department of Chemical Science and Engineering Graduate School of Engineering, Kobe University, 1-1 Rokkodai-cho, Nada, Kobe, Hyogo 657-8501, Japan; Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan
- Michihiro Araki
  - Graduate School of Medicine, Kyoto University, 54 Kawahara-cho, Shogoin Sakyo-ku, Kyoto 606-8507, Japan; Graduate School of Science, Technology and Innovation, Kobe University, 1-1 Rokkodai-cho, Nada-ku, Kobe 657-8501, Japan; National Institutes of Biomedical Innovation, Health and Nutrition, National Institute of Health and Nutrition, 1-23-1 Toyama, Shinjuku-ku, Tokyo 162-8638, Japan; National Cerebral and Cardiovascular Center, 6-1 Kishibe-Shinmachi, Suita, Osaka 564-8565, Japan
11
Jukič M, Bren U. Machine Learning in Antibacterial Drug Design. Front Pharmacol 2022; 13:864412. [PMID: 35592425 PMCID: PMC9110924 DOI: 10.3389/fphar.2022.864412] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 01/28/2022] [Accepted: 03/28/2022] [Indexed: 12/17/2022] Open
Abstract
Advances in computer hardware and the availability of high-performance supercomputing platforms and parallel computing, along with artificial intelligence methods are successfully complementing traditional approaches in medicinal chemistry. In particular, machine learning is gaining importance with the growth of the available data collections. One of the critical areas where this methodology can be successfully applied is in the development of new antibacterial agents. The latter is essential because of the high attrition rates in new drug discovery, both in industry and in academic research programs. Scientific involvement in this area is even more urgent as antibacterial drug resistance becomes a public health concern worldwide and pushes us increasingly into the post-antibiotic era. In this review, we focus on the latest machine learning approaches used in the discovery of new antibacterial agents and targets, covering both small molecules and antibacterial peptides. For the benefit of the reader, we summarize all applied machine learning approaches and available databases useful for the design of new antibacterial agents and address the current shortcomings.
Affiliation(s)
- Marko Jukič
  - Laboratory of Physical Chemistry and Chemical Thermodynamics, Faculty of Chemistry and Chemical Engineering, University of Maribor, Maribor, Slovenia
  - Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Koper, Slovenia
- Urban Bren
  - Laboratory of Physical Chemistry and Chemical Thermodynamics, Faculty of Chemistry and Chemical Engineering, University of Maribor, Maribor, Slovenia
  - Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Koper, Slovenia
12
Kalakoti Y, Yadav S, Sundar D. Deep Neural Network-Assisted Drug Recommendation Systems for Identifying Potential Drug-Target Interactions. ACS OMEGA 2022; 7:12138-12146. [PMID: 35449922 PMCID: PMC9016825 DOI: 10.1021/acsomega.2c00424] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/21/2022] [Accepted: 03/18/2022] [Indexed: 06/14/2023]
Abstract
In silico methods to identify novel drug-target interactions (DTIs) have gained significant importance over conventional techniques, which are labor-intensive and low-throughput. Here, we present a machine learning-based multiclass classification workflow that segregates drug-target pairs into active, inactive, and intermediate interactions. Drug molecules, protein sequences, and molecular descriptors were transformed into machine-interpretable embeddings to extract critical features from standard datasets. Tools such as the CHEMBL web resource, iFeature, and an in-house developed deep neural network-assisted drug recommendation (dNNDR)-featx were employed for data retrieval and processing. The models were trained with large-scale DTI datasets and showed an improvement in performance over baseline methods. External validation results showed that models based on att-biLSTM and gCNN could help predict novel DTIs. When tested with a completely different dataset, the proposed models significantly outperformed competing methods. The validity of novel interactions predicted by dNNDR was backed by experimental and computational evidence in the literature. The proposed methodology could elucidate critical features that govern the relationship between a drug and its target.
Affiliation(s)
- Yogesh Kalakoti
  - DAILAB, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110 016, India
- Shashank Yadav
  - DAILAB, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110 016, India
- Durai Sundar
  - DAILAB, Department of Biochemical Engineering & Biotechnology, Indian Institute of Technology (IIT) Delhi, New Delhi 110 016, India
  - School of Artificial Intelligence, Indian Institute of Technology (IIT) Delhi, New Delhi 110 016, India
13
Moon S, Zhung W, Yang S, Lim J, Kim WY. PIGNet: a physics-informed deep learning model toward generalized drug-target interaction predictions. Chem Sci 2022; 13:3661-3673. [PMID: 35432900 PMCID: PMC8966633 DOI: 10.1039/d1sc06946b] [Citation(s) in RCA: 70] [Impact Index Per Article: 23.3] [Received: 12/13/2021] [Accepted: 02/06/2022] [Indexed: 12/21/2022] Open
Abstract
Recently, deep neural network (DNN)-based drug-target interaction (DTI) models were highlighted for their high accuracy with affordable computational costs. Yet, the models' insufficient generalization remains a challenging problem in the practice of in silico drug discovery. We propose two key strategies to enhance generalization in the DTI model. The first is to predict the atom-atom pairwise interactions via physics-informed equations parameterized with neural networks and to provide the total binding affinity of a protein-ligand complex as their sum. The second is to improve model generalization further by augmenting the training data with a broader range of binding poses and ligands. We validated our model, PIGNet, in the comparative assessment of scoring functions (CASF) 2016, demonstrating docking and screening powers that outperform previous methods. Our physics-informing strategy also enables the interpretation of predicted affinities by visualizing the contribution of ligand substructures, providing insights for further ligand optimization.
Affiliation(s)
- Seokhyun Moon
  - Department of Chemistry, KAIST 291 Daehak-ro, Yuseong-gu Daejeon 34141 Republic of Korea
- Wonho Zhung
  - Department of Chemistry, KAIST 291 Daehak-ro, Yuseong-gu Daejeon 34141 Republic of Korea
- Soojung Yang
  - Department of Chemistry, KAIST 291 Daehak-ro, Yuseong-gu Daejeon 34141 Republic of Korea
- Jaechang Lim
  - HITS Incorporation 124 Teheran-ro, Gangnam-gu Seoul 06234 Republic of Korea
- Woo Youn Kim
  - Department of Chemistry, KAIST 291 Daehak-ro, Yuseong-gu Daejeon 34141 Republic of Korea
  - HITS Incorporation 124 Teheran-ro, Gangnam-gu Seoul 06234 Republic of Korea
  - KI for Artificial Intelligence, KAIST 291 Daehak-ro, Yuseong-gu Daejeon 34141 Republic of Korea
14
Wan X, Wu X, Wang D, Tan X, Liu X, Fu Z, Jiang H, Zheng M, Li X. An inductive graph neural network model for compound-protein interaction prediction based on a homogeneous graph. Brief Bioinform 2022; 23:6547264. [PMID: 35275993 PMCID: PMC9310259 DOI: 10.1093/bib/bbac073] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/03/2021] [Revised: 02/09/2022] [Accepted: 02/11/2022] [Indexed: 01/10/2023]
Abstract
Identifying the potential compound–protein interactions (CPIs) plays an essential role in drug development. The computational approaches for CPI prediction can reduce time and costs of experimental methods and have benefited from the continuously improved graph representation learning. However, most of the network-based methods use heterogeneous graphs, which is challenging due to their complex structures and heterogeneous attributes. Therefore, in this work, we transformed the compound–protein heterogeneous graph to a homogeneous graph by integrating the ligand-based protein representations and overall similarity associations. We then proposed an Inductive Graph AggrEgator-based framework, named CPI-IGAE, for CPI prediction. CPI-IGAE learns the low-dimensional representations of compounds and proteins from the homogeneous graph in an end-to-end manner. The results show that CPI-IGAE performs better than some state-of-the-art methods. Further ablation study and visualization of embeddings reveal the advantages of the model architecture and its role in feature extraction, and some of the top ranked CPIs by CPI-IGAE have been validated by a review of recent literature. The data and source codes are available at https://github.com/wanxiaozhe/CPI-IGAE.
Affiliation(s)
- Xiaozhe Wan
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Xiaolong Wu
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Dingyan Wang
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Xiaohong Liu
  - AlphaMa Inc., No. 108, Yuxin Road, Suzhou Industrial Park, Suzhou 215128, China
- Zunyun Fu
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- Hualiang Jiang
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China; School of Life Science and Technology, ShanghaiTech University, 393 Huaxiazhong Road, Shanghai 200031, China
- Mingyue Zheng
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
- Xutong Li
  - State Key Laboratory of Drug Research, Drug Discovery and Design Center, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China; University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China
15
Li WX, Tong X, Yang PP, Zheng Y, Liang JH, Li GH, Liu D, Guan DG, Dai SX. Screening of antibacterial compounds with novel structure from the FDA approved drugs using machine learning methods. Aging (Albany NY) 2022; 14:1448-1472. [PMID: 35150482 PMCID: PMC8876917 DOI: 10.18632/aging.203887] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/27/2021] [Accepted: 01/28/2022] [Indexed: 11/25/2022]
Abstract
Bacterial infection is one of the most important factors affecting the human life span. Elderly people are more harmed by bacterial infections due to their deficits in immunity. Because of the lack of new antibiotics in recent years, bacterial resistance has increasingly become a serious problem globally. In this study, an antibacterial compound predictor was constructed using the support vector machines and random forest methods and the data of the active and inactive antibacterial compounds from the ChEMBL database. The results showed that both models have excellent prediction performance (mean accuracy >0.9 and mean AUC >0.9 for the two models). We used the predictor to screen potential antibacterial compounds from FDA-approved drugs in the DrugBank database. The screening results showed that 1087 small-molecule drugs have potential antibacterial activity and 154 of them are FDA-approved antibacterial drugs, which accounts for 76.2% of the approved antibacterial drugs collected in this study. Through molecular fingerprint similarity analysis and common substructure analysis, we screened 8 predicted antibacterial small-molecule compounds with novel structures compared with known antibacterial drugs, and 5 of them are widely used in the treatment of various tumors. This study provides a new insight for predicting antibacterial compounds by using approved drugs, the predicted compounds might be used to treat bacterial infections and extend lifespan.
Affiliation(s)
- Wen-Xing Li
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, Guangdong, China; Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Southern Medical University, Guangzhou 510515, Guangdong, China
- Xin Tong
  - State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Peng-Peng Yang
  - State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Yang Zheng
  - State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Ji-Hao Liang
  - State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
- Gong-Hua Li
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, Yunnan, China
- Dahai Liu
  - School of Medicine, Foshan University, Foshan 528000, Guangdong, China
- Dao-Gang Guan
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Southern Medical University, Guangzhou 510515, Guangdong, China; Guangdong Provincial Key Laboratory of Single Cell Technology and Application, Southern Medical University, Guangzhou 510515, Guangdong, China
- Shao-Xing Dai
  - State Key Laboratory of Primate Biomedical Research, Institute of Primate Translational Medicine, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
16
Transformational machine learning: Learning how to learn from many related scientific problems. Proc Natl Acad Sci U S A 2021; 118:2108013118. [PMID: 34845013 PMCID: PMC8670494 DOI: 10.1073/pnas.2108013118] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Accepted: 10/07/2021] [Indexed: 11/18/2022]
Abstract
Machine learning (ML) is the branch of artificial intelligence (AI) that develops computational systems that learn from experience. In supervised ML, the ML system generalizes from labelled examples to learn a model that can predict the labels of unseen examples. Examples are generally represented using features that directly describe the examples. For instance, in drug design, ML uses features that describe molecular shape and so on. In cases where there are multiple related ML problems, it is possible to use a different type of feature: predictions made about the examples by ML models learned on other problems. We call this transformational ML. We show that this results in better predictions and improved understanding when applied to scientific problems. Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. 
In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
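The feature transformation described in this abstract can be illustrated with a minimal sketch. It assumes NumPy, synthetic data, and closed-form ridge regression as a stand-in learner; the abstract applies TML to nonlinear methods, so the linear model here is chosen only for brevity. The names `fit_ridge`, `predict`, and `to_extrinsic` are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression; the last weight is the bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def predict(w, X):
    """Apply a model returned by fit_ridge to intrinsic features X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def to_extrinsic(related_models, X):
    """Extrinsic representation: one column per related-task prediction."""
    return np.column_stack([predict(w, X) for w in related_models])

rng = np.random.default_rng(0)
n_features = 5

# Step 1: train one model per related task (three synthetic tasks here).
related_models = []
for _ in range(3):
    X_task = rng.normal(size=(50, n_features))
    y_task = X_task @ rng.normal(size=n_features)
    related_models.append(fit_ridge(X_task, y_task))

# Step 2: describe the new task's examples by the related models' predictions.
X_new = rng.normal(size=(20, n_features))
y_new = X_new @ rng.normal(size=n_features)
Z_new = to_extrinsic(related_models, X_new)

# Step 3: learn the new task on the extrinsic features.
tml_model = fit_ridge(Z_new, y_new)
```

Each example of the new task is represented by three numbers, one per related-task model, rather than by its five intrinsic features.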
17
Sabando MV, Ponzoni I, Milios EE, Soto AJ. Using molecular embeddings in QSAR modeling: does it make a difference? Brief Bioinform 2021; 23:6366344. [PMID: 34498670 DOI: 10.1093/bib/bbab365] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 05/17/2021] [Revised: 07/29/2021] [Accepted: 08/18/2021] [Indexed: 11/13/2022]
Abstract
With the consolidation of deep learning in drug discovery, several novel algorithms for learning molecular representations have been proposed. Despite the interest of the community in developing new methods for learning molecular embeddings and their theoretical benefits, comparing molecular embeddings with each other and with traditional representations is not straightforward, which in turn hinders the process of choosing a suitable representation for Quantitative Structure-Activity Relationship (QSAR) modeling. A reason behind this issue is the difficulty of conducting a fair and thorough comparison of the different existing embedding approaches, which requires numerous experiments on various datasets and training scenarios. To close this gap, we reviewed the literature on methods for molecular embeddings and reproduced three unsupervised and two supervised molecular embedding techniques recently proposed in the literature. We compared these five methods concerning their performance in QSAR scenarios using different classification and regression datasets. We also compared these representations to traditional molecular representations, namely molecular descriptors and fingerprints. As opposed to the expected outcome, our experimental setup consisting of over 25 000 trained models and statistical tests revealed that the predictive performance using molecular embeddings did not significantly surpass that of traditional representations. Although supervised embeddings yielded competitive results compared with those using traditional molecular representations, unsupervised embeddings tended to perform worse than traditional representations. Our results highlight the need for conducting a careful comparison and analysis of the different embedding techniques prior to using them in drug design tasks and motivate a discussion about the potential of molecular embeddings in computer-aided drug design.
Affiliation(s)
- Ignacio Ponzoni
  - Institute for Computer Science and Engineering, UNS-CONICET, Bahía Blanca, Argentina; Department of Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina
- Axel J Soto
  - Institute for Computer Science and Engineering, UNS-CONICET, Bahía Blanca, Argentina; Department of Computer Science and Engineering, Universidad Nacional del Sur, Bahía Blanca, Argentina
18
Khanal J, Tayara H, Zou Q, Chong KT. Identifying DNA N4-methylcytosine sites in the rosaceae genome with a deep learning model relying on distributed feature representation. Comput Struct Biotechnol J 2021; 19:1612-1619. [PMID: 33868598 PMCID: PMC8042287 DOI: 10.1016/j.csbj.2021.03.015] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/09/2021] [Revised: 03/12/2021] [Accepted: 03/13/2021] [Indexed: 12/11/2022]
Abstract
DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. To identify 4mC sites, previous computational studies mostly focused on finding hand-crafted features. This area of research, therefore, would benefit from the development of a computational approach that relies on automatic feature selection to identify relevant sites. We here report 4mC-w2vec, a computational method that learned automatic feature discrimination in the Rosaceae genomes, especially in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation and through the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processed 4mC and non-4mC sites through a word embedding process, including sub-word information of its biological words through k-mer, which then served as features that were fed into a double layer of convolutional neural network (CNN) to classify whether the sample sequences contained 4mCs or non-4mCs sites. Our tool demonstrated performance superior to current tools that use the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
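The k-mer step mentioned in this abstract, which turns a sequence into "biological words" before embedding, might look like the sketch below. The function name `kmer_words`, the default `k=3`, and the stride of one are assumptions made for illustration; the abstract does not state the values the authors used.

```python
def kmer_words(sequence, k=3):
    """Split a sequence into overlapping k-mers ("biological words").

    Assumes a sliding window with stride 1; returns an empty list
    when the sequence is shorter than k.
    """
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A short DNA fragment becomes a "sentence" of 3-letter words that a
# word-embedding model such as word2vec could consume.
words = kmer_words("ACGTAC", k=3)
print(words)  # ['ACG', 'CGT', 'GTA', 'TAC']
```

A sequence of length n yields n - k + 1 words, so the choice of k trades vocabulary size against sentence length.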
Affiliation(s)
- Jhabindra Khanal
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
19
Majumdar S, Nandi SK, Ghosal S, Ghosh B, Mallik W, Roy ND, Biswas A, Mukherjee S, Pal S, Bhattacharyya N. Deep Learning-Based Potential Ligand Prediction Framework for COVID-19 with Drug-Target Interaction Model. Cognit Comput 2021:1-13. [PMID: 33552306 PMCID: PMC7852055 DOI: 10.1007/s12559-021-09840-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/18/2020] [Accepted: 01/15/2021] [Indexed: 11/11/2022]
Abstract
To fight against the present pandemic scenario of COVID-19 outbreak, medication with drugs and vaccines is extremely essential other than ventilation support. In this paper, we present a list of ligands which are expected to have the highest binding affinity with the S-glycoprotein of 2019-nCoV and thus can be used to make the drug for the novel coronavirus. Here, we implemented an architecture using 1D convolutional networks to predict drug-target interaction (DTI) values. The network was trained on the KIBA (Kinase Inhibitor Bioactivity) dataset. With this network, we predicted the KIBA scores (which gives a measure of binding affinity) of a list of ligands against the S-glycoprotein of 2019-nCoV. Based on these KIBA scores, we are proposing a list of ligands (33 top ligands based on best interactions) which have a high binding affinity with the S-glycoprotein of 2019-nCoV and thus can be used for the formation of drugs.
Affiliation(s)
- Shatadru Majumdar
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Soumik Kumar Nandi
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Shuvam Ghosal
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Bavrabi Ghosh
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Writam Mallik
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Nilanjana Dutta Roy
  - Department of Computer Science and Engineering, Institute of Engineering and Management, Kolkata, India
- Arindam Biswas
  - Department of Information Technology, Indian Institute of Engineering Science and Technology, Shibpur, India
- Subhankar Mukherjee
  - Agri and Environmental Electronics (AEE), Centre for Development of Advanced Computing, Kolkata, India
- Souvik Pal
  - Agri and Environmental Electronics (AEE), Centre for Development of Advanced Computing, Kolkata, India
- Nabarun Bhattacharyya
  - Agri and Environmental Electronics (AEE), Centre for Development of Advanced Computing, Kolkata, India
20
Oztekin A, Karagoz K, Adem S, Comakli V. Enhancing bactericidal strategy with selected aromatic compounds: in vitro and in silico study. J Biomol Struct Dyn 2021; 40:5547-5555. [DOI: 10.1080/07391102.2021.1871864] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/27/2023]
Affiliation(s)
- Aykut Oztekin
  - Medical Services and Techniques Department, Vocational School of Health Services, Agri Ibrahim Cecen University, Agri, Turkey
- Kenan Karagoz
  - Molecular Biology and Genetics Department, Faculty of Science and Literature, Agri Ibrahim Cecen University, Agri, Turkey
- Sevki Adem
  - Department of Chemistry, Faculty of Science, Cankiri Karatekin University, Cankiri, Turkey
- Veysel Comakli
  - Nutrition and Dietetics Department, High School of Health, Agri Ibrahim Cecen University, Agri, Turkey
21
Predicting Drug-Drug Interactions from Heterogeneous Data: An Embedding Approach. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-77211-6_28] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
22
Özçelik R, Öztürk H, Özgür A, Ozkirimli E. ChemBoost: A Chemical Language Based Approach for Protein - Ligand Binding Affinity Prediction. Mol Inform 2020; 40:e2000212. [PMID: 33225594 DOI: 10.1002/minf.202000212] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 08/15/2020] [Accepted: 11/20/2020] [Indexed: 11/07/2022]
Abstract
Identification of high affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical language based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. Our aim is to process the patterns in SMILES as a language to predict protein-ligand affinity, even when we cannot infer the function from the sequence. We used eXtreme Gradient Boosting to predict protein-ligand affinities in KIBA and BindingDB data sets. ChemBoost was able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost was more robust to the changes in protein sequence similarity and successfully captured the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.
Affiliation(s)
- Rıza Özçelik
  - Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
- Hakime Öztürk
  - Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
- Arzucan Özgür
  - Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
- Elif Ozkirimli
  - Department of Chemical Engineering, Boğaziçi University, Istanbul, Turkey; Data and Analytics Chapter, Pharma International Informatics, F. Hoffmann-La Roche AG, Switzerland
23
Wu S, Liu C, Feng J, Yang A, Guo F, Qiao J. QSIdb: quorum sensing interference molecules. Brief Bioinform 2020; 22:5916938. [PMID: 33003203 DOI: 10.1093/bib/bbaa218] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 06/09/2020] [Revised: 08/15/2020] [Accepted: 08/17/2020] [Indexed: 12/30/2022]
Abstract
Quorum sensing interference (QSI), the disruption and manipulation of quorum sensing (QS) in the dynamic control of bacteria populations could be widely applied in synthetic biology to realize dynamic metabolic control and develop potential clinical therapies. Conventionally, limited QSI molecules (QSIMs) were developed based on molecular structures or for specific QS receptors, which are in short supply for various interferences and manipulations of QS systems. In this study, we developed QSIdb (http://qsidb.lbci.net/), a specialized repository of 633 reported QSIMs and 73 073 expanded QSIMs including both QS agonists and antagonists. We have collected all reported QSIMs in literatures focused on the modifications of N-acyl homoserine lactones, natural QSIMs and synthetic QS analogues. Moreover, we developed a pipeline with SMILES-based similarity assessment algorithms and docking-based validations to mine potential QSIMs from existing 138 805 608 compounds in the PubChem database. In addition, we proposed a new measure, pocketedit, for assessing the similarities of active protein pockets or QSIMs crosstalk, and obtained 273 possible potential broad-spectrum QSIMs. We provided user-friendly browsing and searching facilities for easy data retrieval and comparison. QSIdb could assist the scientific community in understanding QS-related therapeutics, manipulating QS-based genetic circuits in metabolic engineering, developing potential broad-spectrum QSIMs and expanding new ligands for other receptors.
Affiliation(s)
- Shengbo Wu
  - School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Chunjiang Liu
  - State Key Laboratory of Chemical Engineering, Tianjin University, Tianjin, China
- Jie Feng
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Aidong Yang
  - Department of Engineering Science, University of Oxford, Oxford, UK
- Fei Guo
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jianjun Qiao
  - Key Laboratory of Systems Bioengineering, Ministry of Education (Tianjin University) and Frontiers Science Center for Synthetic Biology (Ministry of Education), Tianjin University, Tianjin, China
24
Öztürk H, Özgür A, Schwaller P, Laino T, Ozkirimli E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov Today 2020; 25:689-705. [DOI: 10.1016/j.drudis.2020.01.020] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 09/17/2019] [Revised: 12/20/2019] [Accepted: 01/28/2020] [Indexed: 01/06/2023]
25
Le NQK, Huynh TT. Identifying SNAREs by Incorporating Deep Learning Architecture and Amino Acid Embedding Representation. Front Physiol 2019; 10:1501. [PMID: 31920706 PMCID: PMC6914855 DOI: 10.3389/fphys.2019.01501] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 03/28/2019] [Accepted: 11/26/2019] [Indexed: 12/12/2022]
Abstract
SNAREs (soluble N-ethylmaleimide-sensitive factor activating protein receptors) are a group of proteins that are crucial for membrane fusion and exocytosis of neurotransmitters from the cell. They play an important role in a broad range of cell processes, including cell growth, cytokinesis, and synaptic transmission, to promote cell membrane integration in eukaryotes. Many studies determined that SNARE proteins have been associated with a lot of human diseases, especially in cancer. Therefore, identifying their functions is a challenging problem for scientists to better understand the cancer disease as well as design the drug targets for treatment. We described each protein sequence based on the amino acid embeddings using fastText, which is a natural language processing model performing well in its field. Because each protein sequence is similar to a sentence with different words, applying language model into protein sequence is challenging and promising. After generating, the amino acid embedding features were fed into a deep learning algorithm for prediction. Our model which combines fastText model and deep convolutional neural networks could identify SNARE proteins with an independent test accuracy of 92.8%, sensitivity of 88.5%, specificity of 97%, and Matthews correlation coefficient (MCC) of 0.86. Our performance results were superior to the state-of-the-art predictor (SNARE-CNN). We suggest this study as a reliable method for biologists for SNARE identification and it serves a basis for applying fastText word embedding model into bioinformatics, especially in protein sequencing prediction.
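The four metrics reported in this abstract (accuracy, sensitivity, specificity, and MCC) all derive from a binary confusion matrix. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical and are not the paper's data.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, and MCC from counts.

    Ratios with a zero denominator are reported as 0.0 (an assumption
    made here to keep the sketch total).
    """
    def ratio(num, den):
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Illustrative counts only.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
```

MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the classes are imbalanced.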
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- Tuan-Tu Huynh
  - Department of Electrical Electronic and Mechanical Engineering, Lac Hong University, Bien Hoa, Vietnam
  - Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan
26
Le NQK, Yapp EKY, Nagasundaram N, Yeh HY. Classifying Promoters by Interpreting the Hidden Information of DNA Sequences via Deep Learning and Combination of Continuous FastText N-Grams. Front Bioeng Biotechnol 2019; 7:305. [PMID: 31750297 PMCID: PMC6848157 DOI: 10.3389/fbioe.2019.00305] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Received: 08/13/2019] [Accepted: 10/17/2019] [Indexed: 01/16/2023]
Abstract
A promoter is a short region of DNA (100-1,000 bp) where transcription of a gene by RNA polymerase begins. It is typically located directly upstream or at the 5' end of the transcription initiation site. DNA promoter has been proven to be the primary cause of many human diseases, especially diabetes, cancer, or Huntington's disease. Therefore, classifying promoters has become an interesting problem and it has attracted the attention of a lot of researchers in the bioinformatics field. There were a variety of studies conducted to resolve this problem, however, their performance results still require further improvement. In this study, we will present an innovative approach by interpreting DNA sequences as a combination of continuous FastText N-grams, which are then fed into a deep neural network in order to classify them. Our approach is able to attain a cross-validation accuracy of 85.41 and 73.1% in the two layers, respectively. Our results outperformed the state-of-the-art methods on the same dataset, especially in the second layer (strength classification). Throughout this study, promoter regions could be identified with high accuracy and it provides analysis for further biological research as well as precision medicine. In addition, this study opens new paths for the natural language processing application in omics data in general and DNA sequences in particular.
Affiliation(s)
- Nguyen Quoc Khanh Le
  - Professional Master Program in Artificial Intelligence in Medicine, Taipei Medical University, Taipei, Taiwan
- N. Nagasundaram
  - Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
- Hui-Yuan Yeh
  - Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, Singapore, Singapore
27
iN6-methylat (5-step): identifying DNA N6-methyladenine sites in rice genome using continuous bag of nucleobases via Chou’s 5-step rule. Mol Genet Genomics 2019; 294:1173-1182. [DOI: 10.1007/s00438-019-01570-y] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Received: 02/03/2019] [Accepted: 04/25/2019] [Indexed: 12/21/2022]
28
Le NQK, Yapp EKY, Ho QT, Nagasundaram N, Ou YY, Yeh HY. iEnhancer-5Step: Identifying enhancers using hidden information of DNA sequences via Chou's 5-step rule and word embedding. Anal Biochem 2019; 571:53-61. [PMID: 30822398 DOI: 10.1016/j.ab.2019.02.017] [Citation(s) in RCA: 77] [Impact Index Per Article: 12.8] [Received: 01/22/2019] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 12/22/2022]
Abstract
An enhancer is a short (50-1,500 bp) region of DNA that plays an important role in gene expression and the production of RNA and proteins. Genetic variation in enhancers has been linked to many human diseases, such as cancer, disorder or inflammatory bowel disease. Due to the importance of enhancers in genomics, the classification of enhancers has become a popular area of research in computational biology. Although a few computational tools have been employed to address this problem, their performance still requires improvement. In this study, we represent enhancers using word embeddings that include sub-word information of their biological words, which then serve as features fed into a support vector machine algorithm for classification. We present iEnhancer-5Step, a web server containing two-layer classifiers to identify enhancers and their strength. We attain an independent test accuracy of 79% and 63.5% in the two layers, respectively. Compared to current predictors on the same dataset, our proposed method yields superior performance. Moreover, this study provides a basis for further research that can enrich the field of applying natural language processing techniques to biological sequences. iEnhancer-5Step is freely accessible via http://biologydeep.com/fastenc/.
Affiliation(s)
- Nguyen Quoc Khanh Le
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
- Edward Kien Yee Yapp
- Singapore Institute of Manufacturing Technology, 2 Fusionopolis Way, #08-04, Innovis, 138634, Singapore
- Quang-Thai Ho
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
- N Nagasundaram
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, 32003, Taiwan
- Hui-Yuan Yeh
- Medical Humanities Research Cluster, School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 639798, Singapore
29
Abstract
Motivation: The identification of novel drug-target (DT) interactions is a substantial part of the drug discovery process. Most of the computational methods that have been proposed to predict DT interactions have focused on binary classification, where the goal is to determine whether a DT pair interacts or not. However, protein-ligand interactions assume a continuum of binding strength values, also called binding affinity, and predicting this value remains a challenge. The increase in the affinity data available in DT knowledge-bases allows the use of advanced learning techniques such as deep learning architectures in the prediction of binding affinities. In this study, we propose a deep-learning based model that uses only sequence information of both targets and drugs to predict DT interaction binding affinities. The few studies that focus on DT binding affinity prediction use either 3D structures of protein-ligand complexes or 2D features of compounds. One novel approach used in this work is the modeling of protein sequences and compound 1D representations with convolutional neural networks (CNNs). Results: The results show that the proposed deep learning based model that uses the 1D representations of targets and drugs is an effective approach for drug target binding affinity prediction. The model in which high-level representations of a drug and a target are constructed via CNNs achieved the best Concordance Index (CI) performance in one of our larger benchmark datasets, outperforming the KronRLS algorithm and SimBoost, a state-of-the-art method for DT binding affinity prediction. Availability and implementation: https://github.com/hkmztrk/DeepDTA. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hakime Öztürk
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
- Arzucan Özgür
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
- Elif Ozkirimli
- Department of Chemical Engineering, Bogazici University, Istanbul, Turkey