1
Liu M, Wang K, Zhang Y, Zhou X, Li W, Han W. Mechanistic Study of Protein Interaction with Natto Inhibitory Peptides Targeting Xanthine Oxidase: Insights from Machine Learning and Molecular Dynamics Simulations. J Chem Inf Model 2025; 65:3682-3696. [PMID: 40125929 DOI: 10.1021/acs.jcim.5c00126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/25/2025]
Abstract
Bioactive peptides from food sources offer a safe and biocompatible approach to enzyme inhibition, with potential applications in managing metabolic disorders such as hyperuricemia and gout, conditions linked to excessive xanthine oxidase activity. Using a machine learning-based screening approach inspired by the bioactivity of natto, two peptides, ECFK and FECK, were identified from the Bacillus subtilis proteome and validated as xanthine oxidase inhibitors with IC50 values of 37.36 and 71.57 mM, respectively. Further experiments confirmed their safety through cytotoxicity assays, and electronic tongue analysis demonstrated their mild sensory properties, supporting their edibility. Molecular dynamics simulations revealed that these peptides stabilize critical enzyme regions, with ECFK showing a higher dissociation energy barrier (52.08 kcal/mol) than FECK (46.39 kcal/mol), indicating strong, stable interactions. This study highlights food-derived peptides as safe and natural inhibitors of xanthine oxidase, offering promising therapeutic potential for metabolic disorder management.
Affiliation(s)
- Minghao Liu
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
- Kaiyu Wang
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
- Yan Zhang
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
- Xue Zhou
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
- Wannan Li
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
- Weiwei Han
  - Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Science, Jilin University, 2699 Qianjin Street, Changchun 130012, China
2
Ünlü A, Ulusoy E, Yiğit MG, Darcan M, Doğan T. Protein language models for predicting drug-target interactions: Novel approaches, emerging methods, and future directions. Curr Opin Struct Biol 2025; 91:103017. [PMID: 39985946 DOI: 10.1016/j.sbi.2025.103017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 01/28/2025] [Accepted: 01/29/2025] [Indexed: 02/24/2025]
Abstract
Identifying new drug candidates remains a critical and complex challenge in drug development. Recent advances in deep learning have demonstrated significant potential to accelerate this process, particularly through the use of protein language models (pLMs). These models aim to effectively capture the structural and functional properties of proteins by embedding them in high-dimensional spaces, thereby providing powerful tools for predictive tasks. This review examines the application of pLMs in drug-target interaction (DTI) prediction, addressing both small-molecule and protein-based therapeutics. We explore diverse methodologies, including end-to-end learning models and those that leverage pre-trained foundational pLMs. Furthermore, we highlight the role of heterogeneous data integration, ranging from protein structures to knowledge graphs, to improve the accuracy of DTI predictions. Despite notable progress, challenges persist in accurately identifying DTIs, mainly due to data-related limitations and algorithmic constraints. Future research directions include utilising multimodal learning approaches, incorporating temporal/dynamic interaction data into training, and employing novel deep learning architectures to refine protein representations, gain a deeper understanding of biological context regarding molecular interactions, and, thus, advance the DTI prediction field.
Affiliation(s)
- Atabey Ünlü
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye
- Erva Ulusoy
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye
- Melih Gökay Yiğit
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Computer Engineering, Middle East Technical University, 06800, Ankara, Türkiye
- Melih Darcan
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye
- Tunca Doğan
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Health Informatics, Institute of Informatics, Hacettepe University, 06800, Ankara, Türkiye
3
Shao Y, Liu T. iNClassSec-ESM: Discovering potential non-classical secreted proteins through a novel protein language model. Comput Struct Biotechnol J 2025; 27:1350-1358. [PMID: 40235638 PMCID: PMC11999076 DOI: 10.1016/j.csbj.2025.03.043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 03/15/2025] [Accepted: 03/26/2025] [Indexed: 04/17/2025] Open
Abstract
Non-classical secreted proteins (NCSPs) are a class of proteins lacking signal peptides, secreted by Gram-positive bacteria through non-classical secretion pathways. With the increasing demand for highly secreted proteins in recent years, non-classical secretion pathways have received more attention due to their advantages over classical secretion pathways (Sec/Tat). However, because the mechanisms of non-classical secretion pathways are not yet clear, identifying NCSPs through biological experiments is expensive and time-consuming, making it imperative to develop computational methods to address this issue. Existing NCSP prediction methods mainly use traditional handcrafted features to represent proteins from sequence information, which limits the models' ability to capture complex protein characteristics. In this study, we proposed a novel NCSP predictor, iNClassSec-ESM, which combined deep learning with traditional classifiers to enhance prediction performance. iNClassSec-ESM integrates an XGBoost model trained on comprehensive handcrafted features and a Deep Neural Network (DNN) trained on hidden layer embeddings from the protein language model (PLM) ESM3. ESM3 is a recently proposed multimodal PLM and has not yet been fully explored in terms of protein representation. Therefore, we extracted hidden layer embeddings from ESM3 as inputs for multiple classifiers and deep learning networks, and compared them with existing PLMs. Benchmark experiments indicate that iNClassSec-ESM outperforms most existing methods across multiple performance metrics and could serve as an effective tool for discovering potential NCSPs. Additionally, the ESM3 hidden layer embeddings, as an innovative protein representation method, show great potential for application in broader protein-related classification tasks. The source code of iNClassSec-ESM and the ESM3 embeddings extraction script are publicly available at https://github.com/AmamiyaHoshie/iNClassSec-ESM/.
Affiliation(s)
- Yizhou Shao
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China
4
Du Z, Fu W, Guo X, Caragea D, Li Y. FusionESP: Improved Enzyme-Substrate Pair Prediction by Fusing Protein and Chemical Knowledge. J Chem Inf Model 2025; 65:2806-2817. [PMID: 40035691 DOI: 10.1021/acs.jcim.4c02357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2025]
Abstract
To reduce the cost of the experimental characterization of the potential substrates for enzymes, machine learning prediction models offer an alternative solution. Pretrained language models, as powerful approaches for protein and molecule representation, have been employed in the development of enzyme-substrate prediction models, achieving promising performance. In addition to continuing improvements in language models, effectively fusing encoders to handle multimodal prediction tasks is critical for further enhancing model performance by using available representation methods. Here, we present FusionESP, a multimodal architecture that integrates protein and chemistry language models with two independent projection heads and a contrastive learning strategy for predicting enzyme-substrate pairs. Our best model achieved state-of-the-art performance with an accuracy of 94.77% on independent test data and exhibited better generalization capacity while requiring fewer computational resources and training data, compared to previous studies that fine-tuned an encoder or employed more encoders. It also confirmed our hypothesis that embeddings of positive pairs are closer to each other in a high-dimensional space, while negative pairs exhibit the opposite trend. Our ablation studies showed that the projection heads played a crucial role in performance enhancement, while the contrastive learning strategy further improved the projection heads' capacity in classification tasks. The proposed architecture is expected to be further applied to enhance performance in additional multimodal prediction tasks in biology. A user-friendly web server of FusionESP is established and freely accessible at https://rqkjkgpsyu.us-east-1.awsapprunner.com/.
Affiliation(s)
- Zhenjiao Du
  - Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States
- Weimin Fu
  - Department of Electrical and Computer Engineering, Kansas State University, Manhattan, Kansas 66506, United States
- Xiaolong Guo
  - Department of Electrical and Computer Engineering, Kansas State University, Manhattan, Kansas 66506, United States
- Doina Caragea
  - Department of Computer Science, Kansas State University, Manhattan, Kansas 66506, United States
- Yonghui Li
  - Department of Grain Science and Industry, Kansas State University, Manhattan, Kansas 66506, United States
5
NaderiAlizadeh N, Singh R. Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances 2025; 5:vbaf060. [PMID: 40170888 PMCID: PMC11961220 DOI: 10.1093/bioadv/vbaf060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2024] [Revised: 02/13/2025] [Accepted: 03/17/2025] [Indexed: 04/03/2025]
Abstract
Motivation: Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e. per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations. Results: We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling for several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences by capturing essential information that might be lost through average pooling. Availability and implementation: Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
Affiliation(s)
- Navid NaderiAlizadeh
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, United States
- Rohit Singh
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, United States
  - Department of Cell Biology, Duke University, Durham, NC 27705, United States
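The aggregation idea in the NaderiAlizadeh and Singh abstract above can be illustrated with a minimal NumPy sketch that contrasts average pooling with a sliced-Wasserstein-style embedding. This is a simplified illustration under stated assumptions, not the code from the PLM_SWE repository: the function names, the randomly drawn reference set and slice directions, and the quantile-interpolation step are all choices made for this example.

```python
import numpy as np


def mean_pool(residue_emb):
    """Average pooling: collapse an (n_residues, d) array to a (d,) vector."""
    return residue_emb.mean(axis=0)


def sliced_wasserstein_embed(residue_emb, reference, directions):
    """Hypothetical helper: fixed-length embedding of a variable-length set.

    residue_emb: (n, d) per-residue embeddings for one protein
    reference:   (m, d) reference points shared by all proteins (assumed given)
    directions:  (d, L) unit-norm slicing directions (assumed given)
    Returns a vector of length m * L, whatever n is.
    """
    n, m = residue_emb.shape[0], reference.shape[0]
    # Project both point sets onto each 1D slice and sort along each slice.
    proj_x = np.sort(residue_emb @ directions, axis=0)    # (n, L)
    proj_ref = np.sort(reference @ directions, axis=0)    # (m, L)
    # Resample the protein's sorted projections at the reference's m quantile levels.
    q_x = (np.arange(n) + 0.5) / n
    q_ref = (np.arange(m) + 0.5) / m
    resampled = np.column_stack(
        [np.interp(q_ref, q_x, proj_x[:, k]) for k in range(directions.shape[1])]
    )                                                     # (m, L)
    # In 1D, optimal transport matches sorted samples, so the displacement
    # from reference to protein along each slice is a simple difference.
    return (resampled - proj_ref).ravel()


rng = np.random.default_rng(0)
d, m, n_slices = 8, 16, 4
reference = rng.normal(size=(m, d))
directions = rng.normal(size=(d, n_slices))
directions /= np.linalg.norm(directions, axis=0, keepdims=True)

short_protein = rng.normal(size=(7, d))     # stand-in for a 7-residue protein
long_protein = rng.normal(size=(50, d))     # stand-in for a 50-residue protein
emb_short = sliced_wasserstein_embed(short_protein, reference, directions)
emb_long = sliced_wasserstein_embed(long_protein, reference, directions)
```

Both proteins end up with vectors of length `m * n_slices` regardless of sequence length. Unlike the mean, the vector keeps information about how the residue embeddings are distributed along each slice.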
6
Ullanat V, Jing B, Sledzieski S, Berger B. Learning the language of protein-protein interactions. bioRxiv 2025:2025.03.09.642188. [PMID: 40166198 PMCID: PMC11956943 DOI: 10.1101/2025.03.09.642188] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Protein Language Models (PLMs) trained on large databases of protein sequences have proven effective in modeling protein biology across a wide range of applications. However, while PLMs excel at capturing individual protein properties, they face challenges in natively representing protein-protein interactions (PPIs), which are crucial to understanding cellular processes and disease mechanisms. Here, we introduce MINT, a PLM specifically designed to model sets of interacting proteins in a contextual and scalable manner. Using unsupervised training on a large curated PPI dataset derived from the STRING database, MINT outperforms existing PLMs in diverse tasks relating to protein-protein interactions, including binding affinity prediction and estimation of mutational effects. Beyond these core capabilities, it excels at modeling interactions in complex protein assemblies and surpasses specialized models in antibody-antigen modeling and T cell receptor-epitope binding prediction. MINT's predictions of mutational impacts on oncogenic PPIs align with experimental studies, and it provides reliable estimates for the potential for cross-neutralization of antibodies against SARS-CoV-2 variants of concern. These findings position MINT as a powerful tool for elucidating complex protein interactions, with significant implications for biomedical research and therapeutic discovery.
Affiliation(s)
- Varun Ullanat
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Bowen Jing
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Samuel Sledzieski
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
  - Center for Computational Biology, Flatiron Institute, New York, NY
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
  - Department of Mathematics, Massachusetts Institute of Technology, MA
7
Zhou F, Zhang S, Zhang H, Liu JK. ProCeSa: Contrast-Enhanced Structure-Aware Network for Thermostability Prediction with Protein Language Models. J Chem Inf Model 2025; 65:2304-2313. [PMID: 39988825 PMCID: PMC11898056 DOI: 10.1021/acs.jcim.4c01752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2024] [Revised: 02/14/2025] [Accepted: 02/17/2025] [Indexed: 02/25/2025]
Abstract
Proteins play a fundamental role in biology, and their thermostability is essential for their proper functionality. The precise measurement of thermostability is crucial, traditionally relying on resource-intensive experiments. Recent advances in deep learning, particularly in protein language models (PLMs), have significantly accelerated the progress in protein thermostability prediction. These models utilize various biological characteristics or deep representations generated by PLMs to represent the protein sequences. However, effectively incorporating structural information, based on the PLM embeddings, while not considering atomic protein structures, remains an open and formidable challenge. Here, we propose a novel Protein Contrast-enhanced Structure-Aware (ProCeSa) model that seamlessly integrates both sequence and structural information extracted from PLMs to enhance thermostability prediction. Our model employs a contrastive learning scheme guided by the categories of amino acid residues, allowing it to discern intricate patterns within protein sequences. Rigorous experiments conducted on publicly available data sets establish the superiority of our method over state-of-the-art approaches, excelling in both classification and regression tasks. Our results demonstrate that ProCeSa addresses the complex challenge of predicting protein thermostability by utilizing PLM-derived sequence embeddings, without requiring access to atomic structural data.
Affiliation(s)
- Shuo Zhang
  - School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
- Jian K. Liu
  - School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
8
Yin J, Zhang H, Sun X, You N, Mou M, Lu M, Pan Z, Li F, Li H, Zeng S, Zhu F. Decoding Drug Response With Structurized Gridding Map-Based Cell Representation. IEEE J Biomed Health Inform 2025; 29:1702-1713. [PMID: 38090819 DOI: 10.1109/jbhi.2023.3342280] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Indexed: 03/08/2025]
Abstract
A thorough understanding of cell-line drug response mechanisms is crucial for drug development, repurposing, and resistance reversal. While targeted anticancer therapies have shown promise, not all cancers have well-established biomarkers to stratify drug response. Single-gene associations only explain a small fraction of the observed drug sensitivity, so a more comprehensive method is needed. However, while deep learning models have shown promise in predicting drug response in cell lines, they still face significant challenges when applied in clinical settings. Therefore, this study proposed a new strategy called DD-Response for cell-line drug response prediction. First, a limitation of narrow modeling horizons was overcome to expand the model training domain by integrating multiple datasets through source-specific label binarization. Second, a modified representation based on a two-dimensional structurized gridding map (SGM) was developed for cell lines and drugs, avoiding feature correlation neglect and potential information loss. Third, a dual-branch, multi-channel convolutional neural network-based model for pairwise response prediction was constructed, enabling accurate outcomes and improved exploration of underlying mechanisms. As a result, DD-Response demonstrated superior performance, captured cell-line characteristic variations, and provided insights into key factors impacting cell-line drug response. In addition, DD-Response exhibited scalability in predicting clinical patient responses to drug therapy. Overall, because of its excellent ability to predict drug response and capture the key molecules behind it, DD-Response is expected to greatly facilitate drug discovery, repurposing, resistance reversal, and therapeutic optimization.
9
Wen S, Han Y, Li Y, Zhan D. Therapeutic Mechanisms of Medicine Food Homology Plants in Alzheimer's Disease: Insights from Network Pharmacology, Machine Learning, and Molecular Docking. Int J Mol Sci 2025; 26:2121. [PMID: 40076742 PMCID: PMC11899993 DOI: 10.3390/ijms26052121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2025] [Revised: 02/21/2025] [Accepted: 02/24/2025] [Indexed: 03/14/2025] Open
Abstract
Alzheimer's disease (AD) is a progressive neurodegenerative disorder characterized by a gradual decline in cognitive function. Currently, there are no effective treatments for this condition. Medicine food homology plants have gained increasing attention as potential natural treatments for AD because of their nutritional value and therapeutic benefits. In this work, we aimed to provide a deeper understanding of how medicine food homology plants may help alleviate or potentially treat AD by identifying key targets, pathways, and small molecule compounds from 10 medicine food homology plants that play an important role in this process. Using network pharmacology, we identified 623 common targets between AD and the compounds from the selected 10 plants, including crucial proteins such as STAT3, IL6, TNF, and IL1B. Additionally, the small molecules from the selected plants were grouped into four clusters using hierarchical clustering. The ConPlex algorithm was then applied to predict the binding capabilities of these small molecules to the key protein targets. Cluster 3 showed superior predicted binding capabilities to STAT3, TNF, and IL1B, which was further validated by molecular docking. Scaffold analysis of small molecules in Cluster 3 revealed that those with a steroid-like core (three fused six-membered rings and one five-membered ring with a carbon-carbon double bond) exhibited better predicted binding affinities and were potential triple-target inhibitors. Among them, MOL005439, MOL000953, and MOL005438 were identified as the top-performing compounds. This study highlights the potential of medicine food homology plants as a source of active compounds that could be developed into new drugs for AD treatment. However, further pharmacokinetic studies are essential to assess their efficacy and minimize side effects.
Affiliation(s)
- Shuran Wen
  - College of Food Science and Engineering, Jilin Agricultural University, 2888 Xincheng Street, Changchun 130118, China
- Ye Han
  - College of Plant Protection, Jilin Agricultural University, 2888 Xincheng Street, Changchun 130118, China
- You Li
  - College of Life Science, Jilin Agricultural University, 2888 Xincheng Street, Changchun 130118, China
- Dongling Zhan
  - College of Food Science and Engineering, Jilin Agricultural University, 2888 Xincheng Street, Changchun 130118, China
10
Ge L, Gao Q, He J, Wang X, Huang J, Zhang H, Qin Z. MultiT2: A Tool Connecting the Multimodal Data for Bacterial Aromatic Polyketide Natural Products. ACS Omega 2025; 10:5105-5110. [PMID: 39959056 PMCID: PMC11822507 DOI: 10.1021/acsomega.4c11266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2024] [Revised: 01/15/2025] [Accepted: 01/23/2025] [Indexed: 02/18/2025]
Abstract
The integration of artificial intelligence (AI) into natural product science is an exciting and rapidly evolving area of research. By combining classical chemistry and biology with deep learning, these technologies have significantly improved research efficiency, particularly in overcoming laborious and time-consuming processes. Recently, there has been growing interest in leveraging multimodal algorithms to integrate biologically relevant yet mathematically disparate data sets in order to reorganize knowledge graphs. However, to the best of our knowledge, no studies have yet applied this approach specifically within the natural product field. This is largely because correlating multimodal natural product data is challenging due to their high degree of fragmentation. Here, we present MultiT2, an algorithm that connects these disparate data from bacterial aromatic polyketides, which form a medically important natural product family, as a showcase. Through a large-scale causal inference process, this approach aims to transcend mere prediction, unlocking new dimensions in the natural product discovery and research domains.
Affiliation(s)
- Xiaoyu Wang
  - Center for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China
- Jiaquan Huang
  - Center for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China
- Heqian Zhang
  - Center for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China
- Zhiwei Qin
  - Center for Biological Science and Technology, Advanced Institute of Natural Sciences, Beijing Normal University, Zhuhai, Guangdong 519087, China
11
Creanza TM, Alberga D, Patruno C, Mangiatordi GF, Ancona N. Transformer Decoder Learns from a Pretrained Protein Language Model to Generate Ligands with High Affinity. J Chem Inf Model 2025; 65:1258-1277. [PMID: 39871540 DOI: 10.1021/acs.jcim.4c02019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/29/2025]
Abstract
The drug discovery process can be significantly accelerated by using deep learning methods to suggest molecules with druglike features and, more importantly, that are good candidates to bind specific proteins of interest. We present a novel deep learning generative model, Prot2Drug, that learns to generate ligands binding specific targets leveraging (i) the information carried by a pretrained protein language model and (ii) the ability of transformers to capitalize on the knowledge gathered from thousands of protein-ligand interactions. The embedding unveils the recipe to follow for designing molecules binding a given protein, and Prot2Drug translates such instructions by using the syntax of the molecular language, generating novel compounds that are predicted to have favorable physicochemical properties and high affinity toward specific targets. Moreover, Prot2Drug reproduced numerous known interactions between compounds and proteins used for generating them and suggested novel protein targets for known compounds, indicating potential drug repurposing strategies. Remarkably, Prot2Drug facilitates the design of promising ligands even for protein targets with limited or no information about their ligands or 3D structure.
Affiliation(s)
- Teresa Maria Creanza
  - Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Domenico Alberga
  - Institute of Crystallography, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Cosimo Patruno
  - Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
- Nicola Ancona
  - Institute of Intelligent Industrial Technologies and Systems for Advanced Manufacturing, Consiglio Nazionale delle Ricerche, Via G. Amendola, 122/d, Bari 70126, Italy
12
Talo M, Bozdag S. Top-DTI: Integrating Topological Deep Learning and Large Language Models for Drug Target Interaction Prediction. bioRxiv 2025:2025.02.07.637146. [PMID: 39975019 PMCID: PMC11839103 DOI: 10.1101/2025.02.07.637146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Motivation: The accurate prediction of drug-target interactions (DTI) is a crucial step in drug discovery, providing a foundation for identifying novel therapeutics. Traditional drug development is both costly and time-consuming, often spanning over a decade. Computational approaches help narrow the pool of compound candidates, offering significant starting points for experimental validation. In this study, we propose the Top-DTI framework for predicting DTI by integrating topological data analysis (TDA) with large language models (LLMs). Top-DTI leverages persistent homology to extract topological features from protein contact maps and drug molecular images. Simultaneously, protein and drug LLMs generate semantically rich embeddings that capture sequential and contextual information from protein sequences and drug SMILES strings. By combining these complementary features, Top-DTI enhances predictive performance and robustness. Results: Experimental results on the public BioSNAP and Human DTI benchmark datasets demonstrate that the proposed Top-DTI model outperforms state-of-the-art approaches across multiple evaluation metrics, including AUROC, AUPRC, sensitivity, and specificity. Furthermore, the Top-DTI model achieves superior performance in the challenging cold-split scenario, where the test and validation sets contain drugs or targets absent from the training set. This setting simulates real-world scenarios and highlights the robustness of the model. Notably, incorporating topological features alongside LLM embeddings significantly improves predictive performance, underscoring the value of integrating structural and sequence-based representations. Availability: The data and source code of Top-DTI are available at https://github.com/bozdaglab/Top_DTI under the Creative Commons Attribution Non Commercial 4.0 International Public License.
Affiliation(s)
- Muhammed Talo
  - Department of Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA
  - BioDiscovery Institute, University of North Texas, Denton, TX 76207, USA
  - Center for Computational Life Sciences, University of North Texas, Denton, TX 76207, USA
- Serdar Bozdag
  - Department of Computer Science and Engineering, University of North Texas, Denton, TX 76207, USA
  - BioDiscovery Institute, University of North Texas, Denton, TX 76207, USA
  - Center for Computational Life Sciences, University of North Texas, Denton, TX 76207, USA
  - Department of Mathematics, University of North Texas, Denton, TX 76207, USA
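The cold-split scenario mentioned in the Top-DTI abstract above, where held-out drugs or targets never appear in training, can be sketched in a few lines of standard-library Python. The function name `cold_drug_split`, the toy identifiers, and the choice to hold out drugs rather than targets are all assumptions made for illustration. The sketch is not taken from the Top_DTI repository.

```python
import random


def cold_drug_split(pairs, test_fraction=0.2, seed=0):
    """Split (drug, target, label) triples so that test drugs never appear in training."""
    drugs = sorted({drug for drug, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(drugs)
    # Hold out a fraction of the distinct drugs (at least one), not of the pairs.
    n_test = max(1, round(len(drugs) * test_fraction))
    held_out = set(drugs[:n_test])
    train = [p for p in pairs if p[0] not in held_out]
    test = [p for p in pairs if p[0] in held_out]
    return train, test


# Toy interaction records; the identifiers are made up for illustration.
pairs = [
    ("drugA", "P1", 1), ("drugA", "P2", 0),
    ("drugB", "P1", 0), ("drugB", "P3", 1),
    ("drugC", "P2", 1), ("drugD", "P3", 0),
    ("drugE", "P1", 1), ("drugE", "P2", 1),
]
train, test = cold_drug_split(pairs, test_fraction=0.4)
```

Splitting on drug identity rather than on individual pairs is what makes the evaluation "cold": a random split over pairs would let the same drug appear on both sides. A cold-target split would key on `p[1]` instead.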
13
Schuh MG, Boldini D, Bohne AI, Sieber SA. Barlow Twins deep neural network for advanced 1D drug-target interaction prediction. J Cheminform 2025; 17:18. [PMID: 39910404 PMCID: PMC11800607 DOI: 10.1186/s13321-025-00952-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2024] [Accepted: 01/08/2025] [Indexed: 02/07/2025] Open
Abstract
Accurate prediction of drug-target interactions is critical for advancing drug discovery. By reducing time and cost, machine learning and deep learning can accelerate this laborious discovery process. In a novel approach, BarlowDTI, we utilise the powerful Barlow Twins architecture for feature extraction while considering the structure of the target protein. Our method achieves state-of-the-art predictive performance against multiple established benchmarks using only one-dimensional input. The use of our hybrid approach of deep learning and gradient boosting machine as the underlying predictor ensures fast and efficient predictions without the need for substantial computational resources. We also propose the use of an influence method to investigate how the model reaches its decision based on individual training samples. By comparing co-crystal structures, we find that BarlowDTI effectively exploits catalytically active and stabilising residues, highlighting the model's ability to generalise from one-dimensional input data. In addition, we further benchmark new baselines against existing methods. Together, these innovations improve the efficiency and effectiveness of drug-target interaction predictions, providing robust tools for accelerating drug development and deepening the understanding of molecular interactions. Therefore, we provide an easy-to-use web interface that can be freely accessed at https://www.bio.nat.tum.de/oc2/barlowdti. SCIENTIFIC CONTRIBUTION: Our computationally efficient and effective hybrid approach, combining the deep learning model Barlow Twins and gradient boosting machines, outperforms state-of-the-art methods across multiple splits and benchmarks using only one-dimensional input. Furthermore, we advance the field by proposing an influence method that elucidates model decision-making, thereby providing deeper insights into molecular interactions and improving the interpretability of drug-target interaction predictions.
Affiliation(s)
- Maximilian G Schuh
  - Chair of Organic Chemistry II, Department of Bioscience, TUM School of Natural Sciences, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Ernst-Otto-Fischer Str. 8, 85748, Garching bei München, Bavaria, Germany
- Davide Boldini
  - Chair of Organic Chemistry II, Department of Bioscience, TUM School of Natural Sciences, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Ernst-Otto-Fischer Str. 8, 85748, Garching bei München, Bavaria, Germany
- Annkathrin I Bohne
  - Chair of Biochemistry, Department of Bioscience, TUM School of Natural Sciences, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Ernst-Otto-Fischer Str. 8, 85748, Garching bei München, Bavaria, Germany
- Stephan A Sieber
  - Chair of Organic Chemistry II, Department of Bioscience, TUM School of Natural Sciences, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Ernst-Otto-Fischer Str. 8, 85748, Garching bei München, Bavaria, Germany
14
Peng Y, Wu J, Sun Y, Zhang Y, Wang Q, Shao S. Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction. Nat Commun 2025; 16:1299. [PMID: 39900608 PMCID: PMC11791096 DOI: 10.1038/s41467-025-56526-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Accepted: 01/15/2025] [Indexed: 02/05/2025] [Open Access]
Abstract
Identifying and characterizing virulence proteins secreted by Gram-negative bacteria are fundamental for deciphering microbial pathogenicity as well as aiding the development of therapeutic strategies. Effector predictors utilizing pre-trained protein language models (PLMs) have shown sound performance by leveraging extensive evolutionary and sequential protein features. However, the accuracy and sensitivity of effector prediction remain challenging. Here, we introduce a model named Contrastive-learning of Language Embedding and Biological Features (CLEF) leveraging contrastive learning to integrate PLM representations with supplementary biological features. Biological information is captured in learned contextualized embeddings to yield meaningful representations. With cross-modality biological features, CLEF outperforms state-of-the-art (SOTA) models in predicting type III, type IV, and type VI secreted effectors (T3SEs/T4SEs/T6SEs) in enteric pathogens. All experimentally verified effectors in Enterohemorrhagic Escherichia coli and 41 of 43 experimentally verified T3SEs of Salmonella Typhimurium are recognized. Moreover, 12 predicted T3SEs and 11 predicted T6SEs are validated by extensive experiments in Edwardsiella piscicida. Furthermore, integrating omics data via the CLEF framework enhances protein representations to illustrate effector-effector interactions and determine in vivo colonization-essential genes. Collectively, CLEF provides a blueprint to bridge the gap between in silico PLM's capacity and experimental biological information to fulfill complicated tasks.
Affiliation(s)
- Yue Peng
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Junze Wu
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Yi Sun
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
- Yuanxing Zhang
  - Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai), 519000, Zhuhai, China
- Qiyao Wang
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
  - Shanghai Engineering Research Center of Maricultured Animal Vaccines, Shanghai, China
  - Laboratory of Aquatic Animal Diseases of MOA, Shanghai, China
- Shuai Shao
  - State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology, Shanghai, China
  - Shanghai Engineering Research Center of Maricultured Animal Vaccines, Shanghai, China
  - Laboratory of Aquatic Animal Diseases of MOA, Shanghai, China
15
Yoon MS, Bae B, Kim K, Park H, Baek M. Deep learning methods for proteome-scale interaction prediction. Curr Opin Struct Biol 2025; 90:102981. [PMID: 39848140 DOI: 10.1016/j.sbi.2024.102981] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 11/13/2024] [Accepted: 12/22/2024] [Indexed: 01/25/2025]
Abstract
Proteome-scale interaction prediction is essential for understanding protein functions and disease mechanisms. Traditional experimental methods are often limited by scale and complexity, driving the need for computational approaches. Deep learning has emerged as a powerful tool, enabling high-throughput, accurate predictions of protein interactions. This review highlights recent advances in deep learning methods for protein-protein and protein-ligand interaction screening, along with datasets used for model training. Despite the progress with deep learning, challenges such as data quality and validation biases remain. We also discuss the increasing importance of integrating structural information to enhance prediction accuracy and how structure-based deep learning approaches can help overcome current limitations, ultimately advancing biological research and drug discovery.
Affiliation(s)
- Min Su Yoon
  - Department of Biological Sciences, Seoul National University, Seoul 08826, Republic of Korea
- Byunghyun Bae
  - Department of Chemistry, Seoul National University, Seoul 08826, Republic of Korea; Biomedical Research Division, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
- Kunhee Kim
  - Department of Biological Sciences, Seoul National University, Seoul 08826, Republic of Korea
- Hahnbeom Park
  - Biomedical Research Division, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea; KIST-SKKU Brain Research Center, SKKU Institute for Convergence, Sungkyunkwan University, Suwon 16419, Republic of Korea
- Minkyung Baek
  - Department of Biological Sciences, Seoul National University, Seoul 08826, Republic of Korea
16
McNutt AT, Adduri AK, Ellington CN, Dayao MT, Xing EP, Mohimani H, Koes DR. Scaling Structure Aware Virtual Screening to Billions of Molecules with SPRINT. arXiv 2025:arXiv:2411.15418v2. [PMID: 39975427 PMCID: PMC11838698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Virtual screening of small molecules against protein targets can accelerate drug discovery and development by predicting drug-target interactions (DTIs). However, structure-based methods like molecular docking are too slow to allow for broad proteome-scale screens, limiting their application in screening for off-target effects or new molecular mechanisms. Recently, vector-based methods using protein language models (PLMs) have emerged as a complementary approach that bypasses explicit 3D structure modeling. Here, we develop SPRINT, a vector-based approach for screening entire chemical libraries against whole proteomes for DTIs and novel mechanisms of action. SPRINT improves on prior work by using a self-attention based architecture and structure-aware PLMs to learn a co-embedding space for drugs and targets, enabling efficient binder prediction, search, and retrieval. SPRINT achieves SOTA enrichment factors in virtual screening on LIT-PCBA, DTI classification benchmarks, and binding affinity prediction benchmarks, while providing interpretability in the form of residue-level attention maps. In addition to being both accurate and interpretable, SPRINT is ultra-fast: querying the whole human proteome against the ENAMINE Real Database (6.7B drugs) for the 100 most likely binders per protein takes 16 minutes. SPRINT promises to enable virtual screening at an unprecedented scale, opening up new opportunities for in silico drug repurposing and development. SPRINT is available on the web as ColabScreen: https://bit.ly/colab-screen.
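The vector-based retrieval described in this abstract can be sketched as follows. This is not SPRINT's implementation: the abstract does not state the scoring function, so cosine similarity is an assumption here, and the names `cosine` and `top_k_binders` and the toy vectors are hypothetical. The sketch only shows how a shared embedding space turns "find the most likely binders" into a nearest-neighbour ranking.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_binders(target_vec, drug_vecs, k=100):
    """Return the names of the k drugs whose embeddings score highest against the target.

    drug_vecs maps a drug name to its embedding; scoring by cosine similarity
    is an illustrative assumption, not a detail confirmed by the abstract.
    """
    scored = ((cosine(target_vec, vec), name) for name, vec in drug_vecs.items())
    return [name for _, name in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    drugs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k_binders([1.0, 0.1], drugs, k=2))
```

Because drug embeddings can be computed once and reused for every protein, a query reduces to similarity scoring plus a top-k selection, which is why this family of methods scales to very large libraries.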
Affiliation(s)
- Andrew T. McNutt
  - Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Abhinav K. Adduri
  - Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Monica T. Dayao
  - Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Eric P. Xing
  - Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Mohamed Bin Zayed University of Artificial Intelligence, Masdar City, Abu Dhabi
  - Petuum Inc., Pittsburgh, PA
- Hosein Mohimani
  - Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- David R. Koes
  - Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
17
Adduri AK, McNutt AT, Ellington CN, Suraparaju K, Fang N, Yan D, Krummenacher B, Li S, Bodden C, Xing EP, Behsaz B, Koes D, Mohimani H. Interpretable adenylation domain specificity prediction using protein language models. bioRxiv 2025:2025.01.13.632878. [PMID: 39868251 PMCID: PMC11761653 DOI: 10.1101/2025.01.13.632878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Natural products have long been a rich source of diverse and clinically effective drug candidates. Non-ribosomal peptides (NRPs), polyketides (PKs), and NRP-PK hybrids are three classes of natural products that display a broad range of bioactivities, including antibiotic, antifungal, anticancer, and immunosuppressant activities. However, discovering these compounds through traditional bioactivity-guided techniques is costly and time-consuming, often resulting in the rediscovery of known molecules. Consequently, genome mining has emerged as a high-throughput strategy to screen hundreds of thousands of microbial genomes to identify their potential to produce novel natural products. Adenylation domains play a key role in the biosynthesis of NRPs and NRP-PKs by recruiting substrates to incrementally build the final structure. We propose MASPR, a machine learning method that leverages protein language models for accurate and interpretable predictions of A-domain substrate specificities. MASPR demonstrates superior accuracy and generalization over existing methods and is capable of predicting substrates not present in its training data, that is, zero-shot classification. We use MASPR to develop Seq2Hybrid, an efficient algorithm to predict the structure of hybrid NRP-PK natural products from microbial genomes. Using Seq2Hybrid, we propose putative biosynthetic gene clusters for the orphan natural products Octaminomycin A, Dityromycin, SW-163B, and JBIR-39.
Affiliation(s)
- Abhinav K Adduri
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Andrew T McNutt
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Caleb N Ellington
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Krish Suraparaju
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Nan Fang
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Donghui Yan
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Benjamin Krummenacher
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Sitong Li
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Camilla Bodden
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Eric P Xing
  - Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
  - Department of Machine Learning, Carnegie Mellon University, Pittsburgh, PA, USA
- Bahar Behsaz
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- David Koes
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Hosein Mohimani
  - Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
18
Tanoli Z, Schulman A, Aittokallio T. Validation guidelines for drug-target prediction methods. Expert Opin Drug Discov 2025; 20:31-45. [PMID: 39568436 DOI: 10.1080/17460441.2024.2430955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 11/14/2024] [Indexed: 11/22/2024]
Abstract
INTRODUCTION Mapping the interactions between pharmaceutical compounds and their molecular targets is a fundamental aspect of drug discovery and repurposing. Drug-target interactions are important for elucidating mechanisms of action and optimizing drug efficacy and safety profiles. Several computational methods have been developed to systematically predict drug-target interactions. However, computational and experimental validation of the drug-target predictions varies greatly across studies. AREAS COVERED Through a PubMed query, a corpus comprising 3,286 articles on drug-target interaction prediction published within the past decade was covered. Natural language processing was used for automated abstract classification to study the evolution of computational methods, validation strategies and performance assessment metrics in the 3,286 articles. Additionally, a manual analysis of 259 studies that performed experimental validation of computational predictions revealed prevalent experimental protocols. EXPERT OPINION Starting from 2014, there has been a noticeable increase in articles focusing on drug-target interaction prediction. Docking and regression stand out as the most commonly used techniques among computational methods, and cross-validation is frequently employed as the computational validation strategy. Testing the predictions using multiple, orthogonal validation strategies is recommended and should be reported for the specific target prediction applications. Experimental validation remains relatively rare and should be performed more routinely to evaluate biological relevance of predictions.
Affiliation(s)
- Ziaurrehman Tanoli
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
- Aron Schulman
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
- Tero Aittokallio
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, Finland
  - iCAN Digital Precision Cancer Medicine Flagship, University of Helsinki and Helsinki University Hospital, Helsinki, Finland
  - Institute for Cancer Research, Department of Cancer Genetics, Oslo University Hospital, Oslo, Norway
  - Oslo Centre for Biostatistics and Epidemiology (OCBE), Faculty of Medicine, University of Oslo, Oslo, Norway
19
Ouyang X, Feng Y, Cui C, Li Y, Zhang L, Wang H. Improving generalizability of drug-target binding prediction by pre-trained multi-view molecular representations. Bioinformatics 2024; 41:btaf002. [PMID: 39776159 PMCID: PMC11751634 DOI: 10.1093/bioinformatics/btaf002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2024] [Revised: 12/12/2024] [Accepted: 01/06/2025] [Indexed: 01/11/2025] [Open Access]
Abstract
MOTIVATION Most drugs begin their journey inside the body by binding the right target proteins. For this reason, numerous efforts have been devoted to predicting drug-target binding during drug development. However, the inherent diversity among molecular properties, coupled with limited training data availability, poses challenges to the accuracy and generalizability of these methods beyond their training domain. RESULTS In this work, we propose a neural network architecture for highly accurate and generalizable drug-target binding prediction, named Pre-trained Multi-view Molecular Representations (PMMR). The method uses pre-trained models to transfer representations of target proteins and drugs to the domain of drug-target binding prediction, mitigating the issue of poor generalizability stemming from limited data. Then, two typical representations of drug molecules, Graphs and SMILES strings, are learned respectively by a Graph Neural Network and a Transformer to achieve complementarity between local and global features. PMMR was evaluated on drug-target affinity and interaction benchmark datasets, where it outperformed peer methods, especially in generalizability under cold-start scenarios. Furthermore, a case study of cyclin-dependent kinase 2 indicated that the method has potential for drug discovery. AVAILABILITY AND IMPLEMENTATION https://github.com/NENUBioCompute/PMMR.
Affiliation(s)
- Xike Ouyang
  - School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, Jilin 130117, China
- Yannuo Feng
  - School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, Jilin 130117, China
- Chen Cui
  - School of Computer Science and Engineering, Changchun University of Technology, Changchun, Jilin 130051, China
- Yunhe Li
  - School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, Jilin 130117, China
- Li Zhang
  - School of Computer Science and Engineering, Changchun University of Technology, Changchun, Jilin 130051, China
- Han Wang
  - School of Information Science and Technology, Institute of Computational Biology, Northeast Normal University, Changchun, Jilin 130117, China
20
Liu XH, Lu ZH, Wang T, Liu F. Large language models facilitating modern molecular biology and novel drug development. Front Pharmacol 2024; 15:1458739. [PMID: 39776586 PMCID: PMC11703923 DOI: 10.3389/fphar.2024.1458739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Accepted: 12/05/2024] [Indexed: 01/11/2025] [Open Access]
Abstract
The latest breakthroughs in information technology and biotechnology have catalyzed a revolutionary shift within the modern healthcare landscape, with notable impacts from artificial intelligence (AI) and deep learning (DL). Particularly noteworthy is the adept application of large language models (LLMs), which enable seamless and efficient communication between scientific researchers and AI systems. These models capitalize on neural network (NN) architectures that demonstrate proficiency in natural language processing, thereby enhancing interactions. This comprehensive review outlines the cutting-edge advancements in the application of LLMs within the pharmaceutical industry, particularly in drug development. It offers a detailed exploration of the core mechanisms that drive these models and zeroes in on the practical applications of several models that show great promise in this domain. Additionally, this review delves into the pivotal technical and ethical challenges that arise with the practical implementation of LLMs. There is an expectation that LLMs will assume a more pivotal role in the development of innovative drugs and will ultimately contribute to the accelerated development of revolutionary pharmaceuticals.
Affiliation(s)
- Xiao-huan Liu
  - School of Biological Science, Jining Medical University, Jining, China
- Zhen-hua Lu
  - College of Chemical and Biological Engineering, Zhejiang University, Hangzhou, China
- Tao Wang
  - School of Biological Science, Jining Medical University, Jining, China
- Fei Liu
  - School of Biological Science, Jining Medical University, Jining, China
21
Heinzinger M, Weissenow K, Sanchez J, Henkel A, Mirdita M, Steinegger M, Rost B. Bilingual language model for protein sequence and structure. NAR Genom Bioinform 2024; 6:lqae150. [PMID: 39633723 PMCID: PMC11616678 DOI: 10.1093/nargab/lqae150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2024] [Revised: 08/02/2024] [Accepted: 10/21/2024] [Indexed: 12/07/2024] [Open Access]
Abstract
Adapting language models to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. For training, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein 'structure-sequence' T5 (ProstT5), we showed improved performance for subsequent, structure-related prediction tasks, leading to three orders of magnitude speedup for deriving 3Di. This will be crucial for future applications trying to search metagenomic sequence databases at the sensitivity of structure comparisons. Our work showcased the potential of pLMs to tap into the information-rich protein structure revolution fueled by AlphaFold2. ProstT5 paves the way to develop new tools integrating the vast resource of 3D predictions and opens new research avenues in the post-AlphaFold2 era.
Affiliation(s)
- Michael Heinzinger
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Konstantin Weissenow
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Joaquin Gomez Sanchez
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Adrian Henkel
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, 08826 Seoul, South Korea
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, 08826 Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, 08826 Seoul, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, 08826 Seoul, South Korea
- Burkhard Rost
  - School of Computation, Information, and Technology (CIT), Department of Informatics, Bioinformatics & Computational Biology, TUM (Technical University of Munich), 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany & TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
22
Yu X, Zhou S, Zang M, Wang Q, Liu C, Liu T. Parallel Convolutional Contrastive Learning Method for Enzyme Function Prediction. IEEE/ACM Trans Comput Biol Bioinform 2024; 21:2604-2609. [PMID: 39167509 DOI: 10.1109/tcbb.2024.3447037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/23/2024]
Abstract
The function labeling of enzymes has a wide range of application value in the medical field, industrial biology and other fields. Scientists define enzyme categories by enzyme commission (EC) numbers. At present, although there are some tools for enzyme function prediction, their performance has not yet reached the level needed for practical application. To improve the precision of enzyme function prediction, we propose a parallel convolutional contrastive learning (PCCL) method to predict enzyme functions. First, we use the advanced protein language model ESM-2 to preprocess the protein sequences. Second, PCCL combines convolutional neural networks (CNNs) and contrastive learning to improve the prediction precision of multifunctional enzymes. Contrastive learning helps the model deal with the problem of class imbalance. Finally, the deep learning framework is mainly composed of three parallel CNNs for fully extracting sample features. We compare PCCL with state-of-the-art enzyme function prediction methods based on three evaluation metrics. The performance of our model improves on both test sets. Especially on the smaller test set, PCCL improves the AUC by 2.57%.
23
Christensson G, Bocci M, Kazi JU, Durand G, Lanzing G, Pietras K, Gonzalez Velozo H, Hagerling C. Spatial Multiomics Reveals Intratumoral Immune Heterogeneity with Distinct Cytokine Networks in Lung Cancer Brain Metastases. Cancer Res Commun 2024; 4:2888-2902. [PMID: 39400127 PMCID: PMC11539001 DOI: 10.1158/2767-9764.crc-24-0201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 09/06/2024] [Accepted: 10/09/2024] [Indexed: 10/15/2024]
Abstract
The tumor microenvironment of brain metastases has become a focus in the development of immunotherapeutic drugs. However, countless patients with brain metastasis have not experienced clinical benefit. Thus, understanding the immune cell composition within brain metastases and how immune cells interact with each other and other microenvironmental cell types may be critical for optimizing immunotherapy. We applied spatial whole-transcriptomic profiling with extensive multiregional sampling (19-30 regions per sample) and multiplex IHC on formalin-fixed, paraffin-embedded lung cancer brain metastasis samples. We performed deconvolution of gene expression data to infer the abundances of immune cell populations and inferred spatial relationships from the multiplex IHC data. We also described cytokine networks between immune and tumor cells and used a protein language model to predict drug-target interactions. Finally, we performed deconvolution of bulk RNA data to assess the prognostic significance of immune-metastatic tumor cellular networks. We show that immune cell infiltration has a negative prognostic role in lung cancer brain metastases. Our in-depth multiomics analyses further reveal recurring intratumoral immune heterogeneity and the segregation of myeloid and lymphoid cells into distinct compartments that may be influenced by distinct cytokine networks. By using computational modeling, we identify drugs that may target genes expressed in both tumor core and regions bordering immune infiltrates. Finally, we illustrate the potential negative prognostic role of our immune-metastatic tumor cell networks. Our findings advocate for a paradigm shift from focusing on individual genes or cell types toward targeting networks of immune and tumor cells. 
SIGNIFICANCE Immune cell signatures are conserved across lung cancer brain metastases, and immune-metastatic tumor cell networks have a prognostic effect, implying that targeting cytokine networks between immune and metastatic tumor cells may generate more precise immunotherapeutic approaches.
Collapse
Affiliation(s)
- Gustav Christensson
- Department of Experimental Medical Science, Lund University, Lund, Sweden
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
| | - Matteo Bocci
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
- Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, Sweden
| | - Julhash U. Kazi
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
- Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, Sweden
| | - Geoffroy Durand
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
- Division of Clinical Genetics, Department of Laboratory Medicine, Lund University, Lund, Sweden
| | - Gustav Lanzing
- Department of Experimental Medical Science, Lund University, Lund, Sweden
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
| | - Kristian Pietras
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
- Division of Translational Cancer Research, Department of Laboratory Medicine, Lund University, Lund, Sweden
| | - Hugo Gonzalez Velozo
- Department of Anatomy, University of California, San Francisco, San Francisco, California
- Laboratory of Tumor Microenvironment and Metastasis, Centro Ciencia & Vida, Santiago, Chile
| | - Catharina Hagerling
- Department of Experimental Medical Science, Lund University, Lund, Sweden
- Lund University Cancer Centre (LUCC), Lund University, Lund, Sweden
|
24
|
Liu Y, Xia X, Gong Y, Song B, Zeng X. SSR-DTA: Substructure-aware multi-layer graph neural networks for drug-target binding affinity prediction. Artif Intell Med 2024; 157:102983. [PMID: 39321746 DOI: 10.1016/j.artmed.2024.102983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 09/10/2024] [Accepted: 09/13/2024] [Indexed: 09/27/2024]
Abstract
Accurate prediction of drug-target binding affinity (DTA) is essential in the field of drug discovery. Recently, scientists have been attempting to utilize artificial intelligence prediction to screen out a significant number of ineffective compounds, thereby mitigating labor and financial losses. While graph neural networks (GNNs) have been applied to DTA, existing GNNs have limitations in effectively extracting substructural features across various sizes. Functional groups play a crucial role in modulating molecular properties, but existing GNNs struggle with feature extraction from certain motifs due to scale mismatches. Additionally, sequence-based models for target proteins lack the integration of structural information. To address these limitations, we present SSR-DTA, a multi-layer graph network capable of adapting to diverse structural sizes, which can extract richer biological features, thereby improving the robustness and accuracy of predictions. Multi-layer GNNs enable the capture of molecular motifs across different scales, ranging from atomic to macrocyclic motifs. Furthermore, we introduce BiGNN to simultaneously learn sequence and structural information. Sequence information corresponds to the primary structure of proteins, while graph information represents the tertiary structure. BiGNN assimilates richer information compared to sequence-based methods while mitigating the impact of errors from predicted structures, resulting in more accurate predictions. Through rigorous experimental evaluations conducted on four benchmark datasets, we demonstrate the superiority of SSR-DTA over state-of-the-art models. Particularly, in comparison to state-of-the-art models, SSR-DTA demonstrates an impressive 20% reduction in mean squared error on the Davis dataset and a 5% reduction on the KIBA dataset, underscoring its potential as a valuable tool for advancing DTA prediction.
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China; Key Laboratory of Intelligent Computing & Signal Processing of Ministry of Education, Anhui University, Hefei, 230601, Anhui, China
| | - Xinyan Xia
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China
| | - Yongshun Gong
- School of Software, Shandong University, Jinan, 250100, Shandong, China
| | - Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410086, Hunan, China.
|
25
|
Ghislat G, Hernandez-Hernandez S, Piyawajanusorn C, Ballester PJ. Data-centric challenges with the application and adoption of artificial intelligence for drug discovery. Expert Opin Drug Discov 2024; 19:1297-1307. [PMID: 39316009 DOI: 10.1080/17460441.2024.2403639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Accepted: 09/09/2024] [Indexed: 09/25/2024]
Abstract
INTRODUCTION Artificial intelligence (AI) is exhibiting tremendous potential to reduce the massive costs and long timescales of drug discovery. There are, however, important challenges currently limiting the impact and scope of AI models. AREAS COVERED In this perspective, the authors discuss a range of data issues (bias, inconsistency, skewness, irrelevance, small size, high dimensionality), how they challenge AI models, and which issue-specific mitigations have been effective. Next, they point out the challenges faced by uncertainty quantification techniques aimed at enhancing and trusting the predictions from these AI models. They also discuss how conceptual errors, unrealistic benchmarks and performance misestimation can confound the evaluation of models and thus their development. Lastly, the authors explain how human bias, whether from AI experts or drug discovery experts, constitutes another challenge that can be alleviated by gaining more prospective experience. EXPERT OPINION AI models are often developed to excel on retrospective benchmarks unlikely to anticipate their prospective performance. As a result, only a few of these models are ever reported to have prospective value (e.g. by discovering potent and innovative drug leads for a therapeutic target). The authors have discussed what can go wrong in practice with AI for drug discovery. The authors hope that this will help inform the decisions of editors, funders, investors, and researchers working in this area.
Affiliation(s)
- Ghita Ghislat
- Department of Life Sciences, Imperial College London, London, UK
|
26
|
Li Y, Zhang X, Chen Z, Yang H, Liu Y, Wang H, Yan T, Xiang J, Wang B. Accurate prediction of drug-target interactions in Chinese and western medicine by the CWI-DTI model. Sci Rep 2024; 14:25054. [PMID: 39443630 PMCID: PMC11499656 DOI: 10.1038/s41598-024-76367-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 10/14/2024] [Indexed: 10/25/2024] Open
Abstract
Accurate prediction of drug-target interactions (DTIs) is crucial for advancing drug discovery and repurposing. Computational methods have significantly improved the efficiency of experimental predictions for drug-target interactions in Western medicine. However, accurately predicting the complex relationships between Chinese medicine ingredients and targets remains a formidable challenge due to the vast number and high heterogeneity of these ingredients. In this study, we introduce the CWI-DTI method, which achieves high-accuracy prediction of DTIs using a large dataset of interactive relationships of drug ingredients or candidate targets. Moreover, we present a novel dataset to evaluate the prediction accuracy of both Chinese and Western medicine. Through meticulous collection and preprocessing of data on ingredients and targets, we employ an innovative autoencoder framework to fuse multiple drug (target) topological similarity matrices. Additionally, we employ denoising blocks, sparse blocks, and stacked blocks to extract crucial features from the similarity matrix, reducing noise and enhancing accuracy across diverse datasets. Our results indicate that the CWI-DTI model shows improved performance compared to several existing state-of-the-art methods on the datasets tested in both Western and Chinese medicine databases. The findings of this study hold immense promise for advancing DTI prediction in Chinese and Western medicine, thus fostering more efficient drug discovery and repurposing endeavors. Our model is available at https://github.com/WANG-BIN-LAB/CWIDTI .
Affiliation(s)
- Ying Li
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Xingyu Zhang
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Zhuo Chen
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Hongye Yang
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Yuhui Liu
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Huiqing Wang
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Ting Yan
- Department of Pathology, Shanxi Key Laboratory of Carcinogenesis and Translational Research on Esophageal Cancer, Shanxi Medical University, Taiyuan, China
| | - Jie Xiang
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China
| | - Bin Wang
- Department of Computer Science and Technology, Taiyuan University of Technology, Taiyuan, China.
|
27
|
Henderson J, Nagano Y, Milighetti M, Tiffeau-Mayer A. Limits on inferring T cell specificity from partial information. Proc Natl Acad Sci U S A 2024; 121:e2408696121. [PMID: 39374400 PMCID: PMC11494314 DOI: 10.1073/pnas.2408696121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 09/03/2024] [Indexed: 10/09/2024] Open
Abstract
A key challenge in molecular biology is to decipher the mapping of protein sequence to function. To perform this mapping requires the identification of sequence features most informative about function. Here, we quantify the amount of information (in bits) that T cell receptor (TCR) sequence features provide about antigen specificity. We identify informative features by their degree of conservation among antigen-specific receptors relative to null expectations. We find that TCR specificity synergistically depends on the hypervariable regions of both receptor chains, with a degree of synergy that strongly depends on the ligand. Using a coincidence-based approach to measuring information enables us to directly bound the accuracy with which TCR specificity can be predicted from partial matches to reference sequences. We anticipate that our statistical framework will be of use for developing machine learning models for TCR specificity prediction and for optimizing TCRs for cell therapies. The proposed coincidence-based information measures might find further applications in bounding the performance of pairwise classifiers in other fields.
Affiliation(s)
- James Henderson
- Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
- Institute for the Physics of Living Systems, University College London, London WC1E 6BT, United Kingdom
| | - Yuta Nagano
- Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
- Division of Medicine, University College London, London WC1E 6BT, United Kingdom
| | - Martina Milighetti
- Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
- Cancer Institute, University College London, London WC1E 6DD, United Kingdom
| | - Andreas Tiffeau-Mayer
- Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
- Institute for the Physics of Living Systems, University College London, London WC1E 6BT, United Kingdom
|
28
|
Qiao G, Wang G, Li Y. Causal enhanced drug-target interaction prediction based on graph generation and multi-source information fusion. Bioinformatics 2024; 40:btae570. [PMID: 39312682 PMCID: PMC11639159 DOI: 10.1093/bioinformatics/btae570] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2024] [Revised: 08/17/2024] [Accepted: 09/20/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION The prediction of drug-target interaction is a vital task in the biomedical field, aiding in the discovery of potential molecular targets of drugs and the development of targeted therapy methods with higher efficacy and fewer side effects. Although there are various methods for drug-target interaction (DTI) prediction based on heterogeneous information networks, these methods face challenges in capturing the fundamental interaction between drugs and targets and ensuring the interpretability of the model. Moreover, they require meta-paths to be constructed manually or extensive feature engineering (prior knowledge), whereas graph generation can fuse information more flexibly without meta-path selection. RESULTS We propose a causal enhanced method for drug-target interaction (CE-DTI) prediction that integrates graph generation and multi-source information fusion. First, we represent drugs and targets by modeling the fusion of their multi-source information through automatic graph generation. Once drugs and targets are combined, a network of drug-target pairs is constructed, transforming the prediction of drug-target interactions into a node classification problem. Specifically, the influence of surrounding nodes on the central node is separated into two groups: causal and non-causal variable nodes. Causal variable nodes significantly impact the central node's classification, while non-causal variable nodes do not. Causal invariance is then used to enhance the contrastive learning of the drug-target pairs network. Our method demonstrates excellent performance compared with other competitive benchmark methods across multiple datasets. At the same time, the experimental results also show that the causal enhancement strategy can explore the potential causal effects between drug-target pairs (DTPs) and discover new potential targets. Additionally, case studies demonstrate that this method can identify potential drug targets.
AVAILABILITY AND IMPLEMENTATION The source code of CE-DTI is available at: https://github.com/catly/CE-DTI.
Affiliation(s)
- Guanyu Qiao
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Guohua Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Yang Li
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
|
29
|
Sun D, Macedonia C, Chen Z, Chandrasekaran S, Najarian K, Zhou S, Cernak T, Ellingrod VL, Jagadish HV, Marini B, Pai M, Violi A, Rech JC, Wang S, Li Y, Athey B, Omenn GS. Can Machine Learning Overcome the 95% Failure Rate and Reality that Only 30% of Approved Cancer Drugs Meaningfully Extend Patient Survival? J Med Chem 2024; 67:16035-16055. [PMID: 39253942 DOI: 10.1021/acs.jmedchem.4c01684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Despite implementing hundreds of strategies, cancer drug development suffers from a 95% failure rate over 30 years, with only 30% of approved cancer drugs extending patient survival beyond 2.5 months. Adding more criteria without eliminating nonessential ones is impractical and may fall into the "survivorship bias" trap. Machine learning (ML) models may enhance efficiency by saving time and cost. Yet, they may not improve success rate without identifying the root causes of failure. We propose a "STAR-guided ML system" (structure-tissue/cell selectivity-activity relationship) to enhance success rate and efficiency by addressing three overlooked interdependent factors: potency/specificity to the on/off-targets determining efficacy in tumors at clinical doses, on/off-target-driven tissue/cell selectivity influencing adverse effects in the normal organs at clinical doses, and optimal clinical doses balancing efficacy/safety as determined by potency/specificity and tissue/cell selectivity. STAR-guided ML models can directly predict clinical dose/efficacy/safety from five features to design/select the best drugs, enhancing success and efficiency of cancer drug development.
Affiliation(s)
| | | | - Zhigang Chen
- LabBotics.ai, Palo Alto, California 94303, United States
| | | | | | - Simon Zhou
- Aurinia Pharmaceuticals Inc., Rockville, Maryland 20850, United States
| | | | | | | | | | | | | | | | | | - Yan Li
- Translational Medicine and Clinical Pharmacology, Bristol Myers Squibb, Summit, New Jersey 07901, United States
|
30
|
Guichaoua G, Pinel P, Hoffmann B, Azencott CA, Stoven V. Drug-Target Interactions Prediction at Scale: The Komet Algorithm with the LCIdb Dataset. J Chem Inf Model 2024; 64:6938-6956. [PMID: 39237105 PMCID: PMC11423346 DOI: 10.1021/acs.jcim.4c00422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2024]
Abstract
Drug-target interactions (DTIs) prediction algorithms are used at various stages of the drug discovery process. In this context, specific problems such as deorphanization of a new therapeutic target or target identification of a drug candidate arising from phenotypic screens require large-scale predictions across the protein and molecule spaces. DTI prediction heavily relies on supervised learning algorithms that use known DTIs to learn associations between molecule and protein features, allowing for the prediction of new interactions based on learned patterns. The algorithms must be broadly applicable to enable reliable predictions, even in regions of the protein or molecule spaces where data may be scarce. In this paper, we address two key challenges to fulfill these goals: building large, high-quality training datasets and designing prediction methods that can scale, in order to be trained on such large datasets. First, we introduce LCIdb, a curated, large-sized dataset of DTIs, offering extensive coverage of both the molecule and druggable protein spaces. Notably, LCIdb contains a much higher number of molecules than publicly available benchmarks, expanding coverage of the molecule space. Second, we propose Komet (Kronecker Optimized METhod), a DTI prediction pipeline designed for scalability without compromising performance. Komet leverages a three-step framework, incorporating efficient computation choices tailored for large datasets and involving the Nyström approximation. Specifically, Komet employs a Kronecker interaction module for (molecule, protein) pairs, which efficiently captures determinants in DTIs, and whose structure allows for reduced computational complexity and quasi-Newton optimization, ensuring that the model can handle large training sets, without compromising on performance. Our method is implemented in open-source software, leveraging GPU parallel computation for efficiency. 
We demonstrate the value of our pipeline on various datasets, showing that Komet displays superior scalability and prediction performance compared to state-of-the-art deep learning approaches. Additionally, we illustrate the generalization properties of Komet by showing its performance on an external dataset, and on the publicly available LH benchmark designed for scaffold hopping problems. Komet is available open source at https://komet.readthedocs.io and all datasets, including LCIdb, can be found at https://zenodo.org/records/10731712.
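The abstract's point that a Kronecker interaction module allows reduced computational complexity can be illustrated with a general identity: the inner product between a Kronecker-product feature vector and a weight vector equals a bilinear form, so the product never has to be materialized. The sketch below is not Komet's implementation; the feature vectors `x_mol` and `x_prot`, the weight matrix `W`, and all function names are hypothetical and chosen only for illustration.

```python
import random

def kron(u, v):
    """Explicit Kronecker product of two vectors (length len(u) * len(v))."""
    return [ui * vj for ui in u for vj in v]

def score_explicit(x_mol, x_prot, W):
    """Score via the explicit pair feature: <kron(x_mol, x_prot), vec(W)>."""
    w_flat = [wij for row in W for wij in row]  # row-major flattening of W
    return sum(k * w for k, w in zip(kron(x_mol, x_prot), w_flat))

def score_bilinear(x_mol, x_prot, W):
    """Same score as the bilinear form x_mol^T W x_prot, without the product."""
    w_times_prot = [sum(wij * pj for wij, pj in zip(row, x_prot)) for row in W]
    return sum(mi * wi for mi, wi in zip(x_mol, w_times_prot))

# Hypothetical low-dimensional features for one (molecule, protein) pair.
random.seed(0)
d_mol, d_prot = 4, 3
x_mol = [random.gauss(0, 1) for _ in range(d_mol)]
x_prot = [random.gauss(0, 1) for _ in range(d_prot)]
W = [[random.gauss(0, 1) for _ in range(d_prot)] for _ in range(d_mol)]

s_explicit = score_explicit(x_mol, x_prot, W)
s_bilinear = score_bilinear(x_mol, x_prot, W)
```

Both functions return the same score up to floating-point rounding. The explicit route builds a vector of length `d_mol * d_prot` for every pair, while the bilinear route works with the two factors separately, which is the kind of saving that matters when the number of pairs is large.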
Affiliation(s)
- Gwenn Guichaoua
- Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
- Institut Curie, Université PSL, 75005 Paris, France
- INSERM U900, 75005 Paris, France
| | - Philippe Pinel
- Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
- Institut Curie, Université PSL, 75005 Paris, France
- INSERM U900, 75005 Paris, France
- Iktos SAS, 75017 Paris, France
| | | | - Chloé-Agathe Azencott
- Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
- Institut Curie, Université PSL, 75005 Paris, France
- INSERM U900, 75005 Paris, France
| | - Véronique Stoven
- Center for Computational Biology (CBIO), Mines Paris-PSL, 75006 Paris, France
- Institut Curie, Université PSL, 75005 Paris, France
- INSERM U900, 75005 Paris, France
|
31
|
Jiang X, Tan L, Zou Q. DGCL: dual-graph neural networks contrastive learning for molecular property prediction. Brief Bioinform 2024; 25:bbae474. [PMID: 39331017 PMCID: PMC11428321 DOI: 10.1093/bib/bbae474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 08/16/2024] [Accepted: 09/13/2024] [Indexed: 09/28/2024] Open
Abstract
In this paper, we propose DGCL, a dual-graph neural network (GNN)-based contrastive learning (CL) method integrated with mixed molecular fingerprints (MFPs) for molecular property prediction. The DGCL-MFP method contains two stages. In the first pretraining stage, we utilize two different GNNs as encoders to construct CL, rather than using the method of generating enhanced graphs as before. Precisely, DGCL aggregates and enhances features of the same molecule by the Graph Isomorphism Network and the Graph Attention Network, with representations extracted from the same molecule serving as positive samples, and others marked as negative ones. In the downstream tasks training stage, features extracted from the two above pretrained graph networks and the meticulously selected MFPs are concatenated to predict molecular properties. Our experiments show that DGCL enhances the performance of existing GNNs by achieving or surpassing the state-of-the-art self-supervised learning models on multiple benchmark datasets. Specifically, DGCL increases the average performance of classification tasks by 3.73% and improves the performance of regression task Lipo by 0.126. Through ablation studies, we validate the impact of network fusion strategies and MFPs on model performance. In addition, DGCL's predictive performance is further enhanced by weighting different molecular features based on the Extended Connectivity Fingerprint. The code and datasets of DGCL will be made publicly available.
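The pretraining idea in this abstract, where two encoders embed the same molecule and matching embeddings are treated as positives while all other pairings are negatives, can be sketched with a generic contrastive objective. The abstract does not name the loss, so the InfoNCE-style formulation, the cosine similarity, the `temperature` value, and the toy embeddings below are all assumptions made for illustration rather than details of DGCL.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(z_a, z_b, temperature=0.5):
    """InfoNCE-style loss: z_a[i] and z_b[i] embed molecule i with encoders A and B.

    For each molecule i, the pair (z_a[i], z_b[i]) is the positive and
    (z_a[i], z_b[j]) for j != i are the negatives.
    """
    n = len(z_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(z_a[i], z_b[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n

# Toy embeddings (hypothetical): two molecules, two-dimensional embeddings.
encoder_a = [[1.0, 0.0], [0.0, 1.0]]
agreeing_b = [[1.0, 0.0], [0.0, 1.0]]     # encoder B agrees with encoder A
disagreeing_b = [[0.0, 1.0], [1.0, 0.0]]  # encoder B confuses the two molecules

loss_agree = info_nce(encoder_a, agreeing_b)
loss_disagree = info_nce(encoder_a, disagreeing_b)
```

The loss is lower when the two encoders place the same molecule close together and different molecules apart, which is the behavior a pretraining stage of this kind rewards.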
Affiliation(s)
- Xiuyu Jiang
- School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China
| | - Liqin Tan
- School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China
| | - Qingsong Zou
- School of Computer Science and Engineering, Sun Yat-sen University, Waihuan East Street, Guangzhou 510006, China
|
32
|
Theisen R, Wang T, Ravikumar B, Rahman R, Cichońska A. Leveraging multiple data types for improved compound-kinase bioactivity prediction. Nat Commun 2024; 15:7596. [PMID: 39217147 PMCID: PMC11365929 DOI: 10.1038/s41467-024-52055-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2024] [Accepted: 08/21/2024] [Indexed: 09/04/2024] Open
Abstract
Machine learning provides efficient ways to map compound-kinase interactions. However, diverse bioactivity data types, including single-dose and multi-dose-response assay results, present challenges. Traditional models utilize only multi-dose data, overlooking information contained in single-dose measurements. Here, we propose a machine learning methodology for compound-kinase activity prediction that leverages both single-dose and dose-response data. We demonstrate that our two-stage approach yields accurate activity predictions and significantly improves model performance compared to training solely on dose-response labels. This superior performance is consistent across five diverse machine learning methods. Using the best performing model, we carried out extensive experimental profiling on a total of 347 selected compound-kinase pairs, achieving a high hit rate of 40% and a negative predictive value of 78%. We show that these rates can be improved further by incorporating model uncertainty estimates into the compound selection process. By integrating multiple activity data types, we demonstrate that our approach holds promise for facilitating the development of training activity datasets in a more efficient and cost-effective way.
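For readers less familiar with the two validation metrics quoted above, the following sketch shows how a hit rate and a negative predictive value are commonly computed from experimental confirmation counts. It assumes that "hit rate" means the fraction of predicted-active pairs confirmed active, which the abstract does not spell out; the function name and the counts are hypothetical and are not taken from the study.

```python
def hit_rate_and_npv(tp, fp, tn, fn):
    """Return (hit rate, negative predictive value) from confirmation counts.

    tp: predicted active, confirmed active
    fp: predicted active, confirmed inactive
    tn: predicted inactive, confirmed inactive
    fn: predicted inactive, confirmed active
    """
    predicted_active = tp + fp
    predicted_inactive = tn + fn
    if predicted_active == 0 or predicted_inactive == 0:
        raise ValueError("need at least one predicted-active and one predicted-inactive pair")
    return tp / predicted_active, tn / predicted_inactive

# Hypothetical counts chosen only to make the arithmetic easy to follow.
hit_rate, npv = hit_rate_and_npv(tp=3, fp=7, tn=9, fn=1)
print(f"hit rate = {hit_rate:.2f}, NPV = {npv:.2f}")  # hit rate = 0.30, NPV = 0.90
```

Reporting both numbers matters because they describe different selections: the hit rate is about the pairs the model recommended testing, and the NPV is about the pairs it recommended skipping.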
Affiliation(s)
- Ryan Theisen
- Harmonic Discovery Inc., New York City, NY, USA.
|
33
|
Cao A, Zhang L, Bu Y, Sun D. Machine Learning Prediction of On/Off Target-driven Clinical Adverse Events. Pharm Res 2024; 41:1649-1658. [PMID: 39095534 DOI: 10.1007/s11095-024-03742-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 07/06/2024] [Indexed: 08/04/2024]
Abstract
OBJECTIVE Currently, 90% of clinical drug development fails, where 30% of these failures are due to clinical toxicity. The current extensive animal toxicity studies are not predictive of clinical adverse events (AEs) at clinical doses, while current computation models only consider very few factors with limited success in clinical toxicity prediction. We aimed to address these issues by developing a machine learning (ML) model to directly predict clinical AEs. METHODS Using a dataset with 759 FDA-approved drugs with known AEs, we first adapted the ConPLex ML model to predict IC50 values of these FDA-approved drugs against their on-target and off-target binding among 477 protein targets. Subsequently, we constructed a new ML model to predict clinical AEs using IC50 values of 759 drugs' primary on-target and off-target effects along with tissue-specific protein expression profiles. RESULTS The adapted ConPLex model predicted drug-target interactions for both on- and off-target effects, as shown by co-localization of the 6 small molecule kinase inhibitors with their respective kinases. The coupled ML models demonstrated good predictive capability of clinical AEs, with accuracy over 75%. CONCLUSIONS Our approach provides a new insight into the mechanistic understanding of in vivo drug toxicity in relationship with drug on-/off-target interactions. The coupled ML models, once validated with larger datasets, may offer advantages to directly predict clinical AEs using in vitro/ex vivo and preclinical data, which will help to reduce drug development failure due to clinical toxicity.
Affiliation(s)
- Albert Cao
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Michigan, Ann Arbor, MI, 48109, United States
- Centennial High School, Ellicott City, MD, 21042, United States
| | - Luchen Zhang
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Michigan, Ann Arbor, MI, 48109, United States
| | - Yingzi Bu
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Michigan, Ann Arbor, MI, 48109, United States
- Michigan Institute for Computational Discovery & Engineering, University of Michigan, Ann Arbor, MI, 48109, United States
| | - Duxin Sun
- Department of Pharmaceutical Sciences, College of Pharmacy, University of Michigan, Ann Arbor, MI, 48109, United States.
- Duxin Sun, 1600 Huron Parkway, North Campus Research Complex, Building 520, Ann Arbor, MI, 48109, United States.
|
34
|
Cesnik A, Schaffer LV, Gaur I, Jain M, Ideker T, Lundberg E. Mapping the Multiscale Proteomic Organization of Cellular and Disease Phenotypes. Annu Rev Biomed Data Sci 2024; 7:369-389. [PMID: 38748859 PMCID: PMC11343683 DOI: 10.1146/annurev-biodatasci-102423-113534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/23/2024]
Abstract
While the primary sequences of human proteins have been cataloged for over a decade, determining how these are organized into a dynamic collection of multiprotein assemblies, with structures and functions spanning biological scales, is an ongoing venture. Systematic and data-driven analyses of these higher-order structures are emerging, facilitating the discovery and understanding of cellular phenotypes. At present, knowledge of protein localization and function has been primarily derived from manual annotation and curation in resources such as the Gene Ontology, which are biased toward richly annotated genes in the literature. Here, we envision a future powered by data-driven mapping of protein assemblies. These maps can capture and decode cellular functions through the integration of protein expression, localization, and interaction data across length scales and timescales. In this review, we focus on progress toward constructing integrated cell maps that accelerate the life sciences and translational research.
Affiliation(s)
- Anthony Cesnik
- Department of Bioengineering, Stanford University, Stanford, California, USA
| | - Leah V Schaffer
- Department of Medicine, University of California San Diego, La Jolla, California, USA
| | - Ishan Gaur
- Department of Bioengineering, Stanford University, Stanford, California, USA
| | - Mayank Jain
- Department of Medicine, University of California San Diego, La Jolla, California, USA
| | - Trey Ideker
- Departments of Computer Science and Engineering and Bioengineering, University of California San Diego, La Jolla, California, USA
- Department of Medicine, University of California San Diego, La Jolla, California, USA
| | - Emma Lundberg
- Chan Zuckerberg Biohub, San Francisco, California, USA
- Science for Life Laboratory, School of Engineering Sciences in Chemistry, Biotechnology and Health, KTH Royal Institute of Technology, Stockholm, Sweden
- Department of Pathology, Stanford University, Palo Alto, California, USA
- Department of Bioengineering, Stanford University, Stanford, California, USA
| |
Collapse
|
35
|
Piras A, Chenghao S, Sebek M, Ispirova G, Menichetti G. CPIExtract: A software package to collect and harmonize small molecule and protein interactions. bioRxiv 2024:2024.07.03.601957. [PMID: 39005430 PMCID: PMC11245042 DOI: 10.1101/2024.07.03.601957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
The binding interactions between small molecules and proteins are the basis of cellular functions. Yet, experimental data available regarding compound-protein interaction is not harmonized into a single entity but rather scattered across multiple institutions, each maintaining databases with different formats. Extracting information from these multiple sources remains challenging due to data heterogeneity. Here, we present CPIExtract (Compound-Protein Interaction Extract), a tool to interactively extract experimental binding interaction data from multiple databases, perform filtering, and harmonize the resulting information, thus providing a gain of compound-protein interaction data. When compared to a single source, DrugBank, we show that it can collect more than 10 times the amount of annotations. The end-user can apply custom filtering to the aggregated output data and save it in any generic tabular file suitable for further downstream tasks such as network medicine analyses for drug repurposing and cross-validation of deep learning models.
Affiliation(s)
- Andrea Piras
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133, Milan, Italy
- Shi Chenghao
  - Network Science Institute, Northeastern University, 360 Huntington Ave, 02115, MA, USA
- Michael Sebek
  - Network Science Institute, Northeastern University, 360 Huntington Ave, 02115, MA, USA
- Gordana Ispirova
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, 181 Longwood Ave, 02115, MA, USA
- Giulia Menichetti
  - Network Science Institute, Northeastern University, 360 Huntington Ave, 02115, MA, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, 181 Longwood Ave, 02115, MA, USA
  - Harvard Data Science Initiative, Harvard University, 114 Western Avenue, 02134, MA, USA

36
Sledzieski S, Kshirsagar M, Baek M, Dodhia R, Lavista Ferres J, Berger B. Democratizing protein language models with parameter-efficient fine-tuning. Proc Natl Acad Sci U S A 2024; 121:e2405840121. [PMID: 38900798 PMCID: PMC11214071 DOI: 10.1073/pnas.2405840121] [Received: 03/20/2024] [Accepted: 05/09/2024] [Indexed: 06/22/2024] Open
Abstract
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt the model to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in the size of models, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics through leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that for the PPI prediction task, training only the classification head also remains competitive with full FT, using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation to groups with limited computational resources.
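The low-rank update that LoRA relies on can be sketched in a few lines. This is a minimal pure-Python illustration and not the authors' code (their implementation is in the repository linked in the abstract). It assumes the standard LoRA formulation y = Wx + (alpha / r) B(Ax), with W frozen and only A and B trained. The function names, toy matrices, and the 1024 x 1024 layer size are illustrative assumptions.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W: frozen weight matrix, shape (d_out, d_in)
    A: trainable down-projection, shape (r, d_in)
    B: trainable up-projection, shape (d_out, r)
    """
    r = len(A)  # the LoRA rank
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Compare trainable parameter counts for one linear layer."""
    return {"full": d_in * d_out, "lora": r * (d_in + d_out)}


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]        # rank r = 1
    B = [[0.0], [0.0]]      # zero-initialised, so the update starts at zero
    print(lora_forward(W, A, B, [2.0, 3.0], alpha=1.0))
    print(trainable_params(1024, 1024, 8))
```

With B initialised to zero the adapted layer reproduces the frozen layer exactly, which is why training can start from the pretrained behaviour. For the hypothetical 1024 x 1024 layer at rank 8, the trainable count drops from 1,048,576 to 16,384.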
Affiliation(s)
- Samuel Sledzieski
  - AI for Good Research Lab, Microsoft Corporation, Redmond, WA 98052
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Minkyung Baek
  - Department of Biological Sciences, Seoul National University, Seoul 08826, South Korea
- Rahul Dodhia
  - AI for Good Research Lab, Microsoft Corporation, Redmond, WA 98052
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139

37
Zhou H, Skolnick J. Utility of the Morgan Fingerprint in Structure-Based Virtual Ligand Screening. J Phys Chem B 2024; 128:5363-5370. [PMID: 38783525 PMCID: PMC11163432 DOI: 10.1021/acs.jpcb.4c01875] [Received: 03/21/2024] [Revised: 05/10/2024] [Accepted: 05/14/2024] [Indexed: 05/25/2024]
Abstract
In modern drug discovery, virtual ligand screening (VLS) is frequently applied to identify possible hits before experimental testing and refinement due to its cost-effective nature for large compound libraries. For decades, efforts have been devoted to developing VLS methods with high accuracy. These include the state-of-the-art FINDSITE suite of approaches FINDSITEcomb2.0, FRAGSITE, and FRAGSITE2 and the meta version FRAGSITEcomb that were developed in our lab. These methods combine ligand homology modeling (LHM), traditional ligand similarity methods, and more recently machine learning approaches to rank ligands and have proven to be superior to most recent deep learning and large language model-based approaches. Here, we describe further improvements to our previous best methods by combining the Morgan fingerprint (MF) with the originally used PubChem fingerprint and FP2 fingerprint. We then benchmarked FINDSITEcomb2.0M, FRAGSITEM, FRAGSITE2M, and the composite meta-approach FRAGSITEcombM. On the 102 target DUD-E set, the 1% enrichment factor (EF1%) and area under the precision-recall curve (AUPR) of FRAGSITEcomb increased from 42.0/0.59 to 47.6/0.72. This 0.72 AUPR is significantly better than that of the state-of-the-art deep learning-based method DenseFS's AUPR of 0.443. An independent test on the 81 targets DEKOIS2.0 set shows that EF1%/AUPR increases from 18.3/0.520 to 23.1/0.683. An ablation investigation shows that the MF contributes to most of the improvement of all four approaches. Thus, the MF is a useful addition to structure-based VLS.
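The 1% enrichment factor (EF1%) quoted in this abstract is a ranking metric: the hit rate among the top-scored 1% of compounds divided by the hit rate in the whole library. The sketch below uses that common definition and is not the authors' benchmarking code. The function name, the rounding convention for the size of the top fraction, and the toy data are assumptions.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a given top fraction of a ranked library.

    scores: predicted scores, higher means more likely active
    labels: 1 for a known active, 0 for a decoy
    """
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Assumption: the top-fraction size is rounded, with a minimum of one compound.
    n_top = max(1, round(n * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits_top = sum(label for _, label in ranked[:n_top])
    return (hits_top / n_top) / (total_actives / n)


if __name__ == "__main__":
    # Toy library: 200 compounds, 4 actives, 2 of them ranked in the top 1%.
    scores = [float(200 - i) for i in range(200)]
    labels = [1, 1] + [0] * 196 + [1, 1]
    print(enrichment_factor(scores, labels))
```

The maximum attainable EF1% depends on the active-to-decoy ratio of the benchmark, so values are only comparable within the same data set.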
Affiliation(s)
- Hongyi Zhou
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia 30332, United States

38
Rao J, Xie J, Yuan Q, Liu D, Wang Z, Lu Y, Zheng S, Yang Y. A variational expectation-maximization framework for balanced multi-scale learning of protein and drug interactions. Nat Commun 2024; 15:4476. [PMID: 38796523 PMCID: PMC11530528 DOI: 10.1038/s41467-024-48801-4] [Received: 12/01/2023] [Accepted: 05/14/2024] [Indexed: 05/28/2024] Open
Abstract
Protein functions are characterized by interactions with proteins, drugs, and other biomolecules. Understanding these interactions is essential for deciphering the molecular mechanisms underlying biological processes and developing new therapeutic strategies. Current computational methods mostly predict interactions based on either molecular network or structural information, without integrating them within a unified multi-scale framework. While a few multi-view learning methods are devoted to fusing the multi-scale information, these methods tend to rely intensively on a single scale while under-fitting the others, likely attributed to the imbalanced nature and inherent greediness of multi-scale learning. To alleviate the optimization imbalance, we present MUSE, a multi-scale representation learning framework based on a variant expectation maximization to optimize different scales in an alternating procedure over multiple iterations. This strategy efficiently fuses multi-scale information between atomic structure and molecular network scale through mutual supervision and iterative optimization. MUSE outperforms the current state-of-the-art models not only in molecular interaction (protein-protein, drug-protein, and drug-drug) tasks but also in protein interface prediction at the atomic structure scale. More importantly, the multi-scale learning framework shows potential for extension to other scales of computational drug discovery.
Affiliation(s)
- Jiahua Rao
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Jiancong Xie
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Qianmu Yuan
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Deqin Liu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Zhen Wang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yutong Lu
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Shuangjia Zheng
  - Global Institute of Future Technology, Shanghai Jiao Tong University, Shanghai, China
- Yuedong Yang
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Sun Yat-sen University, Guangzhou, China
  - State Key Laboratory of Oncology in South China, Sun Yat-sen University, Guangzhou, China

39
Chen H, Bajorath J. Generative design of compounds with desired potency from target protein sequences using a multimodal biochemical language model. J Cheminform 2024; 16:55. [PMID: 38778425 PMCID: PMC11110441 DOI: 10.1186/s13321-024-00852-x] [Received: 12/06/2023] [Accepted: 05/09/2024] [Indexed: 05/25/2024] Open
Abstract
Deep learning models adapted from natural language processing offer new opportunities for the prediction of active compounds via machine translation of sequential molecular data representations. For example, chemical language models are often derived for compound string transformation. Moreover, given the principal versatility of language models for translating different types of textual representations, off-the-beaten-path design tasks might be explored. In this work, we have investigated generative design of active compounds with desired potency from target sequence embeddings, representing a rather provoking prediction task. Therefore, a dual-component conditional language model was designed for learning from multimodal data. It comprised a protein language model component for generating target sequence embeddings and a conditional transformer for predicting new active compounds with desired potency. To this end, the designated "biochemical" language model was trained to learn mappings of combined protein sequence and compound potency value embeddings to corresponding compounds, fine-tuned on individual activity classes not encountered during model derivation, and evaluated on compound test sets that were structurally distinct from training sets. The biochemical language model correctly reproduced known compounds with different potency for all activity classes, providing proof-of-concept for the approach. Furthermore, the conditional model consistently reproduced larger numbers of known compounds as well as more potent compounds than an unconditional model, revealing a substantial effect of potency conditioning. The biochemical language model also generated structurally diverse candidate compounds departing from both fine-tuning and test compounds. Overall, generative compound design based on potency value-conditioned target sequence embeddings yielded promising results, rendering the approach attractive for further exploration and practical applications. 
SCIENTIFIC CONTRIBUTION: The approach introduced herein combines protein language model and chemical language model components, representing an advanced architecture, and is the first methodology for predicting compounds with desired potency from conditioned protein sequence data.
Affiliation(s)
- Hengwei Chen
  - Department of Life Science Informatics and Data Science, B-IT, Lamarr Institute for Machine Learning and Artificial Intelligence, LIMES Program Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, Lamarr Institute for Machine Learning and Artificial Intelligence, LIMES Program Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany

40
Wei H, Gao L, Wu S, Jiang Y, Liu B. DiSMVC: a multi-view graph collaborative learning framework for measuring disease similarity. Bioinformatics 2024; 40:btae306. [PMID: 38715444 PMCID: PMC11256965 DOI: 10.1093/bioinformatics/btae306] [Received: 02/28/2024] [Revised: 04/19/2024] [Accepted: 05/05/2024] [Indexed: 05/30/2024] Open
Abstract
MOTIVATION: Exploring potential associations between diseases can help in understanding pathological mechanisms of diseases and facilitating the discovery of candidate biomarkers and drug targets, thereby promoting disease diagnosis and treatment. Some computational methods have been proposed for measuring disease similarity. However, these methods describe diseases without considering their latent multi-molecule regulation and valuable supervision signal, resulting in limited biological interpretability and efficiency to capture association patterns.
RESULTS: In this study, we propose a new computational method named DiSMVC. Different from existing predictors, DiSMVC designs a supervised graph collaborative framework to measure disease similarity. Multiple bio-entity associations related to genes and miRNAs are integrated via cross-view graph contrastive learning to extract informative disease representation, and then association pattern joint learning is implemented to compute disease similarity by incorporating phenotype-annotated disease associations. The experimental results show that DiSMVC can draw discriminative characteristics for disease pairs, and outperform other state-of-the-art methods. As a result, DiSMVC is a promising method for predicting disease associations with molecular interpretability.
AVAILABILITY AND IMPLEMENTATION: Datasets and source codes are available at https://github.com/Biohang/DiSMVC.
Affiliation(s)
- Hang Wei
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
- Lin Gao
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
- Shuai Wu
  - School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi 710126, China
- Yina Jiang
  - Department of Basic Medicine, Shaanxi University of Chinese Medicine, Xianyang, Shaanxi 712046, China
- Bin Liu
  - Faculty of Engineering, Shenzhen MSU-BIT University, Shenzhen, Guangdong 518172, China
  - School of Computer Science and Technology, Beijing Institute of Technology, Beijing, 100081, China

41
Ding K, Luo J, Luo Y. Leveraging conformal prediction to annotate enzyme function space with limited false positives. PLoS Comput Biol 2024; 20:e1012135. [PMID: 38809942 PMCID: PMC11164347 DOI: 10.1371/journal.pcbi.1012135] [Received: 09/02/2023] [Revised: 06/10/2024] [Accepted: 05/03/2024] [Indexed: 05/31/2024] Open
Abstract
Machine learning (ML) is increasingly being used to guide biological discovery in biomedicine such as prioritizing promising small molecules in drug discovery. In those applications, ML models are used to predict the properties of biological systems, and researchers use these predictions to prioritize candidates as new biological hypotheses for downstream experimental validations. However, when applied to unseen situations, these models can be overconfident and produce a large number of false positives. One solution to address this issue is to quantify the model's prediction uncertainty and provide a set of hypotheses with a controlled false discovery rate (FDR) pre-specified by researchers. We propose CPEC, an ML framework for FDR-controlled biological discovery. We demonstrate its effectiveness using enzyme function annotation as a case study, simulating the discovery process of identifying the functions of less-characterized enzymes. CPEC integrates a deep learning model with a statistical tool known as conformal prediction, providing accurate and FDR-controlled function predictions for a given protein enzyme. Conformal prediction provides rigorous statistical guarantees to the predictive model and ensures that the expected FDR will not exceed a user-specified level with high probability. Evaluation experiments show that CPEC achieves reliable FDR control, better or comparable prediction performance at a lower FDR than existing methods, and accurate predictions for enzymes under-represented in the training data. We expect CPEC to be a useful tool for biological discovery applications where a high yield rate in validation experiments is desired but the experimental budget is limited.
Affiliation(s)
- Kerr Ding
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Jiaqi Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Yunan Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America

42
Ovek D, Keskin O, Gursoy A. ProInterVal: Validation of Protein-Protein Interfaces through Learned Interface Representations. J Chem Inf Model 2024; 64:2979-2987. [PMID: 38526504 PMCID: PMC11040718 DOI: 10.1021/acs.jcim.3c01788] [Received: 11/06/2023] [Revised: 02/21/2024] [Accepted: 02/22/2024] [Indexed: 03/26/2024]
Abstract
Proteins are vital components of the biological world and serve a multitude of functions. They interact with other molecules through their interfaces and participate in crucial cellular processes. Disruption of these interactions can have negative effects on organisms, highlighting the importance of studying protein-protein interfaces for developing targeted therapies for diseases. Therefore, the development of a reliable method for investigating protein-protein interactions is of paramount importance. In this work, we present an approach for validating protein-protein interfaces using learned interface representations. The approach involves using a graph-based contrastive autoencoder architecture and a transformer to learn representations of protein-protein interaction interfaces from unlabeled data and then validating them through learned representations with a graph neural network. Our method achieves an accuracy of 0.91 for the test set, outperforming existing GNN-based methods. We demonstrate the effectiveness of our approach on a benchmark data set and show that it provides a promising solution for validating protein-protein interfaces.
Affiliation(s)
- Damla Ovek
  - KUIS AI Center, Koç University, Istanbul 34450, Turkey
  - Computer Engineering, Koç University, Istanbul 34450, Turkey
- Ozlem Keskin
  - Chemical and Biological Engineering, Koç University, Istanbul 34450, Turkey
- Attila Gursoy
  - Computer Engineering, Koç University, Istanbul 34450, Turkey

43
Ozalp MK, Vignaux PA, Puhl AC, Lane TR, Urbina F, Ekins S. Sequential Contrastive and Deep Learning Models to Identify Selective Butyrylcholinesterase Inhibitors. J Chem Inf Model 2024; 64:3161-3172. [PMID: 38532612 PMCID: PMC11331448 DOI: 10.1021/acs.jcim.4c00397] [Indexed: 03/28/2024]
Abstract
Butyrylcholinesterase (BChE) is a target of interest in late-stage Alzheimer's Disease (AD) where selective BChE inhibitors (BIs) may offer symptomatic treatment without the harsh side effects of acetylcholinesterase (AChE) inhibitors. In this study, we explore multiple machine learning strategies to identify BIs in silico, optimizing for precision over all other metrics. We compare state-of-the-art supervised contrastive learning (CL) with deep learning (DL) and Random Forest (RF) machine learning, across single and sequential modeling configurations, to identify the best models for BChE selectivity. We used these models to virtually screen a vendor library of 5 million compounds for BIs and tested 20 of these compounds in vitro. Seven of the 20 compounds displayed selectivity for BChE over AChE, reflecting a hit rate of 35% for our model predictions, suggesting a highly efficient strategy for modeling selective inhibition.
Affiliation(s)
- Ana C. Puhl
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Thomas R. Lane
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Fabio Urbina
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA

44
Meimetis N, Lauffenburger DA, Nilsson A. Inference of drug off-target effects on cellular signaling using interactome-based deep learning. iScience 2024; 27:109509. [PMID: 38591003 PMCID: PMC11000001 DOI: 10.1016/j.isci.2024.109509] [Received: 11/10/2023] [Revised: 02/04/2024] [Accepted: 03/13/2024] [Indexed: 04/10/2024] Open
Abstract
Many diseases emerge from dysregulated cellular signaling, and drugs are often designed to target specific signaling proteins. Off-target effects are, however, common and may ultimately result in failed clinical trials. Here we develop a computer model of the cell's transcriptional response to drugs for improved understanding of their mechanisms of action. The model is based on ensembles of artificial neural networks and simultaneously infers drug-target interactions and their downstream effects on intracellular signaling. With this, it predicts transcription factors' activities, while recovering known drug-target interactions and inferring many new ones, which we validate with an independent dataset. As a case study, we analyze the effects of the drug Lestaurtinib on downstream signaling. Alongside its intended target, FLT3, the model predicts an inhibition of CDK2 that enhances the downregulation of the cell cycle-critical transcription factor FOXM1. Our approach can therefore enhance our understanding of drug signaling for therapeutic design.
Affiliation(s)
- Nikolaos Meimetis
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Douglas A. Lauffenburger
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Avlant Nilsson
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
  - Department of Cell and Molecular Biology, SciLifeLab, Karolinska Institutet, Stockholm, Sweden
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, SE 41296, Sweden

45
Qiu Y, Cheng F. Artificial intelligence for drug discovery and development in Alzheimer's disease. Curr Opin Struct Biol 2024; 85:102776. [PMID: 38335558 DOI: 10.1016/j.sbi.2024.102776] [Received: 10/25/2023] [Revised: 12/29/2023] [Accepted: 01/15/2024] [Indexed: 02/12/2024]
Abstract
The complex molecular mechanism and pathophysiology of Alzheimer's disease (AD) limit the development of effective therapeutics or prevention strategies. Artificial Intelligence (AI)-guided drug discovery combined with genetics/multi-omics (genomics, epigenomics, transcriptomics, proteomics, and metabolomics) analysis contributes to the understanding of the pathophysiology and precision medicine of the disease, including AD and AD-related dementia. In this review, we summarize the AI-driven methodologies for AD-agnostic drug discovery and development, including de novo drug design, virtual screening, and prediction of drug-target interactions, all of which have shown potential. In particular, AI-based drug repurposing emerges as a compelling strategy to identify new indications for existing drugs for AD. We provide several emerging AD targets from human genetics and multi-omics findings and highlight recent AI-based technologies and their applications in drug discovery using AD as a prototypical example. In closing, we discuss future challenges and directions in AI-based drug discovery for AD and other neurodegenerative diseases.
Affiliation(s)
- Yunguang Qiu
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA. https://twitter.com/YunguangQiu
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA; Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA; Cleveland Clinic Genome Center, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA

46
Luo D, Liu D, Qu X, Dong L, Wang B. Enhancing Generalizability in Protein-Ligand Binding Affinity Prediction with Multimodal Contrastive Learning. J Chem Inf Model 2024; 64:1892-1906. [PMID: 38441880 DOI: 10.1021/acs.jcim.3c01961] [Indexed: 03/26/2024]
Abstract
Improving the generalization ability of scoring functions remains a major challenge in protein-ligand binding affinity prediction. Many machine learning methods are limited by their reliance on single-modal representations, hindering a comprehensive understanding of protein-ligand interactions. We introduce a graph-neural-network-based scoring function that utilizes a triplet contrastive learning loss to improve protein-ligand representations. In this model, three-dimensional complex representations and the fusion of two-dimensional ligand and coarse-grained pocket representations converge while distancing from decoy representations in latent space. After rigorous validation on multiple external data sets, our model exhibits commendable generalization capabilities compared to those of other deep learning-based scoring functions, marking it as a promising tool in the realm of drug discovery. In the future, our training framework can be extended to other biophysical- and biochemical-related problems such as protein-protein interaction and protein mutation prediction.
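The idea of pulling two views of a complex together while pushing decoys away can be illustrated with the standard triplet margin loss. The abstract does not give the exact loss used, so this sketch is an assumption: the Euclidean distance, the margin value, and the toy embeddings are illustrative choices and not the authors' implementation. Here the anchor stands for the three-dimensional complex representation, the positive for the fused two-dimensional ligand and pocket representation, and the negative for a decoy.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(anchor, positive) - d(anchor, negative) + margin).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = [0.0, 0.0]    # hypothetical 3D complex embedding
    positive = [0.0, 1.0]  # hypothetical fused 2D ligand + pocket embedding
    far_decoy = [3.0, 4.0]
    near_decoy = [0.0, 1.5]
    print(triplet_margin_loss(anchor, positive, far_decoy))
    print(triplet_margin_loss(anchor, positive, near_decoy))
```

A decoy that is already well separated contributes no gradient, so training effort concentrates on decoys that sit close to the true complex in latent space.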
Affiliation(s)
- Ding Luo
  - State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Dandan Liu
  - State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Xiaoyang Qu
  - School of Pharmacy and Medical Technology, Putian University, Putian 351100, P. R. China
  - Key Laboratory of Pharmaceutical Analysis and Laboratory Medicine (Putian University), Fujian Province University, Putian 351100, P. R. China
- Lina Dong
  - State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
- Binju Wang
  - State Key Laboratory of Physical Chemistry of Solid Surfaces and Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, College of Chemistry and Chemical Engineering, Xiamen University, Xiamen 361005, P. R. China
  - Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen 361005, P. R. China

47
Singh N, Singh AK. In Silico Structural Modeling and Binding Site Analysis of Cerebroside Sulfotransferase (CST): A Therapeutic Target for Developing Substrate Reduction Therapy for Metachromatic Leukodystrophy. ACS Omega 2024; 9:10748-10768. [PMID: 38463293 PMCID: PMC10918841 DOI: 10.1021/acsomega.3c09462] [Received: 12/03/2023] [Revised: 01/26/2024] [Accepted: 01/31/2024] [Indexed: 03/12/2024]
Abstract
Cerebroside sulfotransferase (CST) is emerging as an important therapeutic target to develop substrate reduction therapy (SRT) for metachromatic leukodystrophy (MLD), a rare neurodegenerative lysosomal storage disorder. MLD develops with progressive impairment and destruction of the myelin sheath as a result of accumulation of sulfatide around the nerve cells in the absence of its recycling mechanism with deficiency of arylsulfatase A (ARSA). Sulfatide is the product of the catalytic action of cerebroside sulfotransferase (CST), which needs to be regulated under pathophysiological conditions by inhibitor development. To carry out in silico-based preliminary drug screening or for designing new drug candidates, a high-quality three-dimensional (3D) structure is needed in the absence of an experimentally derived three-dimensional crystal structure. In this study, a 3D model of the protein was developed using a primary sequence with the SWISS-MODEL server by applying the top four GMQE score-based templates belonging to the sulfotransferase family as a reference. The 3D model of CST highlights the features of the protein responsible for its catalytic action. The CST model comprises five β-strands, which are flanked by ten α-helices from both sides as well as form the upside cover of the catalytic pocket of CST. CST has two catalytic regions: PAPS (-sulfo donor) binding and galactosylceramide (-sulfo acceptor) binding. The catalytic action of CST was proposed via molecular docking and molecular dynamics (MD) simulation with PAPS, galactosylceramide (GC), PAPS-galactosylceramide, and PAP. The stability of the model and its catalytic action were confirmed using molecular dynamics simulation-based trajectory analysis.
CST response against the inhibition potential of the experimentally reported competitive inhibitor of CST was confirmed via molecular docking and molecular dynamics simulation, which suggested the suitability of the CST model for future drug discovery to strengthen substrate reduction therapy for MLD.
Affiliation(s)
- Nivedita Singh
- Department of Dravyaguna, Faculty of Ayurveda, Institute of Medical Sciences, Banaras Hindu University, Varanasi 221005, Uttar Pradesh, India
| | - Anil Kumar Singh
- Department of Dravyaguna, Faculty of Ayurveda, Institute of Medical Sciences, Banaras Hindu University, Varanasi 221005, Uttar Pradesh, India
|
48
|
Smith MD, Quarles LD, Demerdash O, Smith JC. Drugging the entire human proteome: Are we there yet? Drug Discov Today 2024; 29:103891. [PMID: 38246414 DOI: 10.1016/j.drudis.2024.103891] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 01/12/2024] [Accepted: 01/16/2024] [Indexed: 01/23/2024]
Abstract
Each of the ∼20,000 proteins in the human proteome is a potential target for compounds that bind to it and modify its function. The 3D structures of most of these proteins are now available. Here, we discuss the prospects for using these structures to perform proteome-wide virtual HTS (VHTS). We compare physics-based (docking) and AI VHTS approaches, some of which are now being applied with large databases of compounds to thousands of targets. Although preliminary proteome-wide screens are now within our grasp, further methodological developments are expected to improve the accuracy of the results.
Affiliation(s)
- Micholas Dean Smith
- University of Tennessee/Oak Ridge National Laboratory Center for Molecular Biophysics, Oak Ridge, TN 37830, USA; Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, TN 37996, USA
| | - L Darryl Quarles
- Departments of Medicine, University of Tennessee Health Science Center, Memphis, TN 38163, USA; ORRxD LLC, 3404 Olney Drive, Durham, NC 27705, USA
| | - Omar Demerdash
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
| | - Jeremy C Smith
- University of Tennessee/Oak Ridge National Laboratory Center for Molecular Biophysics, Oak Ridge, TN 37830, USA; Department of Biochemistry and Cellular and Molecular Biology, University of Tennessee, Knoxville, TN 37996, USA.
|
49
|
Taujale R, Gravel N, Zhou Z, Yeung W, Kochut K, Kannan N. Informatic challenges and advances in illuminating the druggable proteome. Drug Discov Today 2024; 29:103894. [PMID: 38266979 DOI: 10.1016/j.drudis.2024.103894] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2023] [Revised: 01/08/2024] [Accepted: 01/17/2024] [Indexed: 01/26/2024]
Abstract
The understudied members of the druggable proteomes offer promising prospects for drug discovery efforts. While large-scale initiatives have generated valuable functional information on understudied members of the druggable gene families, translating this information into actionable knowledge for drug discovery requires specialized informatics tools and resources. Here, we review the unique informatics challenges and advances in annotating understudied members of the druggable proteome. We demonstrate the application of statistical evolutionary inference tools, knowledge graph mining approaches, and protein language models in illuminating understudied protein kinases, pseudokinases, and ion channels.
Affiliation(s)
- Rahil Taujale
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA
| | - Nathan Gravel
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
| | | | - Wayland Yeung
- Institute of Bioinformatics, University of Georgia, Athens, GA, USA
| | - Krystof Kochut
- School of Computing, University of Georgia, Athens, GA, USA
| | - Natarajan Kannan
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA; Institute of Bioinformatics, University of Georgia, Athens, GA, USA.
|
50
|
Scharf MM, Humphrys LJ, Berndt S, Di Pizio A, Lehmann J, Liebscher I, Nicoli A, Niv MY, Peri L, Schihada H, Schulte G. The dark sides of the GPCR tree - research progress on understudied GPCRs. Br J Pharmacol 2024. [PMID: 38339984 DOI: 10.1111/bph.16325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2023] [Revised: 11/24/2023] [Accepted: 01/08/2024] [Indexed: 02/12/2024] Open
Abstract
A large portion of the human GPCRome is still in the dark and understudied, consisting even of entire subfamilies of GPCRs such as odorant receptors, class A and C orphans, adhesion GPCRs, Frizzleds and taste receptors. However, it is undeniable that these GPCRs bring an untapped therapeutic potential that should be explored further. Open questions on these GPCRs span diverse topics such as deorphanisation, the development of tool compounds and tools for studying these GPCRs, as well as understanding basic signalling mechanisms. This review gives an overview of the current state of knowledge for each of the diverse subfamilies of understudied receptors regarding their physiological relevance, molecular mechanisms, endogenous ligands and pharmacological tools. Furthermore, it identifies some of the largest knowledge gaps that should be addressed in the foreseeable future and lists some general strategies that might be helpful in this process.
Affiliation(s)
- Magdalena M Scharf
- Karolinska Institutet, Dept. Physiology & Pharmacology, Sec. Receptor Biology & Signaling, Stockholm, Sweden
| | - Laura J Humphrys
- Institute of Pharmacy, University of Regensburg, Regensburg, Germany
| | - Sandra Berndt
- Rudolf Schönheimer Institute for Biochemistry, Molecular Biochemistry, University of Leipzig, Leipzig, Germany
| | - Antonella Di Pizio
- Leibniz Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany
- Chemoinformatics and Protein Modelling, Department of Molecular Life Science, School of Life Science, Technical University of Munich, Freising, Germany
| | - Juliane Lehmann
- Rudolf Schönheimer Institute for Biochemistry, Molecular Biochemistry, University of Leipzig, Leipzig, Germany
| | - Ines Liebscher
- Rudolf Schönheimer Institute for Biochemistry, Molecular Biochemistry, University of Leipzig, Leipzig, Germany
| | - Alessandro Nicoli
- Leibniz Institute for Food Systems Biology at the Technical University of Munich, Freising, Germany
- Chemoinformatics and Protein Modelling, Department of Molecular Life Science, School of Life Science, Technical University of Munich, Freising, Germany
| | - Masha Y Niv
- The Institute of Biochemistry, Food Science and Nutrition, Robert H. Smith Faculty of Agriculture, Food and Environment, The Hebrew University of Jerusalem, Rehovot, Israel
| | - Lior Peri
- The Institute of Biochemistry, Food Science and Nutrition, Robert H. Smith Faculty of Agriculture, Food and Environment, The Hebrew University of Jerusalem, Rehovot, Israel
| | - Hannes Schihada
- Institute of Pharmaceutical Chemistry, Philipps-University Marburg, Marburg, Germany
| | - Gunnar Schulte
- Karolinska Institutet, Dept. Physiology & Pharmacology, Sec. Receptor Biology & Signaling, Stockholm, Sweden
|