1
|
Li X, Yang Q, Xu L, Dong W, Luo G, Wang W, Dong S, Wang K, Xuan P, Zhang X, Gao X. DrugMGR: a deep bioactive molecule binding method to identify compounds targeting proteins. Bioinformatics 2024; 40:btae176. [PMID: 38561176 PMCID: PMC11015954 DOI: 10.1093/bioinformatics/btae176] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2023] [Revised: 03/04/2024] [Accepted: 03/28/2024] [Indexed: 04/04/2024] Open
Abstract
MOTIVATION Understanding the intermolecular interactions of ligand-target pairs is key to guiding the optimization of drug research on cancers, which can greatly mitigate overburden workloads for wet labs. Several improved computational methods have been introduced and exhibit promising performance for these identification tasks, but some pitfalls restrict their practical applications: (i) first, existing methods do not sufficiently consider how multigranular molecule representations influence interaction patterns between proteins and compounds; and (ii) second, existing methods seldom explicitly model the binding sites when an interaction occurs to enable better prediction and interpretation, which may lead to unexpected obstacles to biological researchers. RESULTS To address these issues, we here present DrugMGR, a deep multigranular drug representation model capable of predicting binding affinities and regions for each ligand-target pair. We conduct consistent experiments on three benchmark datasets using existing methods and introduce a new specific dataset to better validate the prediction of binding sites. For practical application, target-specific compound identification tasks are also carried out to validate the capability of real-world compound screen. Moreover, the visualization of some practical interaction scenarios provides interpretable insights from the results of the predictions. The proposed DrugMGR achieves excellent overall performance in these datasets, exhibiting its advantages and merits against state-of-the-art methods. Thus, the downstream task of DrugMGR can be fine-tuned for identifying the potential compounds that target proteins for clinical treatment. AVAILABILITY AND IMPLEMENTATION https://github.com/lixiaokun2020/DrugMGR.
Collapse
Affiliation(s)
- Xiaokun Li
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- School of Computer Science and Technology, Heilongjiang University, Harbin 150080, China
- Postdoctoral Program of Heilongjiang Hengxun Technology Co., Ltd., Harbin 150090, China
| | - Qiang Yang
- School of Medicine and Health, Harbin Institute of Technology, Harbin 150000, China
| | - Long Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Weihe Dong
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Gongning Luo
- Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, KAUST, Thuwal 23955, Saudi Arabia
| | - Wei Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China
| | - Suyu Dong
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Kuanquan Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Ping Xuan
- Department of Computer Science, School of Engineering, Shantou University, Shantou 515063, China
| | - Xianyu Zhang
- Department of Breast Surgery, Harbin Medical University Cancer Hospital, Harbin 150081, China
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, KAUST, Thuwal 23955, Saudi Arabia
| |
Collapse
|
2
|
Malusare A, Aggarwal V. Improving Molecule Generation and Drug Discovery with a Knowledge-enhanced Generative Model. ARXIV 2024:arXiv:2402.08790v1. [PMID: 38410649 PMCID: PMC10896363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 02/28/2024]
Abstract
Recent advancements in generative models have established state-of-the-art benchmarks in the generation of molecules and novel drug candidates. Despite these successes, a significant gap persists between generative models and the utilization of extensive biomedical knowledge, often systematized within knowledge graphs, whose potential to inform and enhance generative processes has not been realized. In this paper, we present a novel approach that bridges this divide by developing a framework for knowledge-enhanced generative models called K-DReAM. We develop a scalable methodology to extend the functionality of knowledge graphs while preserving semantic integrity, and incorporate this contextual information into a generative framework to guide a diffusion-based model. The integration of knowledge graph embeddings with our generative model furnishes a robust mechanism for producing novel drug candidates possessing specific characteristics while ensuring validity and synthesizability. K-DReAM outperforms state-of-the-art generative models on both unconditional and targeted generation tasks.
Collapse
Affiliation(s)
- Aditya Malusare
- School of Industrial Engineering, Purdue University, USA
- Purdue Institute for Cancer Research, Purdue University, USA
| | - Vaneet Aggarwal
- School of Industrial Engineering, Purdue University, USA
- Purdue Institute for Cancer Research, Purdue University, USA
| |
Collapse
|
3
|
Liu J, Yang M, Yu Y, Xu H, Li K, Zhou X. Large language models in bioinformatics: applications and perspectives. ARXIV 2024:arXiv:2401.04155v1. [PMID: 38259343 PMCID: PMC10802675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including applications of large language models in genomics, transcriptomics, proteomics, drug discovery and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.
Collapse
Affiliation(s)
- Jiajia Liu
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
| | - Mengyuan Yang
- School of Life Sciences, Zhengzhou University, Zhengzhou, Henan 450001, China
| | - Yankai Yu
- School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
| | - Haixia Xu
- The Center of Gerontology and Geriatrics, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Xiaobo Zhou
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| |
Collapse
|
4
|
Pei Q, Wu L, Zhu J, Xia Y, Xie S, Qin T, Liu H, Liu TY, Yan R. Breaking the barriers of data scarcity in drug-target affinity prediction. Brief Bioinform 2023; 24:bbad386. [PMID: 37903413 DOI: 10.1093/bib/bbad386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2023] [Revised: 09/14/2023] [Accepted: 10/05/2023] [Indexed: 11/01/2023] Open
Abstract
Accurate prediction of drug-target affinity (DTA) is of vital importance in early-stage drug discovery, facilitating the identification of drugs that can effectively interact with specific targets and regulate their activities. While wet experiments remain the most reliable method, they are time-consuming and resource-intensive, resulting in limited data availability that poses challenges for deep learning approaches. Existing methods have primarily focused on developing techniques based on the available DTA data, without adequately addressing the data scarcity issue. To overcome this challenge, we present the Semi-Supervised Multi-task training (SSM) framework for DTA prediction, which incorporates three simple yet highly effective strategies: (1) A multi-task training approach that combines DTA prediction with masked language modeling using paired drug-target data. (2) A semi-supervised training method that leverages large-scale unpaired molecules and proteins to enhance drug and target representations. This approach differs from previous methods that only employed molecules or proteins in pre-training. (3) The integration of a lightweight cross-attention module to improve the interaction between drugs and targets, further enhancing prediction accuracy. Through extensive experiments on benchmark datasets such as BindingDB, DAVIS and KIBA, we demonstrate the superior performance of our framework. Additionally, we conduct case studies on specific drug-target binding activities, virtual screening experiments, drug feature visualizations and real-world applications, all of which showcase the significant potential of our work. In conclusion, our proposed SSM-DTA framework addresses the data limitation challenge in DTA prediction and yields promising results, paving the way for more efficient and accurate drug discovery processes.
Collapse
Affiliation(s)
- Qizhi Pei
- Gaoling School of Artificial Intelligence, Renmin University of China, No.59, Zhong Guan Cun Avenue, Haidian District, 100872, Beijing, China
| | - Lijun Wu
- Microsoft Research AI4Science, No.5, Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Jinhua Zhu
- CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China, No.96, JinZhai Road, Baohe District, 230026, Hefei, Anhui Province, China
| | - Yingce Xia
- Microsoft Research AI4Science, No.5, Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Shufang Xie
- Gaoling School of Artificial Intelligence, Renmin University of China, No.59, Zhong Guan Cun Avenue, Haidian District, 100872, Beijing, China
| | - Tao Qin
- Engineering Research Center of Next-Generation Intelligent Search and Recommendation, Ministry of Education
| | - Haiguang Liu
- Microsoft Research AI4Science, No.5, Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Tie-Yan Liu
- Microsoft Research AI4Science, No.5, Dan Ling Street, Haidian District, 100080, Beijing, China
| | - Rui Yan
- Beijing Key Laboratory of Big Data Management and Analysis Methods
| |
Collapse
|
5
|
Ma J, Li C, Zhang Y, Wang Z, Li S, Guo Y, Zhang L, Liu H, Gao X, Song J. MULGA, a unified multi-view graph autoencoder-based approach for identifying drug-protein interaction and drug repositioning. Bioinformatics 2023; 39:btad524. [PMID: 37610353 PMCID: PMC10518077 DOI: 10.1093/bioinformatics/btad524] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2023] [Revised: 07/26/2023] [Accepted: 08/22/2023] [Indexed: 08/24/2023] Open
Abstract
MOTIVATION Identifying drug-protein interactions (DPIs) is a critical step in drug repositioning, which allows reuse of approved drugs that may be effective for treating a different disease and thereby alleviates the challenges of new drug development. Despite the fact that a great variety of computational approaches for DPI prediction have been proposed, key challenges, such as extendable and unbiased similarity calculation, heterogeneous information utilization, and reliable negative sample selection, remain to be addressed. RESULTS To address these issues, we propose a novel, unified multi-view graph autoencoder framework, termed MULGA, for both DPI and drug repositioning predictions. MULGA is featured by: (i) a multi-view learning technique to effectively learn authentic drug affinity and target affinity matrices; (ii) a graph autoencoder to infer missing DPI interactions; and (iii) a new "guilty-by-association"-based negative sampling approach for selecting highly reliable non-DPIs. Benchmark experiments demonstrate that MULGA outperforms state-of-the-art methods in DPI prediction and the ablation studies verify the effectiveness of each proposed component. Importantly, we highlight the top drugs shortlisted by MULGA that target the spike glycoprotein of severe acute respiratory syndrome coronavirus 2 (SAR-CoV-2), offering additional insights into and potentially useful treatment option for COVID-19. Together with the availability of datasets and source codes, we envision that MULGA can be explored as a useful tool for DPI prediction and drug repositioning. AVAILABILITY AND IMPLEMENTATION MULGA is publicly available for academic purposes at https://github.com/jianiM/MULGA/.
Collapse
Affiliation(s)
- Jiani Ma
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Chen Li
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Yiwen Zhang
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Zhikang Wang
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Shanshan Li
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Yuming Guo
- Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC 3004, Australia
| | - Lin Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Hui Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
| | - Xin Gao
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Jiangning Song
- Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Wenzhou Medical University-Monash Biomedicine Discovery Institute (BDI) Alliance in Clinical and Experimental Biomedicine, Wenzhou 325035, China
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
6
|
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, Zeng X. Modality-DTA: Multimodality Fusion Strategy for Drug-Target Affinity Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1200-1210. [PMID: 36083952 DOI: 10.1109/tcbb.2022.3205282] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Prediction of the drug-target affinity (DTA) plays an important role in drug discovery. Existing deep learning methods for DTA prediction typically leverage a single modality, namely simplified molecular input line entry specification (SMILES) or amino acid sequence to learn representations. SMILES or amino acid sequences can be encoded into different modalities. Multimodality data provide different kinds of information, with complementary roles for DTA prediction. We propose Modality-DTA, a novel deep learning method for DTA prediction that leverages the multimodality of drugs and targets. A group of backward propagation neural networks is applied to ensure the completeness of the reconstruction process from the latent feature representation to original multimodality data. The tag between the drug and target is used to reduce the noise information in the latent representation from multimodality data. Experiments on three benchmark datasets show that our Modality-DTA outperforms existing methods in all metrics. Modality-DTA reduces the mean square error by 15.7% and improves the area under the precisionrecall curve by 12.74% in the Davis dataset. We further find that the drug modality Morgan fingerprint and the target modality generated by one-hot-encoding play the most significant roles. To the best of our knowledge, Modality-DTA is the first method to explore multimodality for DTA prediction.
Collapse
|