1
Khawaja SA, Farooq MS, Ishaq K, Alsubaie N, Karamti H, Montero EC, Alvarado ES, Ashraf I. Prediction of leukemia peptides using convolutional neural network and protein compositions. BMC Cancer 2024; 24:900. [PMID: 39060972 PMCID: PMC11282659 DOI: 10.1186/s12885-024-12609-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2024] [Accepted: 07/08/2024] [Indexed: 07/28/2024]
Abstract
Leukemia is a cancer of the blood-forming cells in the bone marrow. The two main types are acute leukemia, which progresses rapidly, and chronic leukemia, which develops gradually; both are further classified into lymphocytic and myeloid leukemias. This work evaluates a deep convolutional neural network (CNN) classifier that improves identification precision by examining concatenated peptide patterns. The study uses leukemia protein expression data in experiments under two evaluation schemes: independent testing and cross-validation. In addition to the CNN, a multilayer perceptron (MLP), a gated recurrent unit (GRU), and a recurrent neural network (RNN) are applied. The experimental results show that the CNN model outperforms the other models in both independent and cross-validation testing on features extracted from protein expressions, namely amino acid composition (AAC) with grouped AAC (GAAC), tripeptide composition (TPC) with grouped TPC (GTPC), and dipeptide composition (DPC), with accuracies reported alongside receiver operating characteristic (ROC) curves. In independent testing, the AAC and GAAC features with the MLP and CNN models achieved an overall accuracy of 100% for the detection of protein patterns. In cross-validation testing, the AAC and GAAC features achieved 98.33% accuracy with the CNN model, the highest among the models. Furthermore, the ROC curves show a strong result of 0.965% for the GRU model. The findings show that the CNN model identifies leukemia from protein expressions with high accuracy.
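The composition features named in this abstract can be illustrated with a minimal Python sketch. It uses the commonly cited definitions of AAC (per-residue frequency over the sequence length) and DPC (frequency of each of the 400 ordered residue pairs over the number of adjacent pairs); the function names `aac` and `dpc` are hypothetical, and the paper's own preprocessing may differ.

```python
from itertools import product

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(seq):
    """Amino acid composition: fraction of each standard residue in seq.

    Assumption: non-standard letters count toward the length but get no entry.
    """
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return {a: seq.count(a) / len(seq) for a in AMINO_ACIDS}


def dpc(seq):
    """Dipeptide composition: fraction of each ordered residue pair.

    Overlapping adjacent pairs are counted, so a sequence of length L
    contributes L - 1 pairs.
    """
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = len(seq) - 1
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    return {pair: n / total for pair, n in counts.items()}


if __name__ == "__main__":
    example = "AAC"  # toy sequence, not taken from the paper
    print(aac(example)["A"])
    print(dpc(example)["AA"])
```

Concatenating such fixed-length vectors (20 values for AAC, 400 for DPC) gives a numeric input that a classifier such as an MLP or CNN can consume regardless of sequence length.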
Affiliation(s)
- Seher Ansar Khawaja
  - School of System and Technology, University of Management and Technology, Lahore, 54000, Pakistan
- Muhammad Shoaib Farooq
  - School of System and Technology, University of Management and Technology, Lahore, 54000, Pakistan
- Kashif Ishaq
  - School of System and Technology, University of Management and Technology, Lahore, 54000, Pakistan
- Najah Alsubaie
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O.Box 84428, Riyadh, 11671, Saudi Arabia
- Hanen Karamti
  - Department of Computer Sciences, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O.Box 84428, Riyadh, 11671, Saudi Arabia
- Elizabeth Caro Montero
  - Universidad Europea del Atlántico, Isabel Torres 21, 39011, Santander, Spain
  - Universidad Internacional Iberoamericana Arecibo, Puerto Rico, 00613, USA
  - Universidade Internacional do Cuanza, Cuito, Bié, Angola
- Eduardo Silva Alvarado
  - Universidad Europea del Atlántico, Isabel Torres 21, 39011, Santander, Spain
  - Universidad Internacional Iberoamericana, Campeche, 24560, México
  - Universidad de La Romana, La Romana, República Dominicana
- Imran Ashraf
  - Information and Communication Engineering, Yeungnam University, Gyeongsan, 38541, Korea
2
Chu H, Liu T. Comprehensive Research on Druggable Proteins: From PSSM to Pre-Trained Language Models. Int J Mol Sci 2024; 25:4507. [PMID: 38674091 PMCID: PMC11049818 DOI: 10.3390/ijms25084507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 04/15/2024] [Accepted: 04/17/2024] [Indexed: 04/28/2024]
Abstract
Identification of druggable proteins can greatly reduce the cost of discovering new potential drugs. Traditional experimental approaches to exploring these proteins are often costly, slow, and labor-intensive, making them impractical for large-scale research. In response, recent decades have seen a rise in computational methods. These alternatives support drug discovery by creating advanced predictive models. In this study, we proposed a fast and precise classifier for the identification of druggable proteins using a protein language model (PLM) with fine-tuned evolutionary scale modeling 2 (ESM-2) embeddings, achieving 95.11% accuracy on the benchmark dataset. Furthermore, we made a careful comparison to examine the predictive abilities of ESM-2 embeddings and position-specific scoring matrix (PSSM) features by using the same classifiers. The results suggest that ESM-2 embeddings outperformed PSSM features in terms of accuracy and efficiency. Recognizing the potential of language models, we also developed an end-to-end model based on the generative pre-trained transformers 2 (GPT-2) with modifications. To our knowledge, this is the first time a large language model (LLM) GPT-2 has been deployed for the recognition of druggable proteins. Additionally, a more up-to-date dataset, known as Pharos, was adopted to further validate the performance of the proposed model.
Affiliation(s)
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
3
Arif M, Fang G, Ghulam A, Musleh S, Alam T. DPI_CDF: druggable protein identifier using cascade deep forest. BMC Bioinformatics 2024; 25:145. [PMID: 38580921 DOI: 10.1186/s12859-024-05744-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Accepted: 03/13/2024] [Indexed: 04/07/2024]
Abstract
BACKGROUND Drug targets in living beings perform pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets, although accurate, is generally expensive, slow, and resource intensive. Therefore, computational methods are highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory. METHODS In this study, we developed a novel deep learning-based model, DPI_CDF, for predicting DPs based on protein sequence only. DPI_CDF utilizes evolutionary-based (i.e., histograms of oriented gradients for position-specific scoring matrix), physicochemical-based (i.e., component protein sequence representation), and compositional-based (i.e., normalized qualitative characteristic) properties of protein sequence to generate features. Then a hierarchical deep forest model fuses these three encoding schemes to build the proposed model DPI_CDF. RESULTS The empirical outcomes on 10-fold cross-validation demonstrate that the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. When compared to current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will support the research community in identifying druggable proteins and escalate the drug discovery process. AVAILABILITY The benchmark datasets and source codes are available in GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF.
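Accuracy and MCC, the two metrics reported in this abstract, are both derived from a binary confusion matrix. The following Python sketch uses the standard textbook formulas; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and MCC from raw counts.

    Ratios with an empty denominator are reported as 0.0 here, which is a
    convention chosen for this sketch rather than a universal rule.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


if __name__ == "__main__":
    # Illustrative counts only: 45 true positives, 45 true negatives,
    # 5 false positives, 5 false negatives.
    print(binary_metrics(45, 45, 5, 5))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the druggable and non-druggable classes are imbalanced.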
Affiliation(s)
- Muhammad Arif
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Ge Fang
  - State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing 210023, China
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand
- Ali Ghulam
  - Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan
- Saleh Musleh
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
- Tanvir Alam
  - College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar
4
Chen J, Gu Z, Lai L, Pei J. In silico protein function prediction: the rise of machine learning-based approaches. MEDICAL REVIEW (2021) 2023; 3:487-510. [PMID: 38282798 PMCID: PMC10808870 DOI: 10.1515/mr-2023-0038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 10/11/2023] [Indexed: 01/30/2024]
Abstract
Proteins function as integral actors in essential life processes, rendering the realm of protein research a fundamental domain that possesses the potential to propel advancements in pharmaceuticals and disease investigation. Within the context of protein research, an imperious demand arises to uncover protein functionalities and untangle intricate mechanistic underpinnings. Due to the exorbitant costs and limited throughput inherent in experimental investigations, computational models offer a promising alternative to accelerate protein function annotation. In recent years, protein pre-training models have exhibited noteworthy advancement across multiple prediction tasks. This advancement highlights a notable prospect for effectively tackling the intricate downstream task associated with protein function prediction. In this review, we elucidate the historical evolution and research paradigms of computational methods for predicting protein function. Subsequently, we summarize the progress in protein and molecule representation as well as feature extraction techniques. Furthermore, we assess the performance of machine learning-based algorithms across various objectives in protein function prediction, thereby offering a comprehensive perspective on the progress within this field.
Affiliation(s)
- Jiaxiao Chen
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Zhonghui Gu
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
5
Yu L, Zhang Y, Xue L, Liu F, Jing R, Luo J. EnsembleDL-ATG: Identifying autophagy proteins by integrating their sequence and evolutionary information using an ensemble deep learning framework. Comput Struct Biotechnol J 2023; 21:4836-4848. [PMID: 37854634 PMCID: PMC10579870 DOI: 10.1016/j.csbj.2023.09.036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 09/26/2023] [Accepted: 09/27/2023] [Indexed: 10/20/2023]
Abstract
Autophagy is a primary mechanism for maintaining cellular homeostasis. The synergistic actions of autophagy-related (ATG) proteins strictly regulate the whole autophagic process. Therefore, accurate identification of ATGs is a first and critical step to reveal the molecular mechanism underlying the regulation of autophagy. Current computational methods can predict ATGs from primary protein sequences, but owing to the limitations of algorithms, significant room for improvement still exists. In this research, we propose EnsembleDL-ATG, an ensemble deep learning framework that aggregates multiple deep learning models to predict ATGs from protein sequence and evolutionary information. We first evaluated the performance of individual networks for various feature descriptors to identify the most promising models. Then, we explored all possible combinations of independent models to select the most effective ensemble architecture. The final framework was built and maintained by an organization of four different deep learning models. Experimental results show that our proposed method achieves a prediction accuracy of 94.5 % and MCC of 0.890, which are nearly 4 % and 0.08 higher than ATGPred-FL, respectively. Overall, EnsembleDL-ATG is the first ATG machine learning predictor based on ensemble deep learning. The benchmark data and code utilized in this study can be accessed for free at https://github.com/jingry/autoBioSeqpy/tree/2.0/examples/EnsembleDL-ATG.
Affiliation(s)
- Lezheng Yu
  - School of Chemistry and Materials Science, Guizhou Education University, Guiyang 550018, Guizhou, China
  - Basic Medical College, Southwest Medical University, Luzhou 646000, Sichuan, China
- Yonglin Zhang
  - Department of Pharmacy, The Affiliated Hospital of North Sichuan Medical College, Nanchong 637000, Sichuan, China
- Li Xue
  - School of Public Health, Southwest Medical University, Luzhou 646000, Sichuan, China
- Fengjuan Liu
  - School of Geography and Resources, Guizhou Education University, Guiyang 550018, Guizhou, China
- Runyu Jing
  - School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, Sichuan, China
- Jiesi Luo
  - Basic Medical College, Southwest Medical University, Luzhou 646000, Sichuan, China
  - Sichuan Key Medical Laboratory of New Drug Discovery and Druggability Evaluation, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Southwest Medical University, Luzhou 646000, Sichuan, China
6
Shoombuatong W, Schaduangrat N, Nikom J. Empirical comparison and analysis of machine learning-based approaches for druggable protein identification. EXCLI JOURNAL 2023; 22:915-927. [PMID: 37780939 PMCID: PMC10539545 DOI: 10.17179/excli2023-6410] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 08/15/2023] [Indexed: 10/03/2023]
Abstract
Efficiently and precisely identifying drug targets is crucial for developing and discovering potential medications. While conventional experimental approaches can accurately pinpoint these targets, they suffer from time constraints and are not easily adaptable to high-throughput processes. On the other hand, computational approaches, particularly those utilizing machine learning (ML), offer an efficient means to accelerate the prediction of druggable proteins based solely on their primary sequences. Recently, several state-of-the-art computational methods have been developed for predicting and analyzing druggable proteins. These computational methods showed high diversity in terms of benchmark datasets, feature extraction schemes, ML algorithms, evaluation strategies and webserver/software usability. Thus, our objective is to reexamine these computational approaches and conduct a comprehensive assessment of their strengths and weaknesses across multiple aspects. In this study, we deliver the first comprehensive survey regarding the state-of-the-art computational approaches for in silico prediction of druggable proteins. First, we provided information regarding the existing benchmark datasets and the types of ML methods employed. Second, we investigated the effectiveness of these computational methods in druggable protein identification for each benchmark dataset. Third, we summarized the important features used in this field and the existing webserver/software. Finally, we addressed the present constraints of the existing methods and offer valuable guidance to the scientific community in designing and developing novel prediction models. We anticipate that this comprehensive review will provide crucial information for the development of more accurate and efficient druggable protein predictors.
Affiliation(s)
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
- Nalini Schaduangrat
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, Thailand, 10700
- Jaru Nikom
  - Research Methodology and Data Analytics Program, Faculty of Science & Technology, Prince of Songkla University, Pattani, Thailand, 94000
7
Karhana S, Dabral S, Garg A, Bano A, Agarwal N, Khan MA. Network pharmacology and molecular docking analysis on potential molecular targets and mechanism of action of BRAF inhibitors for application in wound healing. J Cell Biochem 2023. [PMID: 37334778 DOI: 10.1002/jcb.30430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Revised: 05/16/2023] [Accepted: 05/19/2023] [Indexed: 06/20/2023]
Abstract
Topical application of BRAF inhibitors has been shown to accelerate wound healing in murine models, which can be extrapolated into clinical applications. The aim of the study was to identify suitable pharmacological targets of BRAF inhibitors and elucidate their mechanisms of action for therapeutic applicability in wound healing, by employing bioinformatics tools including network pharmacology and molecular docking. The potential targets for BRAF inhibitors were obtained from SwissTargetPrediction, DrugBank, CTD, Therapeutic Target Database, and Binding Database. Targets of wound healing were obtained using online databases DisGeNET and OMIM (Online Mendelian Inheritance in Man). Common targets were found by using the online GeneVenn tool. Common targets were then imported to STRING to construct interaction networks. Topological parameters were assessed using Cytoscape and core targets were identified. FunRich was employed to uncover the signaling pathways, cellular components, molecular functions, and biological processes in which the core targets participate. Finally, molecular docking was performed using MOE software. Key targets for the therapeutic application of BRAF inhibitors for wound healing are peroxisome proliferator-activated receptor γ, matrix metalloproteinase 9, AKT serine/threonine kinase 1, mammalian target of rapamycin, and Ki-ras2 Kirsten rat sarcoma viral oncogene homolog. The most potent BRAF inhibitors that can be exploited for their paradoxical activity for wound healing applications are Encorafenib and Dabrafenib. By using network pharmacology and molecular docking, it can be predicted that the paradoxical activity of BRAF inhibitors can be used for their potential application in wound healing.
Affiliation(s)
- Sonali Karhana
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
- Swarna Dabral
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
  - Department of Pharmacology, School of Pharmaceutical Education and Research, Jamia Hamdard, New Delhi, India
- Aakriti Garg
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
  - Department of Pharmacology, School of Pharmaceutical Education and Research, Jamia Hamdard, New Delhi, India
- Aysha Bano
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
- Nidhi Agarwal
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
- Mohd Ashif Khan
  - Centre for Translational & Clinical Research, School of Chemical and Life Sciences, Jamia Hamdard, New Delhi, India
8
Zhang Y, Yu L, Jing R, Han B, Luo J. Fast and Efficient Design of Deep Neural Networks for Predicting N7-Methylguanosine Sites Using autoBioSeqpy. ACS OMEGA 2023; 8:19728-19740. [PMID: 37305295 PMCID: PMC10249100 DOI: 10.1021/acsomega.3c01371] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Accepted: 05/10/2023] [Indexed: 06/13/2023]
Abstract
N7-Methylguanosine (m7G) is a crucial post-transcriptional RNA modification that plays a pivotal role in regulating gene expression. Accurately identifying m7G sites is a fundamental step in understanding the biological functions and regulatory mechanisms associated with this modification. While whole-genome sequencing is the gold standard for RNA modification site detection, it is a time-consuming, expensive, and intricate process. Recently, computational approaches, especially deep learning (DL) techniques, have gained popularity in achieving this objective. Convolutional neural networks and recurrent neural networks are examples of DL algorithms that have emerged as versatile tools for modeling biological sequence data. However, developing an efficient network architecture with superior performance remains a challenging task, requiring significant expertise, time, and effort. To address this, we previously introduced a tool called autoBioSeqpy, which streamlines the design and implementation of DL networks for biological sequence classification. In this study, we utilized autoBioSeqpy to develop, train, evaluate, and fine-tune sequence-level DL models for predicting m7G sites. We provided detailed descriptions of these models, along with a step-by-step guide on their execution. The same methodology can be applied to other systems dealing with similar biological questions. The benchmark data and code utilized in this study can be accessed for free at http://github.com/jingry/autoBioSeqpy/tree/2.0/examples/m7G.
Affiliation(s)
- Yonglin Zhang
  - Department of Pharmacy, Affiliated Hospital of North Sichuan Medical College, Nanchong 637000, China
- Lezheng Yu
  - School of Chemistry and Materials Science, Guizhou Education University, Guiyang 550024, China
- Runyu Jing
  - School of Cyber Science and Engineering, Sichuan University, Chengdu 610017, China
- Bin Han
  - GCP Center/Institute of Drug Clinical Trials, Affiliated Hospital of North Sichuan Medical College, Nanchong 637503, China
- Jiesi Luo
  - Basic Medical College, Southwest Medical University, Luzhou 646099, Sichuan, China
  - Key Medical Laboratory of New Drug Discovery and Druggability Evaluation, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Southwest Medical University, Luzhou 646099, China
9
Monti A, Vitagliano L, Caporale A, Ruvo M, Doti N. Targeting Protein-Protein Interfaces with Peptides: The Contribution of Chemical Combinatorial Peptide Library Approaches. Int J Mol Sci 2023; 24:7842. [PMID: 37175549 PMCID: PMC10178479 DOI: 10.3390/ijms24097842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Revised: 04/22/2023] [Accepted: 04/23/2023] [Indexed: 05/15/2023]
Abstract
Protein-protein interfaces play fundamental roles in the molecular mechanisms underlying pathophysiological pathways and are important targets for the design of compounds of therapeutic interest. However, the identification of binding sites on protein surfaces and the development of modulators of protein-protein interactions still represent a major challenge due to their highly dynamic and extensive interfacial areas. Over the years, multiple strategies including structural, computational, and combinatorial approaches have been developed to characterize PPI and to date, several successful examples of small molecules, antibodies, peptides, and aptamers able to modulate these interfaces have been determined. Notably, peptides are a particularly useful tool for inhibiting PPIs due to their exquisite potency, specificity, and selectivity. Here, after an overview of PPIs and of the commonly used approaches to identify and characterize them, we describe and evaluate the impact of chemical peptide libraries in medicinal chemistry with a special focus on the results achieved through recent applications of this methodology. Finally, we also discuss the role that this methodology can have in the framework of the opportunities, and challenges that the application of new predictive approaches based on artificial intelligence is generating in structural biology.
Affiliation(s)
- Alessandra Monti
  - Institute of Biostructures and Bioimaging (IBB), National Research Council (CNR), 80131 Napoli, Italy
- Luigi Vitagliano
  - Institute of Biostructures and Bioimaging (IBB), National Research Council (CNR), 80131 Napoli, Italy
- Andrea Caporale
  - Institute of Crystallography (IC), National Research Council (CNR), Strada Statale 14 km 163.5, Basovizza, 34149 Trieste, Italy
- Menotti Ruvo
  - Institute of Biostructures and Bioimaging (IBB), National Research Council (CNR), 80131 Napoli, Italy
- Nunzianna Doti
  - Institute of Biostructures and Bioimaging (IBB), National Research Council (CNR), 80131 Napoli, Italy
10
Chen J, Gu Z, Xu Y, Deng M, Lai L, Pei J. QuoteTarget: A sequence-based transformer protein language model to identify potentially druggable protein targets. Protein Sci 2023; 32:e4555. [PMID: 36564866 PMCID: PMC9878469 DOI: 10.1002/pro.4555] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/06/2022] [Revised: 12/16/2022] [Accepted: 12/20/2022] [Indexed: 12/25/2022]
Abstract
The development of efficient computational methods for drug target protein identification can compensate for the high cost of experiments and is therefore of great significance for drug development. However, existing structure-based drug target protein-identification algorithms are limited by the insufficient number of proteins with experimentally resolved structures. Moreover, sequence-based algorithms cannot effectively extract information from protein sequences and thus display insufficient accuracy. Here, we combined the sequence-based self-supervised pretraining protein language model ESM1b with a graph convolutional neural network classifier to develop an improved, sequence-based drug target protein identification method. This complete model, named QuoteTarget, efficiently encodes proteins based on sequence information alone and achieves an accuracy of 95% with the nonredundant drug target and nondrug target datasets constructed for this study. When applied to all proteins from Homo sapiens, QuoteTarget identified 1213 potential undeveloped drug target proteins. We further inferred residue-binding weights from the well-trained network using the gradient-weighted class activation mapping (Grad-CAM) algorithm. Notably, we found that without any binding site information input, significant residues inferred by the model closely match the experimentally confirmed drug molecule-binding sites. Thus, our work provides a highly effective sequence-based identifier for drug target proteins, as well as new insights into recognizing drug molecule-binding sites. The entire model is available at https://github.com/Chenjxjx/drug-target-prediction.
Affiliation(s)
- Jiaxiao Chen
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Zhonghui Gu
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Youjun Xu
  - Infinite Intelligence Pharma, Beijing, China
- Minghua Deng
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - School of Mathematical Sciences, Peking University, Beijing, China
  - Center for Statistical Science, Peking University, Beijing, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences, Beijing, China
- Jianfeng Pei
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences, Beijing, China
11
Iraji MS, Tanha J, Habibinejad M. Druggable protein prediction using a multi-canal deep convolutional neural network based on autocovariance method. Comput Biol Med 2022; 151:106276. [PMID: 36410099 DOI: 10.1016/j.compbiomed.2022.106276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Revised: 10/18/2022] [Accepted: 10/30/2022] [Indexed: 11/09/2022]
Abstract
Drug targets must be identified and positioned correctly to research and manufacture new drugs. In this study, rather than using traditional methods for drug expansion, the drug target is determined using machine learning. Machine learning has generated significant interest and desire in recent years and extensive research due to its low cost and speed of operation. As a result, it is critical to develop an intelligent classification system for drug proteins. This study proposes two distinct models for the prediction of druggable protein classes based on the deep learning method. The translation of drug-protein sequences is based on six physicochemical properties of amino acids. Following the application of the autocovariance method, converted sequences are used as fixed-length input vectors in deep stacked sparse auto-encoders (DSSAEs) network. The coded protein sequences are also considered and utilized as a six-channel input vector for the deep convolutional neural network model. The experimental results contributing to the deep convolution model are more efficient than previous studies for classifying druggable proteins. The proposed approach achieved a sensitivity of 96.92%, a specificity of 99.51%, and an accuracy of 98.29%.
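The step of turning variable-length sequences into fixed-length vectors via autocovariance can be sketched as follows. This is a minimal illustration of the commonly used autocovariance descriptor, not the authors' implementation: the abstract does not name its six physicochemical properties, so a single Kyte-Doolittle hydropathy scale stands in as an assumed example, and the function name `autocovariance` and the lag range are hypothetical.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def autocovariance(seq, scale, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one property scale.

    AC(lag) = 1 / (L - lag) * sum_i (x_i - mean) * (x_{i+lag} - mean),
    where x_i is the property value of residue i and L is the number of
    residues found in the scale (unknown letters are skipped, an
    assumption made for this sketch).
    """
    values = [scale[a] for a in seq.upper() if a in scale]
    length = len(values)
    if max_lag < 1 or max_lag >= length:
        raise ValueError("max_lag must satisfy 1 <= max_lag < sequence length")
    mean = sum(values) / length
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(length - lag))
        / (length - lag)
        for lag in range(1, max_lag + 1)
    ]


if __name__ == "__main__":
    # Toy sequence, not from the paper; output length equals max_lag.
    print(autocovariance("ACDEFGHIKLMNPQ", KYTE_DOOLITTLE, max_lag=3))
```

With several property scales, the per-scale vectors would be concatenated, so every protein yields a vector of length (number of scales) × `max_lag` whatever its sequence length.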
Affiliation(s)
- Mohammad Saber Iraji
  - Department of Computer Engineering and Information Technology, Payame Noor University, Tehran, Iran
  - Department of Computer Engineering, University of Tabriz, Tabriz, Iran
- Jafar Tanha
  - Department of Computer Engineering, University of Tabriz, Tabriz, Iran
- Mahboobeh Habibinejad
  - Department of Computer Engineering and Information Technology, Payame Noor University, Tehran, Iran
12
|
Data-driven analysis and druggability assessment methods to accelerate the identification of novel cancer targets. Comput Struct Biotechnol J 2022; 21:46-57. [PMID: 36514341 PMCID: PMC9732000 DOI: 10.1016/j.csbj.2022.11.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2022] [Revised: 11/21/2022] [Accepted: 11/21/2022] [Indexed: 11/27/2022] Open
Abstract
Over the past few decades, drug discovery has greatly improved the outcomes for patients, but several challenges continue to hinder the rapid development of novel drugs. Addressing unmet clinical needs requires the pursuit of drug targets that have a higher likelihood to lead to the development of successful drugs. Here we describe a bioinformatic approach for identifying novel cancer drug targets by performing statistical analysis to ascertain quantitative changes in expression levels between protein-coding genes, as well as co-expression networks to classify these genes into groups. Subsequently, we provide an overview of druggability assessment methodologies to prioritize and select the best targets to pursue.
|
13
|
Raies A, Tulodziecka E, Stainer J, Middleton L, Dhindsa RS, Hill P, Engkvist O, Harper AR, Petrovski S, Vitsios D. DrugnomeAI is an ensemble machine-learning framework for predicting druggability of candidate drug targets. Commun Biol 2022; 5:1291. [PMID: 36434048 PMCID: PMC9700683 DOI: 10.1038/s42003-022-04245-4] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/26/2022] [Accepted: 11/09/2022] [Indexed: 11/27/2022] Open
Abstract
The druggability of targets is a crucial consideration in drug target selection. Here, we adopt a stochastic semi-supervised ML framework to develop DrugnomeAI, which estimates the druggability likelihood for every protein-coding gene in the human exome. DrugnomeAI integrates gene-level properties from 15 sources resulting in 324 features. The tool generates exome-wide predictions based on labelled sets of known drug targets (median AUC: 0.97), highlighting features from protein-protein interaction networks as top predictors. DrugnomeAI provides generic as well as specialised models stratified by disease type or drug therapeutic modality. The top-ranking DrugnomeAI genes were significantly enriched for genes previously selected for clinical development programs (p value < 1 × 10⁻³⁰⁸) and for genes achieving genome-wide significance in phenome-wide association studies of 450 K UK Biobank exomes for binary (p value = 1.7 × 10⁻⁵) and quantitative traits (p value = 1.6 × 10⁻⁷). We accompany our method with a web application (http://drugnomeai.public.cgr.astrazeneca.com) to visualise the druggability predictions and the key features that define gene druggability, per disease type and modality.
Affiliation(s)
- Arwa Raies: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Ewa Tulodziecka: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- James Stainer: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Lawrence Middleton: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Ryan S Dhindsa: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Waltham, MA, USA; Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, USA
- Pamela Hill: Emerging Innovations, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Waltham, MA, USA
- Ola Engkvist: Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Andrew R Harper: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
- Slavé Petrovski: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK; Department of Medicine, University of Melbourne, Austin Health, Melbourne, VIC, Australia
- Dimitrios Vitsios: Centre for Genomics Research, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
|
14
|
Computational prediction and interpretation of druggable proteins using a stacked ensemble-learning framework. iScience 2022; 25:104883. [PMID: 36046193 PMCID: PMC9421381 DOI: 10.1016/j.isci.2022.104883] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 04/24/2022] [Revised: 07/08/2022] [Accepted: 08/02/2022] [Indexed: 11/22/2022] Open
Abstract
Discovery of potential drugs requires rapid and precise identification of drug targets. Although traditional experimental methodologies can accurately identify drug targets, they are time-consuming and inappropriate for high-throughput screening. Computational approaches based on machine learning (ML) algorithms can expedite the prediction of druggable proteins; however, the performance of the existing computational methods remains unsatisfactory. This study proposes a computational tool, SPIDER, to enhance the accurate prediction of druggable proteins. SPIDER employs various feature descriptors pertaining to several aspects, including physicochemical properties, compositional information, and composition-transition-distribution information, coupled with well-known ML algorithms to facilitate the construction of the final meta-predictor. The experimental results showed that SPIDER enabled more precise and robust prediction of druggable proteins than the baseline models and existing methods on the independent test dataset. An online web server was established and made freely available. Computational models can expedite the identification of potential druggable proteins. SPIDER represents the first stacked model proposed for druggable protein prediction. SPIDER enables more precise prediction of druggable proteins than existing methods. The SPIDER web server is available at http://pmlabstack.pythonanywhere.com/SPIDER.
|