1
Choudhary P, Kunnakkattu IR, Nair S, Lawal DK, Pidruchna I, Afonso MQL, Fleming JR, Velankar S. PDBe tools for an in-depth analysis of small molecules in the Protein Data Bank. Protein Sci 2025; 34:e70084. [PMID: 40100137 PMCID: PMC11917123 DOI: 10.1002/pro.70084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Revised: 01/27/2025] [Accepted: 02/12/2025] [Indexed: 03/20/2025]
Abstract
The Protein Data Bank (PDB) is the primary global repository for experimentally determined 3D structures of biological macromolecules and their complexes with ligands, proteins, and nucleic acids. PDB contains over 47,000 unique small molecules bound to the macromolecules. Despite the extensive data available, the complexity of small-molecule data in the PDB necessitates specialized tools for effective analysis and visualization. PDBe has developed a number of tools, including PDBe CCDUtils (https://github.com/PDBeurope/ccdutils) for accessing and enriching ligand data, PDBe Arpeggio (https://github.com/PDBeurope/arpeggio) for analyzing interactions between ligands and macromolecules, and PDBe RelLig (https://github.com/PDBeurope/rellig) for identifying the functional roles of ligands (such as reactants, cofactors, or drug-like molecules) within protein-ligand complexes. The enhanced ligand annotations and data generated by these tools are presented on the novel PDBe-KB ligand pages, offering a comprehensive overview of small molecules and providing valuable insights into their biological contexts (example page for Imatinib: https://pdbe.org/chem/sti). By improving the standardization of ligand identification, adding various annotations, and offering advanced visualization capabilities, these tools help researchers navigate the complexities of small molecules and their roles in biological systems, facilitating mechanistic understanding of biological functions. The ongoing enhancements to these resources are designed to support the scientific community in gaining valuable insights into ligands and their applications across various fields, including drug discovery, molecular biology, systems biology, structural biology, and pharmacology.
Affiliation(s)
- Preeti Choudhary
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Ibrahim Roshan Kunnakkattu
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Sreenath Nair
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Dare Kayode Lawal
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Ivanna Pidruchna
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Marcelo Querino Lima Afonso
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Jennifer R Fleming
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
- Sameer Velankar
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, UK
2
Zheng S, Zhang C, Chen Y, Chen M. Graph and Multi-Level Sequence Fusion Learning for Predicting the Molecular Activity of BACE-1 Inhibitors. Int J Mol Sci 2025; 26:1681. [PMID: 40004143 PMCID: PMC11855840 DOI: 10.3390/ijms26041681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/12/2025] [Accepted: 02/14/2025] [Indexed: 02/27/2025]
Abstract
The development of BACE-1 (β-site amyloid precursor protein cleaving enzyme 1) inhibitors is a crucial focus in exploring early treatments for Alzheimer's disease (AD). Recently, graph neural networks (GNNs) have demonstrated significant advantages in predicting molecular activity. However, their reliance on graph structures alone often neglects explicit sequence-level semantic information. To address this limitation, we proposed a Graph and multi-level Sequence Fusion Learning (GSFL) model for predicting the molecular activity of BACE-1 inhibitors. First, molecular graph structures generated from SMILES strings were encoded using GNNs with an atomic-level characteristic attention mechanism. Next, substrings at the functional-group, ion, and atomic levels were extracted from the SMILES strings and encoded using a BiLSTM-Transformer framework equipped with a hierarchical attention mechanism. Finally, these features were fused to predict the activity of BACE-1 inhibitors. A dataset of 1548 compounds with BACE-1 activity measurements was curated from the ChEMBL database. In the classification experiment, the model achieved an accuracy of 0.941 on the training set and 0.877 on the test set. For the test set, it delivered a sensitivity of 0.852, a specificity of 0.894, an MCC of 0.744, an F1-score of 0.872, a PRC of 0.869, and an AUC of 0.915. Compared to traditional computer-aided drug design methods and other machine learning algorithms, the proposed model can effectively improve the accuracy of molecular activity prediction for BACE-1 inhibitors and has potential application value.
Affiliation(s)
- Shaohua Zheng
  - College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
- Changwang Zhang
  - College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
- Youjia Chen
  - College of Physics and Information Engineering, Fuzhou University, Fuzhou 350108, China
- Meimei Chen
  - College of Traditional Chinese Medicine, Fujian University of Traditional Chinese Medicine, Fuzhou 350122, China
3
McNaughton A, Sankar Ramalaxmi GK, Kruel A, Knutson CR, Varikoti RA, Kumar N. CACTUS: Chemistry Agent Connecting Tool Usage to Science. ACS OMEGA 2024; 9:46563-46573. [PMID: 39583666 PMCID: PMC11579734 DOI: 10.1021/acsomega.4c08408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Revised: 10/08/2024] [Accepted: 10/14/2024] [Indexed: 11/26/2024]
Abstract
Large language models (LLMs) have shown remarkable potential in various domains but often lack the ability to access and reason over domain-specific knowledge and tools. In this article, we introduce Chemistry Agent Connecting Tool-Usage to Science (CACTUS), an LLM-based agent that integrates existing cheminformatics tools to enable accurate and advanced reasoning and problem-solving in chemistry and molecular discovery. We evaluate the performance of CACTUS using a diverse set of open-source LLMs, including Gemma-7b, Falcon-7b, MPT-7b, Llama3-8b, and Mistral-7b, on a benchmark of thousands of chemistry questions. Our results demonstrate that CACTUS significantly outperforms baseline LLMs, with the Gemma-7b, Mistral-7b, and Llama3-8b models achieving the highest accuracy regardless of the prompting strategy used. Moreover, we explore the impact of domain-specific prompting and hardware configurations on model performance, highlighting the importance of prompt engineering and the potential for deploying smaller models on consumer-grade hardware without a significant loss in accuracy. By combining the cognitive capabilities of open-source LLMs with widely used domain-specific tools provided by RDKit, CACTUS can assist researchers in tasks such as molecular property prediction, similarity searching, and drug-likeness assessment.
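The tool-calling pattern described in this abstract can be sketched, under loose assumptions, as a registry that maps tool names to plain functions an agent may invoke. This is not the CACTUS implementation and does not use RDKit: the names `TOOLS` and `run_tool`, the set-based fingerprints, and the descriptor values are all hypothetical, and the drug-likeness check assumes the common convention of allowing at most one rule-of-five violation.

```python
from typing import Callable, Dict, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def passes_rule_of_five(mol_weight: float, logp: float,
                        h_donors: int, h_acceptors: int) -> bool:
    """Lipinski-style check; tolerates at most one violation (assumed convention)."""
    violations = sum([
        mol_weight > 500,
        logp > 5,
        h_donors > 5,
        h_acceptors > 10,
    ])
    return violations <= 1


# Hypothetical registry: an agent would choose a tool name and its arguments.
TOOLS: Dict[str, Callable] = {
    "similarity": tanimoto,
    "druglikeness": passes_rule_of_five,
}


def run_tool(name: str, **kwargs):
    """Dispatch a tool call by name, failing loudly on unknown tools."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


if __name__ == "__main__":
    # Illustrative inputs only; real descriptors would come from a cheminformatics toolkit.
    print(run_tool("similarity", a={1, 2, 3}, b={2, 3, 4}))
    print(run_tool("druglikeness", mol_weight=350.0, logp=2.1,
                   h_donors=2, h_acceptors=5))
```

In a real agent, the descriptors and fingerprints would be computed from a molecule's structure by the underlying toolkit rather than passed in by hand.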
Affiliation(s)
- Andrew D. McNaughton
  - Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Agustin Kruel
  - Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Carter R. Knutson
  - Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Rohith A. Varikoti
  - Pacific Northwest National Laboratory, Richland, Washington 99354, United States
- Neeraj Kumar
  - Pacific Northwest National Laboratory, Richland, Washington 99354, United States
4
Patne AY, Dhulipala SM, Lawless W, Prakash S, Mohapatra SS, Mohapatra S. Drug Discovery in the Age of Artificial Intelligence: Transformative Target-Based Approaches. Int J Mol Sci 2024; 25:12233. [PMID: 39596300 PMCID: PMC11594879 DOI: 10.3390/ijms252212233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 11/01/2024] [Accepted: 11/06/2024] [Indexed: 11/28/2024]
Abstract
The complexities inherent in drug development are multi-faceted and often hamper accuracy, speed and efficiency, thereby limiting success. This review explores how recent developments in machine learning (ML) are significantly impacting target-based drug discovery, particularly in small-molecule approaches. The Simplified Molecular Input Line Entry System (SMILES), which translates a chemical compound's three-dimensional structure into a string of symbols, is now widely used in drug design, mining, and repurposing. Utilizing ML and natural language processing techniques, SMILES has revolutionized lead identification, high-throughput screening and virtual screening. ML models enhance the accuracy of predicting binding affinity and selectivity, reducing the need for extensive experimental screening. Additionally, deep learning, with its strengths in analyzing spatial and sequential data through convolutional neural networks (CNNs) and recurrent neural networks (RNNs), shows promise for virtual screening, target identification, and de novo drug design. Fragment-based approaches also benefit from ML algorithms and techniques like generative adversarial networks (GANs), which predict fragment properties and binding affinities, aiding in hit selection and design optimization. Structure-based drug design, which relies on high-resolution protein structures, leverages ML models for accurate predictions of binding interactions. While challenges such as interpretability and data quality remain, ML's transformative impact accelerates target-based drug discovery, increasing efficiency and innovation. Its potential to deliver new and improved treatments for various diseases is significant.
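To illustrate how a SMILES string can be handled as text by language-processing methods, the sketch below splits a SMILES string into atom-level and symbol-level tokens. The regular expression and the function name `tokenize_smiles` are assumptions made for this example; the pattern covers a common subset of SMILES syntax and is not a full grammar or a method described in the review.

```python
import re
from typing import List

# Assumed token pattern: bracket atoms, two-letter halogens, two-digit ring
# closures, single-letter atoms, ring-closure digits, then bonds and branches.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]"
)


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into tokens, rejecting characters the pattern misses."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


if __name__ == "__main__":
    # Aspirin, written as a SMILES string.
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

Token sequences of this kind can then be mapped to integer indices and fed to sequence models such as RNNs or Transformers.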
Affiliation(s)
- Akshata Yashwant Patne
  - Center for Research and Education in Nanobioengineering, Department of Internal Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
  - Taneja College of Pharmacy Graduate Programs, MDC30, 12908 USF Health Drive, Tampa, FL 33612, USA
- Sai Madhav Dhulipala
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
- William Lawless
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
  - Research Service, James A. Haley Veterans Hospital, Tampa, FL 33612, USA
- Satya Prakash
  - Biomedical Technology and Cell Therapy Research Laboratory, Department of Biomedical Engineering, Faculty of Medicine and Health Sciences, McGill University, 3775 University Street, Montreal, QC H3A 2B4, Canada
- Shyam S. Mohapatra
  - Center for Research and Education in Nanobioengineering, Department of Internal Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
  - Taneja College of Pharmacy Graduate Programs, MDC30, 12908 USF Health Drive, Tampa, FL 33612, USA
  - Research Service, James A. Haley Veterans Hospital, Tampa, FL 33612, USA
- Subhra Mohapatra
  - Center for Research and Education in Nanobioengineering, Department of Internal Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
  - Taneja College of Pharmacy Graduate Programs, MDC30, 12908 USF Health Drive, Tampa, FL 33612, USA
  - Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, Tampa, FL 33612, USA
  - Research Service, James A. Haley Veterans Hospital, Tampa, FL 33612, USA