1
|
Qi J, Feng C, Shi Y, Yang J, Zhang F, Li G, Han R. FP-Zernike: An Open-source Structural Database Construction Toolkit for Fast Structure Retrieval. GENOMICS, PROTEOMICS & BIOINFORMATICS 2024; 22:qzae007. [PMID: 38894604 PMCID: PMC11423855 DOI: 10.1093/gpbjnl/qzae007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/23/2022] [Revised: 08/16/2023] [Accepted: 09/20/2023] [Indexed: 06/21/2024]
Abstract
The release of AlphaFold2 has sparked a rapid expansion in protein model databases. Efficient protein structure retrieval is crucial for the analysis of structure models, while measuring the similarity between structures is the key challenge in structural retrieval. Although existing structure alignment algorithms can address this challenge, they are often time-consuming. Currently, the state-of-the-art approach involves converting protein structures into three-dimensional (3D) Zernike descriptors and assessing similarity using Euclidean distance. However, the methods for computing 3D Zernike descriptors mainly rely on structural surfaces and are predominantly web-based, thus limiting their application in studying custom datasets. To overcome this limitation, we developed FP-Zernike, a user-friendly toolkit for computing different types of Zernike descriptors based on feature points. Users simply need to enter a single line of command to calculate the Zernike descriptors of all structures in customized datasets. FP-Zernike outperforms the leading method in terms of retrieval accuracy and binary classification accuracy across diverse benchmark datasets. In addition, we showed the application of FP-Zernike in the construction of the descriptor database and the protocol used for the Protein Data Bank (PDB) dataset to facilitate the local deployment of this tool for interested readers. Our demonstration contained 590,685 structures, and at this scale, our system required only 4-9 s to complete a retrieval. The experiments confirmed that it achieved the state-of-the-art accuracy level. FP-Zernike is an open-source toolkit, with the source code and related data accessible at https://ngdc.cncb.ac.cn/biocode/tools/BT007365/releases/0.1, as well as through a webserver at http://www.structbioinfo.cn/.
Collapse
Affiliation(s)
- Junhai Qi
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- BioMap Research, Menlo Park, CA 94025, USA
| | - Chenjie Feng
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- College of Medical Information and Engineering, Ningxia Medical University, Yinchuan 750004, China
| | - Yulin Shi
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Jianyi Yang
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Fa Zhang
- Institute of Engineering Medicine, Beijing Institute of Technology, Beijing 100081, China
| | - Guojun Li
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Renmin Han
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| |
Collapse
|
2
|
Eguida M, Rognan D. Estimating the Similarity between Protein Pockets. Int J Mol Sci 2022; 23:12462. [PMID: 36293316 PMCID: PMC9604425 DOI: 10.3390/ijms232012462] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2022] [Revised: 10/15/2022] [Accepted: 10/16/2022] [Indexed: 10/28/2023] Open
Abstract
With the exponential increase in publicly available protein structures, the comparison of protein binding sites naturally emerged as a scientific topic to explain observations or generate hypotheses for ligand design, notably to predict ligand selectivity for on- and off-targets, explain polypharmacology, and design target-focused libraries. The current review summarizes the state-of-the-art computational methods applied to pocket detection and comparison as well as structural druggability estimates. The major strengths and weaknesses of current pocket descriptors, alignment methods, and similarity search algorithms are presented. Lastly, an exhaustive survey of both retrospective and prospective applications in diverse medicinal chemistry scenarios illustrates the capability of the existing methods and the hurdle that still needs to be overcome for more accurate predictions.
Collapse
Affiliation(s)
| | - Didier Rognan
- Laboratoire d’Innovation Thérapeutique, UMR7200 CNRS-Université de Strasbourg, 67400 Illkirch, France
| |
Collapse
|
3
|
A mathematical representation of protein binding sites using structural dispersion of atoms from principal axes for classification of binding ligands. PLoS One 2021; 16:e0244905. [PMID: 33831020 PMCID: PMC8031081 DOI: 10.1371/journal.pone.0244905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2020] [Accepted: 03/09/2021] [Indexed: 11/23/2022] Open
Abstract
Many researchers have studied the relationship between the biological functions of proteins and the structures of both their overall backbones of amino acids and their binding sites. A large amount of the work has focused on summarizing structural features of binding sites as scalar quantities, which can result in a great deal of information loss since the structures are three-dimensional. Additionally, a common way of comparing binding sites is via aligning their atoms, which is a computationally intensive procedure that substantially limits the types of analysis and modeling that can be done. In this work, we develop a novel encoding of binding sites as covariance matrices of the distances of atoms to the principal axes of the structures. This representation is invariant to the chosen coordinate system for the atoms in the binding sites, which removes the need to align the sites to a common coordinate system, is computationally efficient, and permits the development of probability models. These can then be used to both better understand groups of binding sites that bind to the same ligand and perform classification for these ligand groups. We demonstrate the utility of our method for discrimination of binding ligand through classification studies with two benchmark datasets using nearest mean and polytomous logistic regression classifiers.
Collapse
|
4
|
Georgiev GD, Dodd KF, Chen BY. Precise parallel volumetric comparison of molecular surfaces and electrostatic isopotentials. Algorithms Mol Biol 2020; 15:11. [PMID: 32489400 PMCID: PMC7247173 DOI: 10.1186/s13015-020-00168-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2019] [Accepted: 04/15/2020] [Indexed: 11/10/2022] Open
Abstract
Geometric comparisons of binding sites and their electrostatic properties can identify subtle variations that select different binding partners and subtle similarities that accommodate similar partners. Because subtle features are central for explaining how proteins achieve specificity, algorithmic efficiency and geometric precision are central to algorithmic design. To address these concerns, this paper presents pClay, the first algorithm to perform parallel and arbitrarily precise comparisons of molecular surfaces and electrostatic isopotentials as geometric solids. pClay was presented at the 2019 Workshop on Algorithms in Bioinformatics (WABI 2019) and is described in expanded detail here, especially with regard to the comparison of electrostatic isopotentials. Earlier methods have generally used parallelism to enhance computational throughput, pClay is the first algorithm to use parallelism to make arbitrarily high precision comparisons practical. It is also the first method to demonstrate that high precision comparisons of geometric solids can yield more precise structural inferences than algorithms that use existing standards of precision. One advantage of added precision is that statistical models can be trained with more accurate data. Using structural data from an existing method, a model of steric variations between binding cavities can overlook 53% of authentic steric influences on specificity, whereas a model trained with data from pClay overlooks none. Our results also demonstrate the parallel performance of pClay on both workstation CPUs and a 61-core Xeon Phi. While slower on one core, additional processor cores rapidly outpaced single core performance and existing methods. Based on these results, it is clear that pClay has applications in the automatic explanation of binding mechanisms and in the rational design of protein binding preferences.
Collapse
Affiliation(s)
- Georgi D. Georgiev
- Department of Computer Science and Engineering, Lehigh University, 113 Research Drive, Bethlehem, PA USA
| | - Kevin F. Dodd
- Department of Computer Science and Engineering, Lehigh University, 113 Research Drive, Bethlehem, PA USA
| | - Brian Y. Chen
- Department of Computer Science and Engineering, Lehigh University, 113 Research Drive, Bethlehem, PA USA
| |
Collapse
|
5
|
Zhang Y, Sui X, Stagg S, Zhang J. FTIP: an accurate and efficient method for global protein surface comparison. Bioinformatics 2020; 36:3056-3063. [PMID: 32022843 DOI: 10.1093/bioinformatics/btaa076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2019] [Revised: 01/16/2020] [Accepted: 01/28/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Global protein surface comparison (GPSC) studies have been limited compared to other research works on protein structure alignment/comparison due to lack of real applications associated with GPSC. However, the technology advances in cryo-electron tomography (CET) have made methods to identify proteins from their surface shapes extremely useful. RESULTS In this study, we developed a new method called Farthest point sampling (FPS)-enhanced Triangulation-based Iterative-closest-Point (ICP) (FTIP) for GPSC. We applied it to protein classification using only surface shape information. Our method first extracts a set of feature points from protein surfaces using FPS and then uses a triangulation-based efficient ICP algorithm to align the feature points of the two proteins to be compared. Tested on a benchmark dataset with 2329 proteins using nearest-neighbor classification, FTIP outperformed the state-of-the-art method for GPSC based on 3D Zernike descriptors. Using real and simulated cryo-EM data, we show that FTIP could be applied in the future to address problems in protein identification in CET experiments. AVAILABILITY AND IMPLEMENTATION Programs/scripts we developed/used in the study are available at http://ani.stat.fsu.edu/∼yuan/index.fld/FTIP.tar.bz2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Scott Stagg
- Department of Chemistry, Florida State University, Tallahassee, FL 32306, USA
| | | |
Collapse
|
6
|
Budowski-Tal I, Kolodny R, Mandel-Gutfreund Y. A Novel Geometry-Based Approach to Infer Protein Interface Similarity. Sci Rep 2018; 8:8192. [PMID: 29844500 PMCID: PMC5974305 DOI: 10.1038/s41598-018-26497-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2017] [Accepted: 05/10/2018] [Indexed: 11/21/2022] Open
Abstract
The protein interface is key to understand protein function, providing a vital insight on how proteins interact with each other and with other molecules. Over the years, many computational methods to compare protein structures were developed, yet evaluating interface similarity remains a very difficult task. Here, we present PatchBag – a geometry based method for efficient comparison of protein surfaces and interfaces. PatchBag is a Bag-Of-Words approach, which represents complex objects as vectors, enabling to search interface similarity in a highly efficient manner. Using a novel framework for evaluating interface similarity, we show that PatchBag performance is comparable to state-of-the-art alignment-based structural comparison methods. The great advantage of PatchBag is that it does not rely on sequence or fold information, thus enabling to detect similarities between interfaces in unrelated proteins. We propose that PatchBag can contribute to reveal novel evolutionary and functional relationships between protein interfaces.
Collapse
Affiliation(s)
- Inbal Budowski-Tal
- Faculty of Biology, Technion, Israel Institute of Technology, Haifa, 3200003, Israel.,Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838, Israel
| | - Rachel Kolodny
- Department of Computer Science, University of Haifa, Mount Carmel, Haifa, 3498838, Israel.
| | - Yael Mandel-Gutfreund
- Faculty of Biology, Technion, Israel Institute of Technology, Haifa, 3200003, Israel.
| |
Collapse
|
7
|
Zhang J, Chai H, Gao B, Yang G, Ma Z. HEMEsPred: Structure-Based Ligand-Specific Heme Binding Residues Prediction by Using Fast-Adaptive Ensemble Learning Scheme. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:147-156. [PMID: 28029626 DOI: 10.1109/tcbb.2016.2615010] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Heme is an essential biomolecule that widely exists in numerous extant organisms. Accurately identifying heme binding residues (HEMEs) is of great importance in disease progression and drug development. In this study, a novel predictor named HEMEsPred was proposed for predicting HEMEs. First, several sequence- and structure-based features, including amino acid composition, motifs, surface preferences, and secondary structure, were collected to construct feature matrices. Second, a novel fast-adaptive ensemble learning scheme was designed to overcome the serious class-imbalance problem as well as to enhance the prediction performance. Third, we further developed ligand-specific models considering that different heme ligands varied significantly in their roles, sizes, and distributions. Statistical test proved the effectiveness of ligand-specific models. Experimental results on benchmark datasets demonstrated good robustness of our proposed method. Furthermore, our method also showed good generalization capability and outperformed many state-of-art predictors on two independent testing datasets. HEMEsPred web server was available at http://www.inforstation.com/HEMEsPred/ for free academic use.
Collapse
|
8
|
Rasolohery I, Moroy G, Guyon F. PatchSearch: A Fast Computational Method for Off-Target Detection. J Chem Inf Model 2017; 57:769-777. [PMID: 28282119 DOI: 10.1021/acs.jcim.6b00529] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Many therapeutic molecules are known to bind several proteins, which can be different from the initially targeted one. Such unexpected interactions with proteins called off-targets can lead to adverse effects. Potential off-target identification is important to predict to avoid drug side effects or to discover new targets for existing drugs. We propose a new program named PatchSearch that implements local nonsequential searching for similar binding sites on protein surfaces with a controlled amount of flexibility. It is based on detection of quasi-cliques in product graphs representing all the possible matchings between two compared structures. This method has been benchmarked on a large diversity of ligands and on five data sets ranging from 12 to more than 7000 protein structures. The experiments conducted in this study show that the PatchSearch method could be useful in the early identification of off-targets. The program and the benchmarks presented in this paper are available as an R package at https://github.com/MTiPatchSearch .
Collapse
Affiliation(s)
- Inès Rasolohery
- Molécules Thérapeutiques in Silico, UMRS 973, Université Paris Diderot, INSERM , F-75013 Paris, France
| | - Gautier Moroy
- Molécules Thérapeutiques in Silico, UMRS 973, Université Paris Diderot, INSERM , F-75013 Paris, France
| | - Frédéric Guyon
- Molécules Thérapeutiques in Silico, UMRS 973, Université Paris Diderot, INSERM , F-75013 Paris, France
| |
Collapse
|
9
|
Brylinski M. eMatchSite: sequence order-independent structure alignments of ligand binding pockets in protein models. PLoS Comput Biol 2014; 10:e1003829. [PMID: 25232727 PMCID: PMC4168975 DOI: 10.1371/journal.pcbi.1003829] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2014] [Accepted: 07/24/2014] [Indexed: 01/26/2023] Open
Abstract
Detecting similarities between ligand binding sites in the absence of global homology between target proteins has been recognized as one of the critical components of modern drug discovery. Local binding site alignments can be constructed using sequence order-independent techniques, however, to achieve a high accuracy, many current algorithms for binding site comparison require high-quality experimental protein structures, preferably in the bound conformational state. This, in turn, complicates proteome scale applications, where only various quality structure models are available for the majority of gene products. To improve the state-of-the-art, we developed eMatchSite, a new method for constructing sequence order-independent alignments of ligand binding sites in protein models. Large-scale benchmarking calculations using adenine-binding pockets in crystal structures demonstrate that eMatchSite generates accurate alignments for almost three times more protein pairs than SOIPPA. More importantly, eMatchSite offers a high tolerance to structural distortions in ligand binding regions in protein models. For example, the percentage of correctly aligned pairs of adenine-binding sites in weakly homologous protein models is only 4–9% lower than those aligned using crystal structures. This represents a significant improvement over other algorithms, e.g. the performance of eMatchSite in recognizing similar binding sites is 6% and 13% higher than that of SiteEngine using high- and moderate-quality protein models, respectively. Constructing biologically correct alignments using predicted ligand binding sites in protein models opens up the possibility to investigate drug-protein interaction networks for complete proteomes with prospective systems-level applications in polypharmacology and rational drug repositioning. eMatchSite is freely available to the academic community as a web-server and a stand-alone software distribution at http://www.brylinski.org/ematchsite.
Collapse
Affiliation(s)
- Michal Brylinski
- Department of Biological Sciences, Louisiana State University, Baton Rouge, Louisiana, United States of America
- Center for Computation & Technology, Louisiana State University, Baton Rouge, Louisiana, United States of America
- * E-mail:
| |
Collapse
|
10
|
Allen WJ, Rizzo RC. Implementation of the Hungarian algorithm to account for ligand symmetry and similarity in structure-based design. J Chem Inf Model 2014; 54:518-29. [PMID: 24410429 PMCID: PMC3958141 DOI: 10.1021/ci400534h] [Citation(s) in RCA: 63] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023]
Abstract
![]()
False
negative docking outcomes for highly symmetric molecules
are a barrier to the accurate evaluation of docking programs, scoring
functions, and protocols. This work describes an implementation of
a symmetry-corrected root-mean-square deviation (RMSD) method into
the program DOCK based on the Hungarian algorithm for solving the
minimum assignment problem, which dynamically assigns atom correspondence
in molecules with symmetry. The algorithm adds only a trivial amount
of computation time to the RMSD calculations and is shown to increase
the reported overall docking success rate by approximately 5% when
tested over 1043 receptor–ligand systems. For some families
of protein systems the results are even more dramatic, with success
rate increases up to 16.7%. Several additional applications of the
method are also presented including as a pairwise similarity metric
to compare molecules during de novo design, as a scoring function
to rank-order virtual screening results, and for the analysis of trajectories
from molecular dynamics simulation. The new method, including source
code, is available to registered users of DOCK6 (http://dock.compbio.ucsf.edu).
Collapse
Affiliation(s)
- William J Allen
- Department of Applied Mathematics & Statistics, Stony Brook University , Stony Brook, New York 11794, United States
| | | |
Collapse
|
11
|
Li J, Mach P, Koehl P. Measuring the shapes of macromolecules - and why it matters. Comput Struct Biotechnol J 2013; 8:e201309001. [PMID: 24688748 PMCID: PMC3962087 DOI: 10.5936/csbj.201309001] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2013] [Revised: 11/22/2013] [Accepted: 11/22/2013] [Indexed: 11/22/2022] Open
Abstract
The molecular basis of life rests on the activity of biological macromolecules, mostly nucleic acids and proteins. A perhaps surprising finding that crystallized over the last handful of decades is that geometric reasoning plays a major role in our attempt to understand these activities. In this paper, we address this connection between geometry and biology, focusing on methods for measuring and characterizing the shapes of macromolecules. We briefly review existing numerical and analytical approaches that solve these problems. We cover in more details our own work in this field, focusing on the alpha shape theory as it provides a unifying mathematical framework that enable the analytical calculations of the surface area and volume of a macromolecule represented as a union of balls, the detection of pockets and cavities in the molecule, and the quantification of contacts between the atomic balls. We have shown that each of these quantities can be related to physical properties of the molecule under study and ultimately provides insight on its activity. We conclude with a brief description of new challenges for the alpha shape theory in modern structural biology.
Collapse
Affiliation(s)
- Jie Li
- Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, United States
| | - Paul Mach
- Graduate Group of Applied Mathematics, University of California, Davis, 1, Shields Ave, Davis, CA, 95616, United States
| | - Patrice Koehl
- Department of Computer Science and Genome Center, University of California, Davis, 1, Shields Ave, Davis, CA, 95616, United States
| |
Collapse
|
12
|
Tu H, Shi T. Ligand Binding Site Similarity Identification Based on Chemical and Geometric Similarity. Protein J 2013; 32:373-85. [DOI: 10.1007/s10930-013-9494-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022]
|