1
Chen L, Mondal A, Perez A, Miranda-Quintana RA. Protein Retrieval via Integrative Molecular Ensembles (PRIME) through Extended Similarity Indices. J Chem Theory Comput 2024; 20:6303-6315. [PMID: 38978294 DOI: 10.1021/acs.jctc.4c00362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Molecular dynamics (MD) simulations are ideally suited to describe conformational ensembles of biomolecules such as proteins and nucleic acids. Microsecond-long simulations are now routine, facilitated by the emergence of graphics processing units. Clustering, which groups objects based on structural similarity, is typically used to process ensembles, yielding the different states, their populations, and representative structures. A popular pipeline combines hierarchical clustering with selection of the cluster centroid as the representative of each cluster. Here, we propose to improve on this approach by developing a module, Protein Retrieval via Integrative Molecular Ensembles (PRIME), that consists of tools to improve the prediction of the representative in the most populated cluster using extended continuous similarity. PRIME is integrated with our Molecular Dynamics Analysis with N-ary Clustering Ensembles (MDANCE) package and can be used as a postprocessing tool for arbitrary clustering algorithms, compatible with several MD suites. PRIME predictions produced structures that, when aligned to the experimental structure, were better superposed (lower RMSD). A further benefit of PRIME is its linear scaling, rather than the O(N^2) scaling traditionally associated with comparisons of elements in a set.
Affiliation(s)
- Lexin Chen
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
- Arup Mondal
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
- Alberto Perez
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
- Ramón Alain Miranda-Quintana
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
2
López-Pérez K, López-López E, Medina-Franco JL, Miranda-Quintana RA. Sampling and Mapping Chemical Space with Extended Similarity Indices. Molecules 2023; 28:6333. [PMID: 37687162 PMCID: PMC10489020 DOI: 10.3390/molecules28176333] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 08/24/2023] [Accepted: 08/26/2023] [Indexed: 09/10/2023] [Open Access]
Abstract
Visualization of chemical space is useful in many aspects of chemistry, including compound library design, diversity analysis, and exploring structure-property relationships, to name a few. Notable research areas where the visualization of chemical space has strong applications are drug discovery and natural product research. However, the sheer volume of even comparatively small subsections of chemical space means that approximations are needed when navigating it. ChemMaps is a visualization methodology that approximates the distribution of compounds in large datasets based on the selection of satellite compounds that yield a similar mapping of the whole dataset when principal component analysis on a similarity matrix is performed. Here, we show how the recently proposed extended similarity indices can help find regions that are relevant to sample satellites and reduce the amount of high-dimensional data needed to describe a library's chemical space.
Affiliation(s)
- Kenneth López-Pérez
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, FL 32611, USA
- Edgar López-López
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City 04510, Mexico
  - Department of Chemistry and Graduate Program in Pharmacology, Center for Research and Advanced Studies of the National Polytechnic Institute, Mexico City 07000, Mexico
- José L. Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City 04510, Mexico
3
Bajusz D, Pándy-Szekeres G, Takács Á, de Araujo ED, Keserű GM. SH2db, an information system for the SH2 domain. Nucleic Acids Res 2023:7173719. [PMID: 37207333 DOI: 10.1093/nar/gkad420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2023] [Revised: 05/04/2023] [Accepted: 05/07/2023] [Indexed: 05/21/2023] [Open Access]
Abstract
SH2 domains are key mediators of phosphotyrosine-based signalling, and therapeutic targets for diverse, mostly oncological, disease indications. They have a highly conserved structure with a central beta sheet that divides the binding surface of the protein into two main pockets, responsible for phosphotyrosine binding (pY pocket) and substrate specificity (pY+3 pocket). In recent years, structural databases have proven to be invaluable resources for the drug discovery community, as they contain highly relevant and up-to-date information on important protein classes. Here, we present SH2db, a comprehensive structural database and webserver for SH2 domain structures. To organize these protein structures efficiently, we introduce (i) a generic residue numbering scheme to enhance the comparability of different SH2 domains, and (ii) a structure-based multiple sequence alignment of all 120 human wild-type SH2 domain sequences and their PDB and AlphaFold structures. The aligned sequences and structures can be searched, browsed and downloaded from the online interface of SH2db (http://sh2db.ttk.hu), with functions to conveniently prepare multiple structures into a PyMOL session, and to export simple charts on the contents of the database. Our hope is that SH2db can assist researchers in their day-to-day work by becoming a one-stop shop for SH2 domain-related research.
Affiliation(s)
- Dávid Bajusz
  - Medicinal Chemistry Research Group and National Laboratory for Drug Research and Development, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Gáspár Pándy-Szekeres
  - Medicinal Chemistry Research Group and National Laboratory for Drug Research and Development, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
  - Department of Drug Design and Pharmacology, University of Copenhagen, Universitetsparken 2, 2100 Copenhagen, Denmark
- Ágnes Takács
  - Medicinal Chemistry Research Group and National Laboratory for Drug Research and Development, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Elvin D de Araujo
  - Centre for Medicinal Chemistry, University of Toronto at Mississauga, Mississauga, ON L5L 1C6, Canada
- György M Keserű
  - Medicinal Chemistry Research Group and National Laboratory for Drug Research and Development, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
  - Department of Organic Chemistry and Technology, Faculty of Chemical Technology and Biotechnology, Budapest University of Technology and Economics, Műegyetem rkp. 3, 1111 Budapest, Hungary
4
Rácz A, Mihalovits LM, Bajusz D, Héberger K, Miranda-Quintana RA. Molecular Dynamics Simulations and Diversity Selection by Extended Continuous Similarity Indices. J Chem Inf Model 2022; 62:3415-3425. [PMID: 35834424 PMCID: PMC9326969 DOI: 10.1021/acs.jcim.2c00433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Molecular dynamics (MD) is a core methodology of molecular modeling and computational design for the study of the dynamics and temporal evolution of molecular systems. MD simulations have particularly benefited from the rapid increase of computational power that has characterized the past decades of computational chemical research, being the first method to be successfully migrated to the GPU infrastructure. While new-generation MD software is capable of delivering simulations on an ever-increasing scale, relatively less effort is invested in developing postprocessing methods that can keep up with the quickly expanding volumes of data that are being generated. Here, we introduce a new idea for sampling frames from large MD trajectories, based on the recently introduced framework of extended similarity indices. Our approach presents a new, linearly scaling alternative to the traditional approach of applying a clustering algorithm that usually scales as a quadratic function of the number of frames. When showcasing its usage on case studies with different system sizes and simulation lengths, we have registered speedups of up to 2 orders of magnitude, as compared to traditional clustering algorithms. The conformational diversity of the selected frames is also noticeably higher, which is a further advantage for certain applications, such as the selection of structural ensembles for ligand docking. The method is available open-source at https://github.com/ramirandaq/MultipleComparisons.
Affiliation(s)
- Anita Rácz
  - Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Levente M Mihalovits
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Dávid Bajusz
  - Medicinal Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Károly Héberger
  - Plasma Chemistry Research Group, Research Centre for Natural Sciences, Magyar tudósok krt. 2, 1117 Budapest, Hungary
- Ramón Alain Miranda-Quintana
  - Department of Chemistry and Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States
5
Extended continuous similarity indices: theory and application for QSAR descriptor selection. J Comput Aided Mol Des 2022; 36:157-173. [PMID: 35288838 DOI: 10.1007/s10822-022-00444-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/11/2021] [Accepted: 02/23/2022] [Indexed: 01/10/2023]
Abstract
Extended (or n-ary) similarity indices have been recently proposed to extend the comparative analysis of binary strings. Going beyond the traditional notion of pairwise comparisons, these novel indices allow comparing any number of objects at the same time. This results in a remarkable efficiency gain with respect to other approaches, since now we can compare N molecules in O(N) instead of the common quadratic O(N^2) timescale. This favorable scaling has motivated the application of these indices to diversity selection, clustering, phylogenetic analysis, chemical space visualization, and post-processing of molecular dynamics simulations. However, the current formulation of the n-ary indices is limited to vectors with binary or categorical inputs. Here, we present the further generalization of this formalism so it can be applied to numerical data, i.e. to vectors with continuous components. We discuss several ways to achieve this extension and present their analytical properties. As a practical example, we apply this formalism to the problem of feature selection in QSAR and prove that the extended continuous similarity indices provide a convenient way to discern between several sets of descriptors.
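The O(N) versus O(N^2) contrast in this abstract can be illustrated with a minimal sketch. It assumes plain squared Euclidean distance as a stand-in and is not the extended continuous similarity index defined in the paper; the function names are hypothetical. The point is only that a set-level quantity over continuous vectors can be obtained from column sums in one pass, because the sum of all pairwise squared distances equals N times the sum of squared entries minus the squared norm of the column-sum vector.

```python
import math
import random


def pairwise_sq_dist_sum_naive(rows):
    """O(N^2 * D): visit every unordered pair of rows."""
    total = 0.0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            total += sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
    return total


def pairwise_sq_dist_sum_linear(rows):
    """O(N * D): the same quantity from column sums and sums of squares."""
    n = len(rows)
    col_sums = [sum(col) for col in zip(*rows)]       # one total per feature
    sq_sum = sum(v * v for row in rows for v in row)  # sum of squared entries
    return n * sq_sum - sum(s * s for s in col_sums)


# Illustrative random data: 100 objects described by 8 continuous features.
random.seed(0)
data = [[random.random() for _ in range(8)] for _ in range(100)]
naive = pairwise_sq_dist_sum_naive(data)
linear = pairwise_sq_dist_sum_linear(data)
print(math.isclose(naive, linear, rel_tol=1e-9))
```

Both functions return the same number up to floating-point rounding; only the second avoids the double loop over objects.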
6
Chang L, Perez A, Miranda-Quintana RA. Improving the analysis of biological ensembles through extended similarity measures. Phys Chem Chem Phys 2021; 24:444-451. [PMID: 34897334 DOI: 10.1039/d1cp04019g] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/26/2022]
Abstract
We present new algorithms to classify structural ensembles of macromolecules based on the recently proposed extended similarity measures. Molecular dynamics provides a wealth of structural information on systems of biological interest. As computer power increases, we capture larger ensembles and larger conformational transitions between states. Typically, structural clustering provides the statistical mechanics treatment of the system to identify relevant biological states. The key advantage of our approach is that the newly introduced extended similarity indices reduce the computational complexity of assessing the similarity of a set of structures from O(N^2) to O(N). Here we take advantage of this favorable cost to develop several highly efficient techniques, including a linear-scaling algorithm to determine the medoid of a set (which we effectively use to select the most representative structure of a cluster). Moreover, we use our extended similarity indices as a linkage criterion in a novel hierarchical agglomerative clustering algorithm. We apply these new metrics to analyze the ensembles of several systems of biological interest such as folding and binding of macromolecules (peptide, protein, DNA-protein). In particular, we design a new workflow that is capable of identifying the most important conformations contributing to the protein folding process. We show excellent performance in the resulting clusters (surpassing traditional linkage criteria), along with faster performance and an efficient cost-function to identify when to merge clusters.
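A linear-scaling medoid search of the kind mentioned in this abstract can be sketched under a simplifying assumption: the dissimilarity is squared Euclidean distance between coordinate vectors, not the extended similarity indices the authors use. The function names and the toy `frames` data are hypothetical. Under that assumption, the total squared distance from object i to all others is N·|x_i|² − 2·x_i·S + Q, where S is the column-sum vector and Q the sum of all squared norms; Q is the same for every object, so it can be dropped when ranking.

```python
def medoid_naive(rows):
    """O(N^2 * D): index of the row with the smallest summed squared distance."""
    def cost(i):
        return sum(
            sum((a - b) ** 2 for a, b in zip(rows[i], other)) for other in rows
        )
    return min(range(len(rows)), key=cost)


def medoid_linear(rows):
    """O(N * D): same ranking, using column sums instead of pair loops."""
    n = len(rows)
    col_sums = [sum(col) for col in zip(*rows)]

    def cost(row):
        # n*|x|^2 - 2*x.S; the constant term Q is omitted because it
        # does not change which row attains the minimum.
        sq_norm = sum(v * v for v in row)
        dot = sum(v * s for v, s in zip(row, col_sums))
        return n * sq_norm - 2.0 * dot

    return min(range(n), key=lambda i: cost(rows[i]))


# Toy "frames": three nearby points and one outlier along the first axis.
frames = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
print(medoid_naive(frames), medoid_linear(frames))
```

Both functions select the same frame here; the linear version touches each frame a constant number of times, which is the property the abstract highlights.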
Affiliation(s)
- Liwei Chang
  - Department of Chemistry, University of Florida, Gainesville, FL, 32611, USA
- Alberto Perez
  - Department of Chemistry, University of Florida, Gainesville, FL, 32611, USA
  - Quantum Theory Project, University of Florida, Gainesville, FL, 32611, USA
- Ramón Alain Miranda-Quintana
  - Department of Chemistry, University of Florida, Gainesville, FL, 32611, USA
  - Quantum Theory Project, University of Florida, Gainesville, FL, 32611, USA
7
Flores-Padilla EA, Juárez-Mercado KE, Naveja JJ, Kim TD, Alain Miranda-Quintana R, Medina-Franco JL. Chemoinformatic Characterization of Synthetic Screening Libraries Focused on Epigenetic Targets. Mol Inform 2021; 41:e2100285. [PMID: 34931466 DOI: 10.1002/minf.202100285] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/25/2021] [Accepted: 12/08/2021] [Indexed: 02/03/2023]
Abstract
The importance of epigenetic drug and probe discovery is on the rise. This is not only paramount to identify and develop therapeutic treatments associated with epigenetic processes but also to understand the underlying epigenetic mechanisms involved in biological processes. To this end, chemical vendors have been developing synthetic compound libraries focused on epigenetic targets to increase the probabilities of identifying promising starting points for drug or probe candidates. However, the chemical contents of these data sets, the distribution of their physicochemical properties, and diversity remain unknown. To fill this gap and make this information available to the scientific community, we report a comprehensive analysis of eleven libraries focused on epigenetic targets containing more than 50,000 compounds. We used well-validated chemoinformatics approaches to characterize these sets, including novel methods such as automated detection of analog series and visual representations of the chemical space based on Constellation Plots and Chemical Library Networks. This work will guide the efforts of experimental groups working on high-throughput and medium-throughput screening of epigenetic-focused libraries. The outcome of this work can also be used as a reference to design and describe novel focused epigenetic libraries.
Affiliation(s)
- E Alexis Flores-Padilla
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City, 04510, Mexico
- K Eurídice Juárez-Mercado
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City, 04510, Mexico
- José J Naveja
  - Instituto de Quimica, National Autonomous University of Mexico, Mexico City, 04510, Mexico
- Taewon D Kim
  - Department of Chemistry, University of Florida, Gainesville, Florida, 32611, United States
- Ramón Alain Miranda-Quintana
  - Department of Chemistry, University of Florida, Gainesville, Florida, 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida, 32611, United States
- José L Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City, 04510, Mexico
8
Dunn TB, Seabra GM, Kim TD, Juárez-Mercado KE, Li C, Medina-Franco JL, Miranda-Quintana RA. Diversity and Chemical Library Networks of Large Data Sets. J Chem Inf Model 2021; 62:2186-2201. [PMID: 34723537 DOI: 10.1021/acs.jcim.1c01013] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Indexed: 01/05/2023]
Abstract
The quantification of chemical diversity has many applications in drug discovery, organic chemistry, food, and natural product chemistry, to name a few. As the size of the chemical space is expanding rapidly, it is imperative to develop efficient methods to quantify the diversity of large and ultralarge chemical libraries and visualize their mutual relationships in chemical space. Herein, we show an application of our recently introduced extended similarity indices to measure the fingerprint-based diversity of 19 chemical libraries typically used in drug discovery and natural products research with over 18 million compounds. Based on this concept, we introduce the Chemical Library Networks (CLNs) as a general and efficient framework to represent visually the chemical space of large chemical libraries providing a global perspective of the relation between the libraries. For the 19 compound libraries explored in this work, it was found that the (extended) Tanimoto index offers the best description of extended similarity in combination with RDKit fingerprints. CLNs are general and can be explored with any structure representation and similarity coefficient for large chemical libraries.
Affiliation(s)
- Timothy B Dunn
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
- Gustavo M Seabra
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32610, United States
  - Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida, Gainesville, Florida 32610, United States
- Taewon David Kim
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
- K Eurídice Juárez-Mercado
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City 04510, Mexico
- Chenglong Li
  - Department of Medicinal Chemistry, University of Florida, Gainesville, Florida 32610, United States
  - Center for Natural Products, Drug Discovery and Development (CNPD3), University of Florida, Gainesville, Florida 32610, United States
- José L Medina-Franco
  - DIFACQUIM Research Group, Department of Pharmacy, National Autonomous University of Mexico, Mexico City 04510, Mexico
- Ramón Alain Miranda-Quintana
  - Department of Chemistry, University of Florida, Gainesville, Florida 32611, United States
  - Quantum Theory Project, University of Florida, Gainesville, Florida 32611, United States