1
Bakker MJ, Gaffour A, Juhás M, Zapletal V, Stošek J, Bratholm LA, Pavlíková Přecechtělová J. Streamlining NMR Chemical Shift Predictions for Intrinsically Disordered Proteins: Design of Ensembles with Dimensionality Reduction and Clustering. J Chem Inf Model 2024; 64:6542-6556. [PMID: 39099394 PMCID: PMC11412307 DOI: 10.1021/acs.jcim.4c00809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/06/2024]
Abstract
By merging advanced dimensionality reduction (DR) and clustering algorithm (CA) techniques, our study advances the sampling procedure for predicting NMR chemical shifts (CS) in intrinsically disordered proteins (IDPs), making a significant leap forward in the field of protein analysis/modeling. We enhance NMR CS sampling by generating clustered ensembles that accurately reflect the different properties and phenomena encapsulated by the IDP trajectories. This investigation critically assessed different rapid CS predictors, both neural network (e.g., Sparta+ and ShiftX2) and database-driven (ProCS-15), and highlighted the need for more advanced quantum calculations and the subsequent need for more tractable-sized conformational ensembles. Although neural network CS predictors outperformed ProCS-15 for all atoms, all tools showed poor agreement with HN CSs, and the neural network CS predictors were unable to capture the influence of phosphorylated residues, highly relevant for IDPs. This study also addressed the limitations of using direct clustering with collective variables, such as the widespread implementation of the GROMOS algorithm. Clustered ensembles (CEs) produced by this algorithm showed poor performance with chemical shifts compared to sequential ensembles (SEs) of similar size. Instead, we implement a multiscale DR and CA approach and explore the challenges and limitations of applying these algorithms to obtain more robust and tractable CEs. The novel feature of this investigation is the use of solvent-accessible surface area (SASA) as one of the fingerprints for DR alongside previously investigated α carbon distance/angles or ϕ/ψ dihedral angles. The ensembles produced with SASA tSNE DR produced CEs better aligned with the experimental CS of between 0.17 and 0.36 r² (0.18–0.26 ppm) depending on the system and replicate. Furthermore, this technique produced CEs with better agreement than traditional SEs in 85.7% of all ensemble sizes.
This study investigates the quality of ensembles produced based on different input features, comparing latent spaces produced by linear vs nonlinear DR techniques and a novel integrated silhouette score scanning protocol for tSNE DR.
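The abstract above mentions a silhouette score scanning protocol for choosing among candidate clusterings of a tSNE latent space. As a rough illustration only, and not the authors' protocol, the sketch below computes the mean silhouette coefficient in pure Python and picks the best of several candidate labelings. The function names, the toy 2D points, and the candidate labelings are all hypothetical.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient of `points` under cluster `labels`."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette score needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: coefficient is 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def best_labeling(points, candidates):
    """Return the key of the candidate labeling with the highest score."""
    return max(candidates, key=lambda name: silhouette_score(points, candidates[name]))


# Hypothetical latent-space points (e.g. 2D tSNE coordinates of six frames).
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Hypothetical labelings, e.g. from clustering runs with different settings.
toy_candidates = {
    "two_clusters": [0, 0, 0, 1, 1, 1],
    "three_clusters": [0, 0, 1, 2, 2, 2],
}
print(best_labeling(toy_points, toy_candidates))
```

In a real scan the candidate labelings would come from repeated DR and clustering runs over a grid of settings; the scoring step itself stays the same.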
Affiliation(s)
- Michael J Bakker
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
- Amina Gaffour
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
- Martin Juhás
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
  - Department of Chemistry, Faculty of Science, University of Hradec Králové, Rokitanského 62, 500 03 Hradec Králové, Czech Republic
- Vojtěch Zapletal
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
- Jakub Stošek
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
  - Department of Chemistry, Faculty of Science, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic
- Lars A Bratholm
  - School of Chemistry, University of Bristol, Cantock's Close, BS8 1TS Bristol, U.K.
- Jana Pavlíková Přecechtělová
  - Faculty of Pharmacy in Hradec Králové, Charles University, Akademika Heyrovského 1203/8, 500 05 Hradec Králové, Czech Republic
2
PDAUG: a Galaxy based toolset for peptide library analysis, visualization, and machine learning modeling. BMC Bioinformatics 2022; 23:197. [PMID: 35643441 PMCID: PMC9148462 DOI: 10.1186/s12859-022-04727-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/29/2021] [Accepted: 05/11/2022] [Indexed: 11/28/2022]
Abstract
Background: Computational methods based on initial screening and prediction of peptides for desired functions have proven to be effective alternatives to the lengthy and expensive biochemical experimental methods traditionally used in peptide research, thus saving time and effort. However, for many researchers, the lack of expertise in using programming libraries, of access to computational resources, and of flexible pipelines are major hurdles to adopting these advanced methods.
Results: To address the above-mentioned barriers, we have implemented the peptide design and analysis under Galaxy (PDAUG) package, a Galaxy-based, Python-powered collection of tools, workflows, and datasets for rapid in-silico peptide library analysis. In contrast to existing methods like standard programming libraries or rigid single-function web-based tools, PDAUG offers an integrated GUI-based toolset, providing flexibility to build and distribute reproducible pipelines and workflows without programming expertise. Finally, we demonstrate the usability of PDAUG in predicting anticancer properties of peptides using four different feature sets and assess the suitability of various ML algorithms.
Conclusion: PDAUG offers tools for peptide library generation, data visualization, built-in and public database peptide sequence retrieval, peptide feature calculation, and machine learning (ML) modeling. Additionally, this toolset enables researchers to combine PDAUG with hundreds of compatible existing Galaxy tools for limitless analytic strategies.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-022-04727-6.
3
Pujal L, van Zyl M, Vöhringer-Martinez E, Verstraelen T, Bultinck P, Ayers PW, Heidar-Zadeh F. Constrained iterative Hirshfeld charges: A variational approach. J Chem Phys 2022; 156:194109. [PMID: 35597660 DOI: 10.1063/5.0089466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Abstract
We develop a variational procedure for the iterative Hirshfeld (HI) partitioning scheme. The main practical advantage of having a variational framework is that it provides a formal and straightforward approach for imposing constraints (e.g., fixed charges on certain atoms or molecular fragments) when computing HI atoms and their properties. Unlike many other variants of the Hirshfeld partitioning scheme, HI charges do not arise naturally from the information-theoretic framework, but only as a reverse-engineered construction of the objective function. However, the procedure we use is quite general and could be applied to other problems as well. We also prove that there is always at least one solution to the HI equations, but we could not prove that its self-consistent equations would always converge for any given initial pro-atom charges. Our numerical assessment of the constrained iterative Hirshfeld method shows that it satisfies many desirable traits of atoms in molecules and has the potential to surpass existing approaches for adding constraints when computing atomic properties.
Affiliation(s)
- Leila Pujal
  - Department of Chemistry, Queen's University, 90 Bader Lane, Kingston, Ontario K7N 3N6, Canada
- Maximilian van Zyl
  - Department of Chemistry, Queen's University, 90 Bader Lane, Kingston, Ontario K7N 3N6, Canada
- Esteban Vöhringer-Martinez
  - Departamento de Físico-Química, Facultad de Ciencias Químicas, Universidad de Concepción, Concepción, Chile
- Toon Verstraelen
  - Center for Molecular Modeling (CMM), Ghent University, Technologiepark-Zwijnaarde 46, B-9052 Zwijnaarde, Belgium
- Patrick Bultinck
  - Ghent Quantum Chemistry Group, Department of Chemistry, Ghent University, Krijgslaan 281 S3, B-9000 Ghent, Belgium
- Paul W Ayers
  - Department of Chemistry and Chemical Biology, McMaster University, Hamilton, Ontario L8S 4L8, Canada
- Farnaz Heidar-Zadeh
  - Department of Chemistry, Queen's University, 90 Bader Lane, Kingston, Ontario K7N 3N6, Canada
4
Cuevas-Zuviría B, Pacios LF. Machine Learning of Analytical Electron Density in Large Molecules Through Message-Passing. J Chem Inf Model 2021; 61:2658-2666. [PMID: 34009970 DOI: 10.1021/acs.jcim.1c00227] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 11/28/2022]
Abstract
Machine learning milestones in computational chemistry are overshadowed by their unaccountability and the overwhelming zoo of tools for each specific task. A promising path to tackle these problems is using machine learning to reproduce physical magnitudes as a basis to derive many other properties. By using a model of the electron density consisting of an analytical expansion on a linear set of isotropic and anisotropic functions, we implemented in this work a message-passing neural network able to reproduce electron density in molecules with just a 2.5% absolute error in complex cases. We also adapted our methodology to describe electron density in large biomolecules (proteins) and to obtain atomic charges, interaction energies, and DFT energies. We show that electron density learning is a new promising avenue with a variety of forthcoming applications.
Affiliation(s)
- Bruno Cuevas-Zuviría
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM), Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Pozuelo de Alarcón, Madrid, Spain
- Luis F Pacios
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM), Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223 Pozuelo de Alarcón, Madrid, Spain
  - Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agraria, Alimentaria y de Biosistemas (ETSIAAB), Universidad Politécnica de Madrid (UPM), 28040 Madrid, Spain
5
Reinholdt P, Kjellgren ER, Steinmann C, Olsen JMH. Cost-Effective Potential for Accurate Polarizable Embedding Calculations in Protein Environments. J Chem Theory Comput 2020; 16:1162-1174. [PMID: 31855427 DOI: 10.1021/acs.jctc.9b00616] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 01/10/2023]
Abstract
The fragment-based polarizable embedding (PE) model combined with an appropriate electronic structure method constitutes a highly efficient and accurate multiscale approach for computing spectroscopic properties of a central moiety including effects from its molecular environment through an embedding potential. There is, however, a comparatively high computational overhead associated with the computation of the embedding potential, which is derived from first-principles calculations on individual fragments of the environment. To reduce the computational cost associated with the calculation of embedding potential parameters, we developed a set of amino acid-specific transferable parameters tailored for large-scale PE-based calculations that include proteins. The amino acid-based parameters are obtained by simultaneously fitting to a set of reference electric potentials based on structures derived from a backbone-dependent rotamer library. The developed cost-effective polarizable protein potential (CP3) consists of atom-centered charges and isotropic dipole-dipole polarizabilities of the standard amino acids. In terms of reproduction of electric potentials, the CP3 is shown to perform consistently and with acceptable accuracy across both small tripeptide test systems and larger proteins. We show, through applications on realistic protein systems, that acceptable accuracy can be obtained by using a pure CP3 representation of the protein environment, thus altogether omitting the cost associated with the calculation of embedding potential parameters. High accuracy comparable to that of the full fragment-based approach can be achieved through a mixed description where the CP3 is used only to describe amino acids beyond a threshold distance from the central quantum part.
Affiliation(s)
- Peter Reinholdt
  - Department of Physics, Chemistry and Pharmacy, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark
- Erik Rosendahl Kjellgren
  - Department of Physics, Chemistry and Pharmacy, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark
- Casper Steinmann
  - Department of Chemistry and Bioscience, Aalborg University, Fredrik Bajers Vej 7H, DK-9220 Aalborg, Denmark
- Jógvan Magnus Haugaard Olsen
  - Hylleraas Centre for Quantum Molecular Sciences, Department of Chemistry, UiT The Arctic University of Norway, Tromsø N-9037, Norway
6
Larsen AS, Bratholm LA, Christensen AS, Channir M, Jensen JH. ProCS15: a DFT-based chemical shift predictor for backbone and Cβ atoms in proteins. PeerJ 2015; 3:e1344. [PMID: 26623185 PMCID: PMC4662583 DOI: 10.7717/peerj.1344] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/25/2015] [Accepted: 10/01/2015] [Indexed: 12/16/2022]
Abstract
We present ProCS15: a program that computes the isotropic chemical shielding values of backbone and Cβ atoms given a protein structure in less than a second. ProCS15 is based on around 2.35 million OPBE/6-31G(d,p)//PM6 calculations on tripeptides and small structural models of hydrogen bonding. The ProCS15-predicted chemical shielding values are compared to experimentally measured chemical shifts for Ubiquitin and the third IgG-binding domain of Protein G through linear regression and yield RMSD values of up to 2.2, 0.7, and 4.8 ppm for carbon, hydrogen, and nitrogen atoms, respectively. These RMSD values are very similar to the corresponding RMSD values computed using OPBE/6-31G(d,p) for the entire structure of each protein. These maximum RMSD values can be reduced by using NMR-derived structural ensembles of Ubiquitin. For example, for the largest ensemble the largest RMSD values are 1.7, 0.5, and 3.5 ppm for carbon, hydrogen, and nitrogen, respectively. The corresponding RMSD values predicted by several empirical chemical shift predictors range between 0.7–1.1, 0.2–0.4, and 1.8–2.8 ppm for carbon, hydrogen, and nitrogen atoms, respectively.
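The abstract above compares predicted shielding values with experimental shifts through linear regression and reports RMSD values. The sketch below shows one generic way such a comparison can be computed: an ordinary least-squares line from shielding to shift, followed by the RMSD of the residuals. It is an illustration under stated assumptions, not the ProCS15 code; the function names and the numbers in the usage example are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y ≈ slope * x + intercept."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def regression_rmsd(shieldings, shifts):
    """Fit shifts against shieldings and return (slope, intercept, RMSD in ppm)."""
    slope, intercept = fit_line(shieldings, shifts)
    residuals = [d - (slope * s + intercept) for s, d in zip(shieldings, shifts)]
    rmsd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return slope, intercept, rmsd


# Hypothetical values for a handful of carbon atoms (ppm); not real data.
predicted_shielding = [130.2, 128.7, 126.1, 124.9, 122.3]
experimental_shift = [52.1, 54.0, 56.2, 57.9, 60.4]
slope, intercept, rmsd = regression_rmsd(predicted_shielding, experimental_shift)
print(f"slope={slope:.3f} intercept={intercept:.2f} RMSD={rmsd:.3f} ppm")
```

Because shielding and shift are related with opposite sign, the fitted slope is expected to be negative; the RMSD then measures the scatter left after that linear mapping.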
Affiliation(s)
- Anders S Larsen
  - Department of Pharmacy, University of Copenhagen, Copenhagen, Denmark
- Lars A Bratholm
  - Department of Chemistry, University of Copenhagen, Copenhagen, Denmark
- Maher Channir
  - Department of Chemistry, University of Copenhagen, Copenhagen, Denmark
- Jan H Jensen
  - Department of Chemistry, University of Copenhagen, Copenhagen, Denmark