1
Appadurai R, Koneru JK, Bonomi M, Robustelli P, Srivastava A. Clustering Heterogeneous Conformational Ensembles of Intrinsically Disordered Proteins with t-Distributed Stochastic Neighbor Embedding. J Chem Theory Comput 2023; 19:4711-4727. [PMID: 37338049 PMCID: PMC11108026 DOI: 10.1021/acs.jctc.3c00224] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/21/2023]
Abstract
Intrinsically disordered proteins (IDPs) populate a range of conformations that are best described by a heterogeneous ensemble. Grouping an IDP ensemble into "structurally similar" clusters for visualization, interpretation, and analysis purposes is a much-desired but formidable task, as the conformational space of IDPs is inherently high-dimensional and reduction techniques often result in ambiguous classifications. Here, we employ the t-distributed stochastic neighbor embedding (t-SNE) technique to generate homogeneous clusters of IDP conformations from the full heterogeneous ensemble. We illustrate the utility of t-SNE by clustering conformations of two disordered proteins, Aβ42, and α-synuclein, in their APO states and when bound to small molecule ligands. Our results shed light on ordered substates within disordered ensembles and provide structural and mechanistic insights into binding modes that confer specificity and affinity in IDP ligand binding. t-SNE projections preserve the local neighborhood information, provide interpretable visualizations of the conformational heterogeneity within each ensemble, and enable the quantification of cluster populations and their relative shifts upon ligand binding. Our approach provides a new framework for detailed investigations of the thermodynamics and kinetics of IDP ligand binding and will aid rational drug design for IDPs.
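The quantification of cluster populations and their shifts upon ligand binding, mentioned in the abstract above, can be illustrated with a minimal sketch. It assumes that apo and ligand-bound frames were embedded and clustered together so that cluster labels are comparable across the two ensembles; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter

def cluster_populations(labels):
    """Fraction of frames assigned to each cluster label."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in sorted(counts)}

def population_shift(apo_labels, bound_labels):
    """Per-cluster change in population (bound minus apo)."""
    apo = cluster_populations(apo_labels)
    bound = cluster_populations(bound_labels)
    clusters = sorted(set(apo) | set(bound))
    return {c: bound.get(c, 0.0) - apo.get(c, 0.0) for c in clusters}

# Toy cluster assignments for ten frames per ensemble (illustrative only).
apo_labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
bound_labels = [0, 1, 1, 1, 1, 1, 2, 2, 2, 2]
shift = population_shift(apo_labels, bound_labels)
for cluster, delta in shift.items():
    print(f"cluster {cluster}: {delta:+.2f}")
```

A positive value means the cluster is more populated in the ligand-bound ensemble than in the apo ensemble.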
Affiliation(s)
- Rajeswari Appadurai
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka 560012, India
- Massimiliano Bonomi
  - Structural Bioinformatics Unit, Department of Structural Biology and Chemistry. CNRS UMR 3528, C3BI, CNRS USR 3756, Institut Pasteur, Paris, France
- Paul Robustelli
  - Dartmouth College, Department of Chemistry, Hanover, NH, 03755, USA
- Anand Srivastava
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, Karnataka 560012, India
2
Chen H, Liu H, Feng H, Fu H, Cai W, Shao X, Chipot C. MLCV: Bridging Machine-Learning-Based Dimensionality Reduction and Free-Energy Calculation. J Chem Inf Model 2021; 62:1-8. [PMID: 34939790 DOI: 10.1021/acs.jcim.1c01010] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Indexed: 11/28/2022]
Abstract
Importance-sampling algorithms leaning on the definition of a model reaction coordinate (RC) are widely employed to probe processes relevant to chemistry and biology alike, spanning time scales not amenable to common, brute-force molecular dynamics (MD) simulations. In practice, the model RC often consists of a handful of collective variables (CVs) chosen on the basis of chemical intuition. However, constructing manually a low-dimensional RC model to describe an intricate geometrical transformation for the purpose of free-energy calculations and analyses remains a daunting challenge due to the inherent complexity of the conformational transitions at play. To solve this issue, remarkable progress has been made in employing machine-learning techniques, such as autoencoders, to extract the low-dimensional RC model from a large set of CVs. Implementation of the differentiable, nonlinear machine-learned CVs in common MD engines to perform free-energy calculations is, however, particularly cumbersome. To address this issue, we present here a user-friendly tool (called MLCV) that facilitates the use of machine-learned CVs in importance-sampling simulations through the popular Colvars module. Our approach is critically probed with three case examples consisting of small peptides, showcasing that through hard-coded neural network in Colvars, deep-learning and enhanced-sampling can be effectively bridged with MD simulations. The MLCV code is versatile, applicable to all the CVs available in Colvars, and can be connected to any kind of dense neural networks. We believe that MLCV provides an effective, powerful, and user-friendly platform accessible to experts and nonexperts alike for machine-learning (ML)-guided CV discovery and enhanced-sampling simulations to unveil the molecular mechanisms underlying complex biochemical processes.
Affiliation(s)
- Haochuan Chen
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Han Liu
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Heying Feng
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Haohao Fu
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Wensheng Cai
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Xueguang Shao
  - Research Center for Analytical Sciences, Frontiers Science Center for New Organic Matter, College of Chemistry, Nankai University, Tianjin 300071, China
  - Tianjin Key Laboratory of Biosensing and Molecular Recognition, Tianjin 300071, China
  - State Key Laboratory of Medicinal Chemical Biology, Tianjin 300071, China
- Christophe Chipot
  - Laboratoire International Associé CNRS and University of Illinois at Urbana-Champaign, UMR no. 7019, Université de Lorraine, BP 70239, F-54506 Vandœuvre-lès-Nancy, France
3
Pospelov N, Tetereva A, Martynova O, Anokhin K. The Laplacian eigenmaps dimensionality reduction of fMRI data for discovering stimulus-induced changes in the resting-state brain activity. NEUROIMAGE: REPORTS 2021. [DOI: 10.1016/j.ynirp.2021.100035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/12/2022]
4
Glielmo A, Husic BE, Rodriguez A, Clementi C, Noé F, Laio A. Unsupervised Learning Methods for Molecular Simulation Data. Chem Rev 2021; 121:9722-9758. [PMID: 33945269 PMCID: PMC8391792 DOI: 10.1021/acs.chemrev.0c01195] [Citation(s) in RCA: 116] [Impact Index Per Article: 38.7] [Received: 11/05/2020] [Indexed: 12/21/2022]
Abstract
Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations, in material science, solid state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms of dimensionality reduction, density estimation, clustering, and kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.
Affiliation(s)
- Aldo Glielmo
  - International School for Advanced Studies (SISSA), 34014 Trieste, Italy
- Brooke E. Husic
  - Freie Universität Berlin, Department of Mathematics and Computer Science, 14195 Berlin, Germany
- Alex Rodriguez
  - International Centre for Theoretical Physics (ICTP), Condensed Matter and Statistical Physics Section, 34100 Trieste, Italy
- Cecilia Clementi
  - Freie Universität Berlin, Department for Physics, 14195 Berlin, Germany
  - Rice University Houston, Department of Chemistry, Houston, Texas 77005, United States
- Frank Noé
  - Freie Universität Berlin, Department of Mathematics and Computer Science, 14195 Berlin, Germany
  - Freie Universität Berlin, Department for Physics, 14195 Berlin, Germany
  - Rice University Houston, Department of Chemistry, Houston, Texas 77005, United States
- Alessandro Laio
  - International School for Advanced Studies (SISSA), 34014 Trieste, Italy
  - International Centre for Theoretical Physics (ICTP), Condensed Matter and Statistical Physics Section, 34100 Trieste, Italy
5
Trozzi F, Wang X, Tao P. UMAP as a Dimensionality Reduction Tool for Molecular Dynamics Simulations of Biomacromolecules: A Comparison Study. J Phys Chem B 2021; 125:5022-5034. [PMID: 33973773 PMCID: PMC8356557 DOI: 10.1021/acs.jpcb.1c02081] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Indexed: 01/01/2023]
Abstract
Proteins are the molecular machines of life. The multitude of possible conformations that proteins can adopt determines their free-energy landscapes. However, the inherently high dimensionality of a protein free-energy landscape poses a challenge to deciphering how proteins perform their functions. For this reason, dimensionality reduction is an active field of research for molecular biologists. The uniform manifold approximation and projection (UMAP) is a dimensionality reduction method based on a fuzzy topological analysis of data. In the present study, the performance of UMAP is compared with that of other popular dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE), principal component analysis (PCA), and time-structure independent components analysis (tICA) in the context of analyzing molecular dynamics simulations of the circadian clock protein VIVID. A good dimensionality reduction method should accurately represent the data structure on the projected components. The comparison of the raw high-dimensional data with the projections obtained using different dimensionality reduction methods based on various metrics showed that UMAP has superior performance when compared with linear reduction methods (PCA and tICA) and has competitive performance and scalable computational cost.
Affiliation(s)
- Francesco Trozzi
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas, 75275, United States of America
- Xinlei Wang
  - Department of Statistical Science, Southern Methodist University, Dallas, Texas, 75275, United States of America
- Peng Tao
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas, 75275, United States of America
6
Fajardo TN, Heyden M. Dissecting the Conformational Free Energy of a Small Peptide in Solution. J Phys Chem B 2021; 125:4634-4644. [PMID: 33942611 DOI: 10.1021/acs.jpcb.1c00699] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 01/25/2023]
Abstract
The free energy surface of a small peptide was analyzed based on an unbiased microsecond molecular dynamics simulation. The peptide sampled disordered conformational ensembles of distinct compactness, and its free energy was decomposed into separate contributions from the intramolecular potential energy, conformational entropy, and solvation free energy. The latter was further broken down into enthalpic and entropic contributions due to peptide-water and water-water interactions. This decomposition was enabled by a generalized linear response relation between the peptide-water interaction energy and the solvation free energy, which was empirically parametrized by explicit solvation free energy calculations for representative peptide conformations. This full dissection of the peptide free energy identifies individual contributions that stabilize and destabilize compact and extended peptide conformational ensembles and reveals the origin of a free energy barrier associated with transitions between them.
Affiliation(s)
- Tawny N Fajardo
  - School of Molecular Sciences, Arizona State University, Tempe, Arizona 85287-1604, United States
- Matthias Heyden
  - School of Molecular Sciences, Arizona State University, Tempe, Arizona 85287-1604, United States
7
Rogers DM. Protein Conformational States: A First Principles Bayesian Method. ENTROPY 2020; 22:e22111242. [PMID: 33287010 PMCID: PMC7712966 DOI: 10.3390/e22111242] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/08/2020] [Revised: 10/23/2020] [Accepted: 10/29/2020] [Indexed: 12/19/2022]
Abstract
Automated identification of protein conformational states from simulation of an ensemble of structures is a hard problem because it requires teaching a computer to recognize shapes. We adapt the naïve Bayes classifier from the machine learning community for use on atom-to-atom pairwise contacts. The result is an unsupervised learning algorithm that samples a ‘distribution’ over potential classification schemes. We apply the classifier to a series of test structures and one real protein, showing that it identifies the conformational transition with >95% accuracy in most cases. A nontrivial feature of our adaptation is a new connection to information entropy that allows us to vary the level of structural detail without spoiling the categorization. This is confirmed by comparing results as the number of atoms and time-samples are varied over 1.5 orders of magnitude. Further, the method’s derivation from Bayesian analysis on the set of inter-atomic contacts makes it easy to understand and extend to more complex cases.
Affiliation(s)
- David M Rogers
  - National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
8
Spiwok V, Kříž P. Time-Lagged t-Distributed Stochastic Neighbor Embedding (t-SNE) of Molecular Simulation Trajectories. Front Mol Biosci 2020; 7:132. [PMID: 32714941 PMCID: PMC7344294 DOI: 10.3389/fmolb.2020.00132] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 03/06/2020] [Accepted: 06/03/2020] [Indexed: 11/30/2022]
Abstract
Molecular simulation trajectories represent high-dimensional data. Such data can be visualized by methods of dimensionality reduction. Non-linear dimensionality reduction methods are likely to be more efficient than linear ones due to the fact that motions of atoms are non-linear. Here we test a popular non-linear t-distributed Stochastic Neighbor Embedding (t-SNE) method on analysis of trajectories of 200 ns alanine dipeptide dynamics and 208 μs Trp-cage folding and unfolding. Furthermore, we introduce a time-lagged variant of t-SNE in order to focus on rarely occurring transitions in the molecular system. This time-lagged t-SNE efficiently separates states according to distance in time. Using this method it is possible to visualize key states of studied systems (e.g., unfolded and folded protein) as well as possible kinetic traps using a two-dimensional plot. Time-lagged t-SNE is a visualization method and other applications, such as clustering and free energy modeling, must be done with caution.
Affiliation(s)
- Vojtěch Spiwok
  - Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Czechia
- Pavel Kříž
  - Department of Mathematics, University of Chemistry and Technology, Prague, Czechia
9
Fabrizio A, Meyer B, Corminboeuf C. Machine learning models of the energy curvature vs particle number for optimal tuning of long-range corrected functionals. J Chem Phys 2020; 152:154103. [DOI: 10.1063/5.0005039] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 01/16/2023]
Affiliation(s)
- Alberto Fabrizio
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Benjamin Meyer
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Clemence Corminboeuf
  - Laboratory for Computational Molecular Design, Institute of Chemical Sciences and Engineering, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne, Switzerland
  - National Centre for Computational Design and Discovery of Novel Materials (MARVEL), École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
10
Bejagam KK, Singh SK, Ahn R, Deshmukh SA. Unraveling the Conformations of Backbone and Side Chains in Thermosensitive Bottlebrush Polymers. Macromolecules 2019. [DOI: 10.1021/acs.macromol.9b01021] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Indexed: 12/13/2022]
Affiliation(s)
- Karteek K. Bejagam
  - Department of Chemical Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
- Rebecca Ahn
  - Department of Chemical Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
- Sanket A. Deshmukh
  - Department of Chemical Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
11
Tan Q, Duan M, Li M, Han L, Huo S. Approximating dynamic proximity with a hybrid geometry energy-based kernel for diffusion maps. J Chem Phys 2019; 151:105101. [PMID: 31521094 DOI: 10.1063/1.5100968] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
Abstract
The diffusion map is a dimensionality reduction method. The reduction coordinates are associated with the leading eigenfunctions of the backward Fokker-Planck operator, providing a dynamic meaning for these coordinates. One of the key factors that affect the accuracy of diffusion map embedding is the dynamic measure implemented in the Gaussian kernel. A common practice in diffusion map study of molecular systems is to approximate dynamic proximity with RMSD (root-mean-square deviation). In this paper, we present a hybrid geometry-energy based kernel. Since high energy-barriers may exist between geometrically similar conformations, taking both RMSD and energy difference into account in the kernel can better describe conformational transitions between neighboring conformations and lead to accurate embedding. We applied our diffusion map method to the β-hairpin of the B1 domain of streptococcal protein G and to Trp-cage. Our results in β-hairpin show that the diffusion map embedding achieves better results with the hybrid kernel than that with the RMSD-based kernel in terms of free energy landscape characterization and a new correlation measure between the cluster center Euclidean distances in the reduced-dimension space and the reciprocals of the total net flow between these clusters. In addition, our diffusion map analysis of the ultralong molecular dynamics trajectory of Trp-cage has provided a unified view of its folding mechanism. These promising results demonstrate the effectiveness of our diffusion map approach in the analysis of the dynamics and thermodynamics of molecular systems. The hybrid geometry-energy criterion could be also useful as a general dynamic measure for other purposes.
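The idea of combining RMSD and energy difference in a Gaussian kernel can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the kernel's functional form, so the expression `exp(-(rmsd^2 + lam * dE^2) / eps)`, the parameters `eps` and `lam`, and the function names are all hypothetical and should not be read as the authors' kernel. The row normalization that turns the kernel into a Markov transition matrix is the standard first step of a diffusion map.

```python
import math

def hybrid_kernel(rmsd, energy, eps=1.0, lam=1.0):
    """Gaussian kernel on a combined geometric and energetic distance.

    rmsd   : symmetric matrix (list of lists) of pairwise RMSD values
    energy : list of per-conformation energies
    eps    : kernel bandwidth (assumed, hypothetical parameter)
    lam    : weight of the energy term (assumed, hypothetical parameter)
    """
    n = len(energy)
    return [
        [
            math.exp(-(rmsd[i][j] ** 2 + lam * (energy[i] - energy[j]) ** 2) / eps)
            for j in range(n)
        ]
        for i in range(n)
    ]

def markov_matrix(kernel):
    """Row-normalize the kernel so that each row sums to one."""
    return [[k / sum(row) for k in row] for row in kernel]

# Three conformations that are geometrically equidistant, but the third
# lies 2 energy units above the other two (toy values).
rmsd = [[0.0, 1.0, 1.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
energy = [0.0, 0.0, 2.0]
P = markov_matrix(hybrid_kernel(rmsd, energy))
print(P[0][1] > P[0][2])  # the energetically distant state is reached less often
```

With an RMSD-only kernel (`lam=0.0`) states 1 and 2 would be equally close to state 0; the energy term is what separates them here.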
Affiliation(s)
- Qingzhe Tan
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts 01610, USA
- Mojie Duan
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts 01610, USA
- Minghai Li
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts 01610, USA
- Li Han
  - Department of Math and Computer Science, Clark University, Worcester, Massachusetts 01610, USA
- Shuanghong Huo
  - Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts 01610, USA
12
Tribello GA, Gasparotto P. Using Dimensionality Reduction to Analyze Protein Trajectories. Front Mol Biosci 2019; 6:46. [PMID: 31275943 PMCID: PMC6593086 DOI: 10.3389/fmolb.2019.00046] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 03/15/2019] [Accepted: 05/31/2019] [Indexed: 11/24/2022]
Abstract
In recent years the analysis of molecular dynamics trajectories using dimensionality reduction algorithms has become commonplace. These algorithms seek to find a low-dimensional representation of a trajectory that is, according to a well-defined criterion, optimal. A number of different strategies for generating projections of trajectories have been proposed but little has been done to systematically compare how these various approaches fare when it comes to analysing trajectories for biomolecules in explicit solvent. In the following paper, we have thus analyzed a molecular dynamics trajectory of the C-terminal fragment of the immunoglobulin binding domain B1 of protein G of Streptococcus modeled in explicit solvent using a range of different dimensionality reduction algorithms. We have then tried to systematically compare the projections generated using each of these algorithms by using a clustering algorithm to find the positions and extents of the basins in the high-dimensional energy landscape. We find that no algorithm outshines all the others in terms of the quality of the projection it generates. Instead, all the algorithms do a reasonable job when it comes to building a projection that separates some of the configurations that lie in different basins. Having said that, however, all the algorithms struggle to project the basins because they all have a large intrinsic dimensionality.
Affiliation(s)
- Gareth A Tribello
  - Atomistic Simulation Centre, School of Mathematics and Physics, Queen's University Belfast, Belfast, United Kingdom
- Piero Gasparotto
  - Department of Physics and Astronomy, Thomas Young Centre, University College London, London, United Kingdom
13
Post M, Wolf S, Stock G. Principal component analysis of nonequilibrium molecular dynamics simulations. J Chem Phys 2019; 150:204110. [PMID: 31153204 DOI: 10.1063/1.5089636] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 12/14/2022]
Abstract
Principal component analysis (PCA) represents a standard approach to identify collective variables {xi} = x, which can be used to construct the free energy landscape ΔG(x) of a molecular system. While PCA is routinely applied to equilibrium molecular dynamics (MD) simulations, it is less obvious as to how to extend the approach to nonequilibrium simulation techniques. This includes, e.g., the definition of the statistical averages employed in PCA as well as the relation between the equilibrium free energy landscape ΔG(x) and the energy landscapes ΔG(x) obtained from nonequilibrium MD. As an example for a nonequilibrium method, "targeted MD" is considered which employs a moving distance constraint to enforce rare transitions along some biasing coordinate s. The introduced bias can be described by a weighting function P(s), which provides a direct relation between equilibrium and nonequilibrium data, and thus establishes a well-defined way to perform PCA on nonequilibrium data. While the resulting distribution P(x) and energy ΔG∝lnP will not reflect the equilibrium state of the system, the nonequilibrium energy landscape ΔG(x) may directly reveal the molecular reaction mechanism. Applied to targeted MD simulations of the unfolding of decaalanine, for example, a PCA performed on backbone dihedral angles is shown to discriminate several unfolding pathways. Although the formulation is in principle exact, its practical use depends critically on the choice of the biasing coordinate s, which should account for a naturally occurring motion between two well-defined end-states of the system.
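The link between a distribution P(x) along a collective variable and an energy landscape can be illustrated with a minimal sketch. It assumes the common convention ΔG(x) = -kT ln P(x), reports energies in units of kT, and works on a one-dimensional coordinate such as a single principal component. The function name, the bin width, and the optional per-frame `weights` argument (a stand-in for any reweighting of biased data) are hypothetical and do not reproduce the authors' implementation.

```python
import math
from collections import Counter

def free_energy_profile(samples, bin_width, weights=None):
    """Histogram a 1D coordinate and return {bin_center: dG / kT}.

    The profile is shifted so that its minimum is zero.
    """
    if weights is None:
        weights = [1.0] * len(samples)
    hist = Counter()
    for x, w in zip(samples, weights):
        hist[math.floor(x / bin_width)] += w
    total = sum(hist.values())
    # dG = -ln P for every populated bin (empty bins are simply absent)
    dG = {(b + 0.5) * bin_width: -math.log(h / total) for b, h in hist.items()}
    g_min = min(dG.values())
    return {center: g - g_min for center, g in sorted(dG.items())}

# Toy data: 80 frames in one basin, 20 frames in another.
samples = [0.1] * 80 + [1.1] * 20
profile = free_energy_profile(samples, bin_width=1.0)
for center, g in profile.items():
    print(f"x = {center:.1f}  dG = {g:.3f} kT")
```

In this toy case the less populated basin sits ln(80/20) = ln 4 kT above the more populated one.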
Affiliation(s)
- Matthias Post
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Steffen Wolf
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Gerhard Stock
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
14
Guo AZ, Lequieu J, de Pablo JJ. Extracting collective motions underlying nucleosome dynamics via nonlinear manifold learning. J Chem Phys 2019; 150:054902. [PMID: 30736679 DOI: 10.1063/1.5063851] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 02/02/2023]
Abstract
The identification of effective collective variables remains a challenge in molecular simulations of complex systems. Here, we use a nonlinear manifold learning technique known as the diffusion map to extract key dynamical motions from a complex biomolecular system known as the nucleosome: a DNA-protein complex consisting of a DNA segment wrapped around a disc-shaped group of eight histone proteins. We show that without any a priori information, diffusion maps can identify and extract meaningful collective variables that characterize the motion of the nucleosome complex. We find excellent agreement between the collective variables identified by the diffusion map and those obtained manually using a free energy-based analysis. Notably, diffusion maps are shown to also identify subtle features of nucleosome dynamics that did not appear in those manually specified collective variables. For example, diffusion maps identify the importance of looped conformations in which DNA bulges away from the histone complex that are important for the motion of DNA around the nucleosome. This work demonstrates that diffusion maps can be a promising tool for analyzing very large molecular systems and for identifying their characteristic slow modes.
Affiliation(s)
- Ashley Z Guo
  - Institute for Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA
- Joshua Lequieu
  - Institute for Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA
- Juan J de Pablo
  - Institute for Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA
15
Nagel D, Weber A, Lickert B, Stock G. Dynamical coring of Markov state models. J Chem Phys 2019; 150:094111. [DOI: 10.1063/1.5081767] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Indexed: 01/29/2023]
Affiliation(s)
- Daniel Nagel
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Anna Weber
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Benjamin Lickert
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Gerhard Stock
  - Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
16
Mechanism of glucocerebrosidase activation and dysfunction in Gaucher disease unraveled by molecular dynamics and deep learning. Proc Natl Acad Sci U S A 2019; 116:5086-5095. [PMID: 30808805 DOI: 10.1073/pnas.1818411116] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Indexed: 12/31/2022]
Abstract
The lysosomal enzyme glucocerebrosidase-1 (GCase) catalyzes the cleavage of a major glycolipid glucosylceramide into glucose and ceramide. The absence of fully functional GCase leads to the accumulation of its lipid substrates in lysosomes, causing Gaucher disease, an autosomal recessive disorder that displays profound genotype-phenotype nonconcordance. More than 250 disease-causing mutations in GBA1, the gene encoding GCase, have been discovered, although only one of these, N370S, causes 70% of disease. Here, we have used a knowledge-based docking protocol that considers experimental data of protein-protein binding to generate a complex between GCase and its known facilitator protein saposin C (SAPC). Multiscale molecular-dynamics simulations were used to study lipid self-assembly, membrane insertion, and the dynamics of the interactions between different components of the complex. Deep learning was applied to propose a model that explains the mechanism of GCase activation, which requires SAPC. Notably, we find that conformational changes in the loops at the entrance of the substrate-binding site are stabilized by direct interactions with SAPC and that the loss of such interactions induced by N370S and another common mutation, L444P, result in destabilization of the complex and reduced GCase activation. Our findings provide an atomistic-level explanation for GCase activation and the precise mechanism through which N370S and L444P cause Gaucher disease.
17
|
Schöberl M, Zabaras N, Koutsourelakis PS. Predictive collective variable discovery with deep Bayesian models. J Chem Phys 2019; 150:024109. [DOI: 10.1063/1.5058063] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Affiliation(s)
- Markus Schöberl
- Center for Informatics and Computational Science, University of Notre Dame, 311 Cushing Hall, Notre Dame, Indiana 46556, USA
- Continuum Mechanics Group, Technical University of Munich, Boltzmannstraße 15, 85748 Garching, Germany
- Nicholas Zabaras
- Center for Informatics and Computational Science, University of Notre Dame, 311 Cushing Hall, Notre Dame, Indiana 46556, USA
18
Abstract
BACKGROUND: We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes. RESULTS: We use a convolutional variational autoencoder (CVAE) to learn low dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin head piece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the CVAE latent features learned correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features that correspond to conformational substates that share similar structural features. CONCLUSIONS: Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
Affiliation(s)
- Debsindhu Bhowmik
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Shang Gao
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Michael T Young
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Arvind Ramanathan
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
19
Zhou H, Wang F, Tao P. t-Distributed Stochastic Neighbor Embedding Method with the Least Information Loss for Macromolecular Simulations. J Chem Theory Comput 2018; 14:5499-5510. [PMID: 30252473 PMCID: PMC6679899 DOI: 10.1021/acs.jctc.8b00652] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Indexed: 02/08/2023]
Abstract
Dimensionality reduction methods are usually applied to molecular dynamics simulations of macromolecules for analysis and visualization purposes. It is normally desired that suitable dimensionality reduction methods clearly distinguish functionally important states with different conformations for the systems of interest. However, common dimensionality reduction methods for macromolecular simulations, including predefined order parameters and collective variables (CVs), principal component analysis (PCA), and time-structure based independent component analysis (t-ICA), have only limited success due to significant loss of key structural information. Here, we introduced the t-distributed stochastic neighbor embedding (t-SNE) method, a dimensionality reduction method with minimum structural information loss that is widely used in bioinformatics, for analyses of macromolecules, especially simulations of biomacromolecules. It is demonstrated that both one-dimensional (1D) and two-dimensional (2D) models of the t-SNE method are superior at distinguishing important functional states of a model allosteric protein system for free energy and mechanistic analysis. Projections of the model protein simulations onto 1D and 2D t-SNE surfaces provide both clear visual cues and quantitative information, which is not readily available using other methods, regarding the transition mechanism between two important functional states of this protein.
Affiliation(s)
- Hongyu Zhou
- Department of Chemistry, Center for Scientific Computation, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75275, United States of America
- Feng Wang
- Department of Chemistry, Center for Scientific Computation, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75275, United States of America
- Peng Tao
- Department of Chemistry, Center for Scientific Computation, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75275, United States of America
20
Sittel F, Stock G. Perspective: Identification of collective variables and metastable states of protein dynamics. J Chem Phys 2018; 149:150901. [PMID: 30342445 DOI: 10.1063/1.5049637] [Citation(s) in RCA: 84] [Impact Index Per Article: 14.0] [Indexed: 01/25/2023]
Abstract
The statistical analysis of molecular dynamics simulations requires dimensionality reduction techniques, which yield a low-dimensional set of collective variables (CVs) $\{x_i\} = \mathbf{x}$ that in some sense describe the essential dynamics of the system. Considering the distribution $P(\mathbf{x})$ of the CVs, the primal goal of a statistical analysis is to detect the characteristic features of $P(\mathbf{x})$, in particular, its maxima and their connection paths. This is because these features characterize the low-energy regions and the energy barriers of the corresponding free energy landscape $\Delta G(\mathbf{x}) = -k_\mathrm{B} T \ln P(\mathbf{x})$, and therefore amount to the metastable states and transition regions of the system. In this perspective, we outline a systematic strategy to identify CVs and metastable states, which subsequently can be employed to construct a Langevin or a Markov state model of the dynamics. In particular, we account for the still limited sampling typically achieved by molecular dynamics simulations, which in practice seriously limits the applicability of theories (e.g., assuming ergodicity) and black-box software tools (e.g., using redundant input coordinates). We show that it is essential to use internal (rather than Cartesian) input coordinates, employ dimensionality reduction methods that avoid rescaling errors (such as principal component analysis), and perform density based (rather than k-means-type) clustering. Finally, we briefly discuss a machine learning approach to dimensionality reduction, which highlights the essential internal coordinates of a system and may reveal hidden reaction mechanisms.
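The relation between the CV distribution and the free energy landscape quoted in this abstract, ΔG(x) = -k_B T ln P(x), can be illustrated with a minimal sketch for a single one-dimensional CV. The function name `free_energy_profile`, the fixed-width histogram estimate of P(x), the kJ/mol units, and the 300 K default are illustrative assumptions and are not taken from the cited perspective.

```python
import math

K_B = 0.0083144626  # Boltzmann constant in kJ/(mol*K); unit choice is an assumption


def free_energy_profile(samples, n_bins, lo, hi, temperature=300.0):
    """Estimate Delta G(x) = -k_B T ln P(x) from samples of one CV.

    P(x) is approximated by a fixed-width histogram on [lo, hi).
    Returns (bin_centers, delta_g) with delta_g in kJ/mol, shifted so
    that the most populated bin sits at zero. Empty bins get math.inf.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        if lo <= x < hi:
            # Guard against floating-point round-up at the right edge.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside [lo, hi)")
    kt = K_B * temperature
    # The bin width only adds a constant to Delta G, which the shift removes.
    g = [-kt * math.log(c / total) if c > 0 else math.inf for c in counts]
    g_min = min(g)
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return centers, [v - g_min for v in g]


# Toy data: 90% of frames in one basin, 10% in another.
centers, delta_g = free_energy_profile([0.1] * 90 + [0.9] * 10, 2, 0.0, 1.0)
```

With this toy input the less populated bin lies k_B T ln 9 above the more populated one, which is how population ratios translate into free energy differences.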
Affiliation(s)
- Florian Sittel
- Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
- Gerhard Stock
- Biomolecular Dynamics, Institute of Physics, Albert Ludwigs University, 79104 Freiburg, Germany
21
Abstract
The minimum energy pathway contains important information describing the transition between two states on a potential energy surface (PES). Chain-of-states methods were developed to efficiently calculate minimum energy pathways connecting two stable states. In the chain-of-states framework, a series of structures are generated and optimized to represent the minimum energy pathway connecting two states. However, multiple pathways may exist connecting two existing states and should be identified to obtain a full view of the transitions. Therefore, we developed an enhanced sampling method, named as the direct pathway dynamics sampling (DPDS) method, to facilitate exploration of a PES for multiple pathways connecting two stable states as well as additional minima and their associated transition pathways. In the DPDS method, molecular dynamics simulations are carried out on the targeting PES within a chain-of-states framework to directly sample the transition pathway space. The simulations of DPDS could be regulated by two parameters controlling distance among states along the pathway and smoothness of the pathway. One advantage of the chain-of-states framework is that no specific reaction coordinates are necessary to generate the reaction pathway, because such information is implicitly represented by the structures along the pathway. The chain-of-states setup in a DPDS method greatly enhances the sufficient sampling in high-energy space between two end states, such as transition states. By removing the constraint on the end states of the pathway, DPDS will also sample pathways connecting minima on a PES in addition to the end points of the starting pathway. This feature makes DPDS an ideal method to directly explore transition pathway space. Three examples demonstrate the efficiency of DPDS methods in sampling the high-energy area important for reactions on the PES.
Affiliation(s)
- Hongyu Zhou
- Department of Chemistry, Center for Drug Discovery, Design, and Delivery (CD4), Center for Scientific Computation, Southern Methodist University, Dallas, Texas 75275, United States of America
- Peng Tao
- Department of Chemistry, Center for Drug Discovery, Design, and Delivery (CD4), Center for Scientific Computation, Southern Methodist University, Dallas, Texas 75275, United States of America
22
Demharter S, Knapp B, Deane CM, Minary P. Modeling Functional Motions of Biological Systems by Customized Natural Moves. Biophys J 2017; 111:710-721. [PMID: 27558715 PMCID: PMC5002067 DOI: 10.1016/j.bpj.2016.06.028] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 02/01/2016] [Revised: 06/20/2016] [Accepted: 06/22/2016] [Indexed: 11/30/2022]
Abstract
Simulating the functional motions of biomolecular systems requires large computational resources. We introduce a computationally inexpensive protocol for the systematic testing of hypotheses regarding the dynamic behavior of proteins and nucleic acids. The protocol is based on natural move Monte Carlo, a highly efficient conformational sampling method with built-in customization capabilities that allows researchers to design and perform a large number of simulations to investigate functional motions in biological systems. We demonstrate the use of this protocol on both a protein and a DNA case study. Firstly, we investigate the plasticity of a class II major histocompatibility complex in the absence of a bound peptide. Secondly, we study the effects of the epigenetic mark 5-hydroxymethyl on cytosine on the structure of the Dickerson-Drew dodecamer. We show how our customized natural moves protocol can be used to investigate causal relationships of functional motions in biological systems.
Affiliation(s)
- Samuel Demharter
- Department of Computer Science, University of Oxford, Oxford, UK
- Bernhard Knapp
- Department of Statistics, University of Oxford, Oxford, UK
- Peter Minary
- Department of Computer Science, University of Oxford, Oxford, UK
23
Liu H, Li M, Fan J, Huo S. Inherent structure versus geometric metric for state space discretization. J Comput Chem 2016; 37:1251-8. [PMID: 26915811 DOI: 10.1002/jcc.24315] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/16/2015] [Revised: 11/17/2015] [Accepted: 01/06/2016] [Indexed: 01/13/2023]
Abstract
Inherent structure (IS) and geometry-based clustering methods are commonly used for analyzing molecular dynamics trajectories. ISs are obtained by minimizing the sampled conformations into local minima on the potential/effective energy surface. The conformations that are minimized into the same energy basin belong to one cluster. We investigate the influence of the applications of these two methods of trajectory decomposition on our understanding of the thermodynamics and kinetics of alanine tetrapeptide. We find that at the microcluster level, the IS approach and root-mean-square deviation (RMSD)-based clustering method give totally different results. Depending on the local features of the energy landscape, the conformations with close RMSDs can be minimized into different minima, while the conformations with large RMSDs could be minimized into the same basin. However, the relaxation timescales calculated based on the transition matrices built from the microclusters are similar. The discrepancy at the microcluster level leads to different macroclusters. Although the dynamic models established through both clustering methods are validated as approximately Markovian, the IS approach seems to give a meaningful state space discretization at the macrocluster level in terms of conformational features and kinetics.
Affiliation(s)
- Hanzhong Liu
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts, 01610
- Minghai Li
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts, 01610
- Jue Fan
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts, 01610
- Shuanghong Huo
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts, 01610
24
Kim SB, Dsilva CJ, Kevrekidis IG, Debenedetti PG. Systematic characterization of protein folding pathways using diffusion maps: Application to Trp-cage miniprotein. J Chem Phys 2015; 142:085101. [DOI: 10.1063/1.4913322] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Indexed: 11/14/2022]
Affiliation(s)
- Sang Beom Kim
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey 08544, USA
- Carmeline J. Dsilva
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey 08544, USA
- Ioannis G. Kevrekidis
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey 08544, USA
- Program in Applied and Computational Mathematics, Princeton University, Princeton, New Jersey 08544, USA
- Pablo G. Debenedetti
- Department of Chemical and Biological Engineering, Princeton University, Princeton, New Jersey 08544, USA
25
Li M, Duan M, Fan J, Han L, Huo S. Graph representation of protein free energy landscape. J Chem Phys 2014; 139:185101. [PMID: 24320303 DOI: 10.1063/1.4829768] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 12/28/2022]
Abstract
The thermodynamics and kinetics of protein folding and protein conformational changes are governed by the underlying free energy landscape. However, the multidimensional nature of the free energy landscape makes it difficult to describe. We propose to use a weighted-graph approach to depict the free energy landscape with the nodes on the graph representing the conformational states and the edge weights reflecting the free energy barriers between the states. Our graph is constructed from a molecular dynamics trajectory and does not involve projecting the multi-dimensional free energy landscape onto a low-dimensional space defined by a few order parameters. The calculation of free energy barriers was based on transition-path theory using the MSMBuilder2 package. We compare our graph with the widely used transition disconnectivity graph (TRDG), which is constructed from the same trajectory, and show that our approach gives a more accurate description of the free energy landscape than the TRDG approach even though the latter can be organized into a simple tree representation. The weighted graph is a general approach and can be used on any complex system.
Affiliation(s)
- Minghai Li
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, 950 Main Street, Worcester, Massachusetts 01610, USA
26
Duan M, Li M, Han L, Huo S. Euclidean sections of protein conformation space and their implications in dimensionality reduction. Proteins 2014; 82:2585-96. [PMID: 24913095 DOI: 10.1002/prot.24622] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/12/2014] [Revised: 05/06/2014] [Accepted: 05/30/2014] [Indexed: 01/05/2023]
Abstract
Dimensionality reduction is widely used in searching for the intrinsic reaction coordinates for protein conformational changes. We find the dimensionality-reduction methods using the pairwise root-mean-square deviation (RMSD) as the local distance metric face a challenge. We use Isomap as an example to illustrate the problem. We believe that there is an implied assumption for the dimensionality-reduction approaches that aim to preserve the geometric relations between the objects: both the original space and the reduced space have the same kind of geometry, such as Euclidean geometry vs. Euclidean geometry or spherical geometry vs. spherical geometry. When the protein free energy landscape is mapped onto a 2D plane or 3D space, the reduced space is Euclidean, thus the original space should also be Euclidean. For a protein with N atoms, its conformation space is a subset of the 3N-dimensional Euclidean space $\mathbb{R}^{3N}$. We formally define the protein conformation space as the quotient space of $\mathbb{R}^{3N}$ by the equivalence relation of rigid motions. Whether the quotient space is Euclidean or not depends on how it is parameterized. When the pairwise RMSD is employed as the local distance metric, implicit representations are used for the protein conformation space, leading to no direct correspondence to a Euclidean set. We have demonstrated that an explicit Euclidean-based representation of protein conformation space and the local distance metric associated to it improve the quality of dimensionality reduction in the tetra-peptide and β-hairpin systems.
Affiliation(s)
- Mojie Duan
- Gustaf H. Carlson School of Chemistry and Biochemistry, Clark University, Worcester, Massachusetts, 01610