1
Arbon R, Zhu Y, Mey ASJS. Markov State Models: To Optimize or Not to Optimize. J Chem Theory Comput 2024; 20:977-988. [PMID: 38163961 PMCID: PMC10809420 DOI: 10.1021/acs.jctc.3c01134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 12/10/2023] [Accepted: 12/11/2023] [Indexed: 01/03/2024]
Abstract
Markov state models (MSM) are a popular statistical method for analyzing the conformational dynamics of proteins including protein folding. With all statistical and machine learning (ML) models, choices must be made about the modeling pipeline that cannot be directly learned from the data. These choices, or hyperparameters, are often evaluated by expert judgment or, in the case of MSMs, by maximizing variational scores such as the VAMP-2 score. Modern ML and statistical pipelines often use automatic hyperparameter selection techniques ranging from the simple, choosing the best score from a random selection of hyperparameters, to the complex, optimization via, e.g., Bayesian optimization. In this work, we ask whether it is possible to automatically select MSM models this way by estimating and analyzing over 16,000,000 observations from over 280,000 estimated MSMs. We find that differences in hyperparameters can change the physical interpretation of the optimization objective, making automatic selection difficult. In addition, we find that enforcing conditions of equilibrium in the VAMP scores can result in inconsistent model selection. However, other parameters that specify the VAMP-2 score (lag time and number of relaxation processes scored) have only a negligible influence on model selection. We suggest that model observables and variational scores should be only a guide to model selection and that a full investigation of the MSM properties should be undertaken when selecting hyperparameters.
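As a rough illustration of the VAMP-2 score mentioned in this abstract, the sketch below scores a discrete-state model at a given lag time while keeping a chosen number of processes. It is a minimal NumPy sketch that assumes indicator (one-hot) features on an already discretized trajectory; the function name `vamp2_score` and its arguments are hypothetical and are not taken from the paper or from any MSM library.

```python
import numpy as np


def vamp2_score(dtraj, n_states, lag, k):
    """VAMP-2 score of a discrete trajectory (hypothetical helper).

    Uses indicator functions of the states as the basis, so the
    covariance matrices reduce to transition counts.
    """
    dtraj = np.asarray(dtraj)
    x0, xt = dtraj[:-lag], dtraj[lag:]

    # Time-lagged count matrix C_0t: transitions i -> j after `lag` steps.
    c0t = np.zeros((n_states, n_states))
    np.add.at(c0t, (x0, xt), 1.0)

    # Instantaneous covariances are diagonal for indicator features.
    c00 = c0t.sum(axis=1)  # visits at time t
    ctt = c0t.sum(axis=0)  # visits at time t + lag
    if np.any(c00 == 0) or np.any(ctt == 0):
        raise ValueError("every state must be visited in both time windows")

    # Whitened matrix C_00^{-1/2} C_0t C_tt^{-1/2}.
    koopman = c0t / np.sqrt(np.outer(c00, ctt))

    # VAMP-2 is the sum of the k largest squared singular values.
    singular_values = np.linalg.svd(koopman, compute_uv=False)
    return float(np.sum(singular_values[:k] ** 2))


if __name__ == "__main__":
    example = [0, 0, 1, 1, 0, 2, 2, 1, 0, 0, 1, 2, 2, 0]
    print(vamp2_score(example, n_states=3, lag=1, k=2))
```

Here `lag` and `k` play the roles of the lag time and the number of relaxation processes scored. With this basis the largest singular value is always 1, so the score lies between 1 and `k`.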
Affiliation(s)
- Robert E. Arbon
  - EaStCHEM School of Chemistry, David Brewster Road, Joseph Black Building, The King’s Buildings, Edinburgh EH9 3FJ, United Kingdom
  - Redesign Science, 180 Varick St., New York, New York 10014, United States
- Yanchen Zhu
  - EaStCHEM School of Chemistry, David Brewster Road, Joseph Black Building, The King’s Buildings, Edinburgh EH9 3FJ, United Kingdom
- Antonia S. J. S. Mey
  - EaStCHEM School of Chemistry, David Brewster Road, Joseph Black Building, The King’s Buildings, Edinburgh EH9 3FJ, United Kingdom
2
Jiang H, Li H, Wong WH, Fan X. Revealing Free Energy Landscape From MD Data via Conditional Angle Partition Tree. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:1384-1394. [PMID: 35503836 DOI: 10.1109/tcbb.2022.3172352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Deciphering the free energy landscape of biomolecular structure space is crucial for understanding many complex molecular processes, such as protein-protein interaction, RNA folding, and protein folding. A major source of current dynamic structure data is Molecular Dynamics (MD) simulations. Several methods have been proposed to investigate the free energy landscape from MD data, but all of them rely on the assumption that kinetic similarity is associated with global geometric similarity, which may lead to unsatisfactory results. In this paper, we proposed a new method called Conditional Angle Partition Tree to reveal the hierarchical free energy landscape by correlating local geometric similarity with kinetic similarity. Its application on the benchmark alanine dipeptide MD data showed a much better performance than existing methods in exploring and understanding the free energy landscape. We also applied it to the MD data of Villin HP35. Our results are more reasonable on various aspects than those from other methods and very informative on the hierarchical structure of its energy landscape.
3
Sharpe DJ, Wales DJ. Nearly reducible finite Markov chains: Theory and algorithms. J Chem Phys 2021; 155:140901. [PMID: 34654307 DOI: 10.1063/5.0060978] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 01/03/2023]
Abstract
Finite Markov chains, memoryless random walks on complex networks, appear commonly as models for stochastic dynamics in condensed matter physics, biophysics, ecology, epidemiology, economics, and elsewhere. Here, we review exact numerical methods for the analysis of arbitrary discrete- and continuous-time Markovian networks. We focus on numerically stable methods that are required to treat nearly reducible Markov chains, which exhibit a separation of characteristic timescales and are therefore ill-conditioned. In this metastable regime, dense linear algebra methods are afflicted by propagation of error in the finite precision arithmetic, and the kinetic Monte Carlo algorithm to simulate paths is unfeasibly inefficient. Furthermore, iterative eigendecomposition methods fail to converge without the use of nontrivial and system-specific preconditioning techniques. An alternative approach is provided by state reduction procedures, which do not require additional a priori knowledge of the Markov chain. Macroscopic dynamical quantities, such as moments of the first passage time distribution for a transition to an absorbing state, and microscopic properties, such as the stationary, committor, and visitation probabilities for nodes, can be computed robustly using state reduction algorithms. The related kinetic path sampling algorithm allows for efficient sampling of trajectories on a nearly reducible Markov chain. Thus, all of the information required to determine the kinetically relevant transition mechanisms, and to identify the states that have a dominant effect on the global dynamics, can be computed reliably even for computationally challenging models. Rare events are a ubiquitous feature of realistic dynamical systems, and so the methods described herein are valuable in many practical applications.
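One classic example of a state reduction procedure for the stationary distribution is the Grassmann-Taksar-Heyman (GTH) algorithm. The abstract does not say which variants the review covers, so the sketch below is an illustrative assumption rather than the authors' method, and `gth_stationary` is a hypothetical name. It eliminates states one at a time using only additions, multiplications, and divisions of non-negative numbers, which avoids the subtractive cancellation that affects dense solvers on nearly reducible chains.

```python
import numpy as np


def gth_stationary(transition_matrix):
    """Stationary distribution of an irreducible discrete-time chain
    via GTH state reduction (hypothetical helper, illustrative only)."""
    a = np.array(transition_matrix, dtype=float)
    n = a.shape[0]

    # Elimination: remove states n-1, n-2, ..., 1 in turn.
    for k in range(n - 1, 0, -1):
        # Total probability of leaving state k towards the remaining states.
        s = a[k, :k].sum()
        a[:k, k] /= s
        # Redirect paths that passed through the eliminated state k.
        a[:k, :k] += np.outer(a[:k, k], a[k, :k])

    # Back-substitution: rebuild unnormalized weights state by state.
    pi = np.zeros(n)
    pi[0] = 1.0
    for k in range(1, n):
        pi[k] = pi[:k] @ a[:k, k]

    return pi / pi.sum()


if __name__ == "__main__":
    # Two weakly coupled blocks: state 2 is reached only with rate 1e-9.
    p = np.array([
        [0.9, 0.1, 0.0],
        [0.1, 0.9 - 1e-9, 1e-9],
        [0.0, 1e-9, 1.0 - 1e-9],
    ])
    print(gth_stationary(p))
```

Note that the diagonal entries are never read, so rounding errors in `1 - (small number)` do not enter the result.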
Affiliation(s)
- Daniel J Sharpe
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- David J Wales
  - Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
4
Jiang H, Fan X. The Two-Step Clustering Approach for Metastable States Learning. Int J Mol Sci 2021; 22:6576. [PMID: 34205252 PMCID: PMC8233889 DOI: 10.3390/ijms22126576] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/31/2021] [Revised: 06/14/2021] [Accepted: 06/14/2021] [Indexed: 01/20/2023]
Abstract
Understanding the energy landscape and the conformational dynamics is crucial for studying many biological or chemical processes, such as protein-protein interaction and RNA folding. Molecular Dynamics (MD) simulations have been a major source of dynamic structure. Although many methods were proposed for learning metastable states from MD data, some key problems are still in need of further investigation. Here, we give a brief review on recent progresses in this field, with an emphasis on some popular methods belonging to a two-step clustering framework, and hope to draw more researchers to contribute to this area.
Affiliation(s)
- Hangjin Jiang
  - Center for Data Science, Zhejiang University, Hangzhou 310058, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong, China
5
Ward MD, Zimmerman MI, Meller A, Chung M, Swamidass SJ, Bowman GR. Deep learning the structural determinants of protein biochemical properties by comparing structural ensembles with DiffNets. Nat Commun 2021; 12:3023. [PMID: 34021153 PMCID: PMC8140102 DOI: 10.1038/s41467-021-23246-1] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 09/29/2020] [Accepted: 04/16/2021] [Indexed: 12/05/2022]
Abstract
Understanding the structural determinants of a protein's biochemical properties, such as activity and stability, is a major challenge in biology and medicine. Comparing computer simulations of protein variants with different biochemical properties is an increasingly powerful means to drive progress. However, success often hinges on dimensionality reduction algorithms for simplifying the complex ensemble of structures each variant adopts. Unfortunately, common algorithms rely on potentially misleading assumptions about what structural features are important, such as emphasizing larger geometric changes over smaller ones. Here we present DiffNets, self-supervised autoencoders that avoid such assumptions, and automatically identify the relevant features, by requiring that the low-dimensional representations they learn are sufficient to predict the biochemical differences between protein variants. For example, DiffNets automatically identify subtle structural signatures that predict the relative stabilities of β-lactamase variants and duty ratios of myosin isoforms. DiffNets should also be applicable to understanding other perturbations, such as ligand binding.
Affiliation(s)
- Michael D Ward
  - Department of Biochemistry & Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
- Maxwell I Zimmerman
  - Department of Biochemistry & Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
- Artur Meller
  - Department of Biochemistry & Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
- Moses Chung
  - Department of Biochemistry & Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
- S J Swamidass
  - Department of Pathology & Immunology, Washington University School of Medicine, St. Louis, MO, USA
- Gregory R Bowman
  - Department of Biochemistry & Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
  - Center for the Science and Engineering of Living Systems, Washington University in St. Louis, St. Louis, MO, USA
6
Weiß RG, Ries B, Wang S, Riniker S. Volume-scaled common nearest neighbor clustering algorithm with free-energy hierarchy. J Chem Phys 2021; 154:084106. [PMID: 33639726 DOI: 10.1063/5.0025797] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/14/2022]
Abstract
The combination of Markov state modeling (MSM) and molecular dynamics (MD) simulations has been shown in recent years to be a valuable approach to unravel the slow processes of molecular systems with increasing complexity. While the algorithms for intermediate steps in the MSM workflow such as featurization and dimensionality reduction have been specifically adapted to MD datasets, conventional clustering methods are generally applied to the discretization step. This work adds to recent efforts to develop specialized density-based clustering algorithms for the Boltzmann-weighted data from MD simulations. We introduce the volume-scaled common nearest neighbor (vs-CNN) clustering that is an adapted version of the common nearest neighbor (CNN) algorithm. A major advantage of the proposed algorithm is that the introduced density-based criterion directly links to a free-energy notion via Boltzmann inversion. Such a free-energy perspective allows a straightforward hierarchical scheme to identify conformational clusters at different levels of a generally rugged free-energy landscape of complex molecular systems.
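The Boltzmann inversion mentioned in this abstract relates a sampled density ρ to a free energy through F = -k_B T ln ρ. The snippet below is a minimal sketch of that relation only, not of the vs-CNN algorithm itself; the function name, the default temperature of 300 K, and the kJ/mol units are assumptions chosen for illustration.

```python
import numpy as np

# Molar gas constant in kJ/(mol K), i.e. k_B per mole.
KB = 0.0083144626


def free_energy_from_density(density, temperature=300.0):
    """Boltzmann inversion F = -k_B T ln(rho) (hypothetical helper).

    Returns free energies in kJ/mol, shifted so the most populated
    point sits at zero. Points with zero density map to +inf.
    """
    density = np.asarray(density, dtype=float)
    with np.errstate(divide="ignore"):
        free_energy = -KB * temperature * np.log(density)
    finite = np.isfinite(free_energy)
    return free_energy - free_energy[finite].min()


if __name__ == "__main__":
    # A density ratio of e between two points corresponds to 1 k_B T.
    print(free_energy_from_density([1.0, np.e, 0.0]))
```

Because only density ratios matter after the shift, the density does not need to be normalized before inversion.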
Affiliation(s)
- R Gregor Weiß
  - Laboratory of Physical Chemistry, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland
- Benjamin Ries
  - Laboratory of Physical Chemistry, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland
- Shuzhe Wang
  - Laboratory of Physical Chemistry, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland
- Sereina Riniker
  - Laboratory of Physical Chemistry, ETH Zürich, Vladimir-Prelog-Weg 2, 8093 Zürich, Switzerland
7
Affiliation(s)
- Francesco Cocina
  - Biochemistry Department, University of Zurich, Zurich CH-8057, Switzerland
- Andreas Vitalis
  - Biochemistry Department, University of Zurich, Zurich CH-8057, Switzerland
- Amedeo Caflisch
  - Biochemistry Department, University of Zurich, Zurich CH-8057, Switzerland
8
Recent Progress towards Chemically-Specific Coarse-Grained Simulation Models with Consistent Dynamical Properties. Computation 2019. [DOI: 10.3390/computation7030042] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Indexed: 01/26/2023]
Abstract
Coarse-grained (CG) models can provide computationally efficient and conceptually simple characterizations of soft matter systems. While generic models probe the underlying physics governing an entire family of free-energy landscapes, bottom-up CG models are systematically constructed from a higher-resolution model to retain a high level of chemical specificity. The removal of degrees of freedom from the system modifies the relationship between the relative time scales of distinct dynamical processes through both a loss of friction and a “smoothing” of the free-energy landscape. While these effects typically result in faster dynamics, decreasing the computational expense of the model, they also obscure the connection to the true dynamics of the system. The lack of consistent dynamics is a serious limitation for CG models, which not only prevents quantitatively accurate predictions of dynamical observables but can also lead to qualitatively incorrect descriptions of the characteristic dynamical processes. With many methods available for optimizing the structural and thermodynamic properties of chemically-specific CG models, recent years have seen a stark increase in investigations addressing the accurate description of dynamical properties generated from CG simulations. In this review, we present an overview of these efforts, ranging from bottom-up parameterizations of generalized Langevin equations to refinements of the CG force field based on a Markov state modeling framework. We aim to make connections between seemingly disparate approaches, while laying out some of the major challenges as well as potential directions for future efforts.
9
Thiede EH, Giannakis D, Dinner AR, Weare J. Galerkin approximation of dynamical quantities using trajectory data. J Chem Phys 2019; 150:244111. [PMID: 31255053 PMCID: PMC6824902 DOI: 10.1063/1.5063730] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 09/30/2018] [Accepted: 05/13/2019] [Indexed: 11/14/2022]
Abstract
Understanding chemical mechanisms requires estimating dynamical statistics such as expected hitting times, reaction rates, and committors. Here, we present a general framework for calculating these dynamical quantities by approximating boundary value problems using dynamical operators with a Galerkin expansion. A specific choice of basis set in the expansion corresponds to the estimation of dynamical quantities using a Markov state model. More generally, the boundary conditions impose restrictions on the choice of basis sets. We demonstrate how an alternative basis can be constructed using ideas from diffusion maps. In our numerical experiments, this basis gives results of comparable or better accuracy to Markov state models. Additionally, we show that delay embedding can reduce the information lost when projecting the system's dynamics for model construction; this improves estimates of dynamical statistics considerably over the standard practice of increasing the lag time.
Affiliation(s)
- Erik H Thiede
  - Department of Chemistry and James Franck Institute, The University of Chicago, Chicago, Illinois 60637, USA
- Dimitrios Giannakis
  - Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA
- Aaron R Dinner
  - Department of Chemistry and James Franck Institute, The University of Chicago, Chicago, Illinois 60637, USA
- Jonathan Weare
  - Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA
10
Husic BE, Schlueter-Kuck KL, Dabiri JO. Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity. PLoS One 2019; 14:e0212442. [PMID: 30865644 PMCID: PMC6415781 DOI: 10.1371/journal.pone.0212442] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/03/2018] [Accepted: 02/01/2019] [Indexed: 11/24/2022]
Abstract
The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience.
Affiliation(s)
- Brooke E. Husic
  - Department of Chemistry, Stanford University, Stanford, California, United States of America
- Kristy L. Schlueter-Kuck
  - Department of Mechanical Engineering, Stanford University, Stanford, California, United States of America
- John O. Dabiri
  - Department of Mechanical Engineering, Stanford University, Stanford, California, United States of America
  - Department of Civil and Environmental Engineering, Stanford University, Stanford, California, United States of America
11
Narayan B, Herbert C, Yuan Y, Rodriguez BJ, Brooks BR, Buchete NV. Conformational analysis of replica exchange MD: Temperature-dependent Markov networks for FF amyloid peptides. J Chem Phys 2018; 149:072323. [PMID: 30134732 DOI: 10.1063/1.5027580] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 01/02/2023]
Abstract
Recent molecular modeling methods using Markovian descriptions of conformational states of biomolecular systems have led to powerful analysis frameworks that can accurately describe their complex dynamical behavior. In conjunction with enhanced sampling methods, such as replica exchange molecular dynamics (REMD), these frameworks allow the systematic and accurate extraction of transition probabilities between the corresponding states, in the case of Markov state models, and of statistically-optimized transition rates, in the case of the corresponding coarse master equations. However, applying automatically such methods to large molecular dynamics (MD) simulations, with explicit water molecules, remains limited both by the initial ability to identify good candidates for the underlying Markovian states and by the necessity to do so using good collective variables as reaction coordinates that allow the correct counting of inter-state transitions at various lag times. Here, we show that, in cases when representative molecular conformations can be identified for the corresponding Markovian states, and thus their corresponding collective evolution of atomic positions can be calculated along MD trajectories, one can use them to build a new type of simple collective variable, which can be particularly useful in both the correct state assignment and in the subsequent accurate counting of inter-state transition probabilities. In the case of the ubiquitously used root-mean-square deviation (RMSD) of atomic positions, we introduce the relative RMSD (RelRMSD) measure as a good reaction coordinate candidate. 
We apply this method to the analysis of REMD trajectories of amyloid-forming diphenylalanine (FF) peptides, a system with important nanotechnology and biomedical applications due to its self-assembling and piezoelectric properties, illustrating the use of RelRMSD in extracting its temperature-dependent intrinsic kinetics, without a priori assumptions on the functional form (e.g., Arrhenius or not) of the underlying conformational transition rates. The RelRMSD analysis also enables a more objective assessment of the convergence of the REMD simulations. This type of collective variable may be generalized to other observables that could accurately capture conformational differences between the underlying Markov states (e.g., distance RMSD or the fraction of native contacts).
Affiliation(s)
- Brajesh Narayan
  - School of Physics, University College Dublin, Belfield, Dublin 4, Ireland
- Colm Herbert
  - School of Physics, University College Dublin, Belfield, Dublin 4, Ireland
- Ye Yuan
  - School of Physics, University College Dublin, Belfield, Dublin 4, Ireland
- Brian J Rodriguez
  - School of Physics, University College Dublin, Belfield, Dublin 4, Ireland
- Bernard R Brooks
  - Laboratory of Computational Biology, NHLBI, National Institutes of Health, Bethesda, Maryland 20892, USA
12
Affiliation(s)
- Brooke E. Husic
  - Department of Chemistry, Stanford University, Stanford, California 94305, United States
- Vijay S. Pande
  - Department of Chemistry, Stanford University, Stanford, California 94305, United States