1
Abstract
BACKGROUND We examine the problem of clustering biomolecular simulations using deep learning techniques. Since biomolecular simulation datasets are inherently high dimensional, it is often necessary to build low dimensional representations that can be used to extract quantitative insights into the atomistic mechanisms that underlie complex biological processes. RESULTS We use a convolutional variational autoencoder (CVAE) to learn low dimensional, biophysically relevant latent features from long time-scale protein folding simulations in an unsupervised manner. We demonstrate our approach on three model protein folding systems, namely Fs-peptide (14 μs aggregate sampling), villin head piece (single trajectory of 125 μs) and β-β-α (BBA) protein (223 + 102 μs sampling across two independent trajectories). In these systems, we show that the CVAE latent features learned correspond to distinct conformational substates along the protein folding pathways. The CVAE model predicts, on average, nearly 89% of all contacts within the folding trajectories correctly, while being able to extract folded, unfolded and potentially misfolded states in an unsupervised manner. Further, the CVAE model can be used to learn latent features of protein folding that can be applied to other independent trajectories, making it particularly attractive for identifying intrinsic features that correspond to conformational substates that share similar structural features. CONCLUSIONS Together, we show that the CVAE model can quantitatively describe complex biophysical processes such as protein folding.
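As a rough illustration of the contact-level quantities this abstract refers to, the sketch below builds a binary residue contact map from Cα coordinates and measures the fraction of entries on which two maps agree. The 8 Å cutoff, the function names, and the agreement metric are assumptions chosen for illustration; the abstract does not state how contacts were defined or scored, and this is not the CVAE model itself.

```python
import numpy as np

def contact_map(ca_coords, cutoff=8.0):
    """Binary contact map from C-alpha coordinates of shape (n_residues, 3).

    The 8.0 angstrom cutoff is an assumed value, not taken from the abstract.
    """
    ca = np.asarray(ca_coords, dtype=float)
    # Pairwise difference vectors via broadcasting: shape (n, n, 3)
    diff = ca[:, None, :] - ca[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < cutoff).astype(np.uint8)

def contact_agreement(true_map, predicted_map):
    """Fraction of map entries on which two binary maps agree (hypothetical metric)."""
    true_map = np.asarray(true_map)
    predicted_map = np.asarray(predicted_map)
    return float(np.mean(true_map == predicted_map))
```

A map computed per trajectory frame could then be compared against a model's reconstruction of that frame with `contact_agreement`.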
Affiliation(s)
- Debsindhu Bhowmik
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Shang Gao
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Michael T Young
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA
- Arvind Ramanathan
- Computational Science and Engineering Division, Oak Ridge National Laboratory, One Bethel Valley Road, MS6085, Oak Ridge, TN, USA.
2
Johnston T, Zhang B, Liwo A, Crivelli S, Taufer M. In situ data analytics and indexing of protein trajectories. J Comput Chem 2017; 38:1419-1430. [PMID: 28093787 DOI: 10.1002/jcc.24729] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/25/2016] [Revised: 10/22/2016] [Accepted: 10/27/2016] [Indexed: 11/06/2022]
Abstract
The transition toward exascale computing will be accompanied by a performance dichotomy. Computational peak performance will rapidly increase; I/O performance will either grow slowly or be completely stagnant. Essentially, the rate at which data are generated will grow much faster than the rate at which data can be read from and written to the disk. MD simulations will soon face the I/O problem of efficiently writing to and reading from disk on the next generation of supercomputers. This article targets MD simulations at the exascale and proposes a novel technique for in situ data analysis and indexing of MD trajectories. Our technique maps individual trajectories' substructures (i.e., α-helices, β-strands) to metadata frame by frame. The metadata captures the conformational properties of the substructures. The ensemble of metadata can be used for automatic, strategic analysis within a trajectory or across trajectories, without manually identifying those portions of trajectories in which critical changes take place. We demonstrate our technique's effectiveness by applying it to 26.3k helices and 31.2k strands from 9917 PDB proteins and by providing three empirical case studies. © 2017 Wiley Periodicals, Inc.
Affiliation(s)
- Travis Johnston
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
- Boyu Zhang
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
- Adam Liwo
- Department of Theoretical Chemistry, University of Gdansk, 80-952, Gdańsk, Poland
- Silvia Crivelli
- Department of Computer Science, University of California, Davis, CA, 95616, USA
- Michela Taufer
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
3
C(α) torsion angles as a flexible criterion to extract secrets from a molecular dynamics simulation. J Mol Model 2014; 20:2196. [PMID: 24728650 DOI: 10.1007/s00894-014-2196-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/10/2014] [Accepted: 03/02/2014] [Indexed: 02/02/2023]
Abstract
Given the increasing complexity of simulated molecular systems, and the fact that simulation times have now reached milliseconds to seconds, immense amounts of data (in the gigabyte to terabyte range) are produced in current molecular dynamics simulations. Manual analysis of these data is a very time-consuming task, and important events that lead from one intermediate structure to another can become occluded in the noise resulting from random thermal fluctuations. To overcome these problems and facilitate a semi-automated data analysis, we introduce in this work a measure based on C(α) torsion angles: torsion angles formed by four consecutive C(α) atoms. This measure describes changes in the backbones of large systems on a residual length scale (i.e., a small number of residues at a time). Cluster analysis of individual C(α) torsion angles and its fuzzification led to continuous time patches representing (meta)stable conformations and to the identification of events acting as transitions between these conformations. The importance of a change in torsion angle to structural integrity is assessed by comparing this change to the average fluctuations in the same torsion angle over the complete simulation. Using this novel measure in combination with other measures such as the root mean square deviation (RMSD) and time series of distance measures, we performed an in-depth analysis of a simulation of the open form of DNA polymerase I. The times at which major conformational changes occur and the most important parts of the molecule and their interrelations were pinpointed in this analysis. The simultaneous determination of the time points and localizations of major events is a significant advantage of the new bottom-up approach presented here, as compared to many other (top-down) approaches in which only the similarity of the complete structure is analyzed.
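The central quantity here, a torsion angle formed by four consecutive C(α) atoms, can be sketched with the standard dihedral formula. The function names and array layout below are assumptions for illustration and are not taken from the paper; the clustering and fuzzification steps are not shown.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four 3D points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    # Unit vector along the central bond
    b1 = b1 / np.linalg.norm(b1)
    # Components of the outer bonds perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

def ca_torsions(ca_coords):
    """Torsion angles for every run of four consecutive C-alpha atoms.

    ca_coords has shape (n_residues, 3); the result has n_residues - 3 angles.
    """
    ca = np.asarray(ca_coords, dtype=float)
    return np.array([dihedral(*ca[i:i + 4]) for i in range(len(ca) - 3)])
```

Applying `ca_torsions` to each frame gives one time series per torsion angle, which is the kind of per-angle signal the abstract describes clustering and comparing against its average fluctuation.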
4
Langmead CJ. Generative models of conformational dynamics. Adv Exp Med Biol 2014; 805:87-105. [PMID: 24446358 PMCID: PMC4090804 DOI: 10.1007/978-3-319-02970-2_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2023]
Abstract
Atomistic simulations of the conformational dynamics of proteins can be performed using either Molecular Dynamics or Monte Carlo procedures. The ensembles of three-dimensional structures produced during simulation can be analyzed in a number of ways to elucidate the thermodynamic and kinetic properties of the system. The goal of this chapter is to review both traditional and emerging methods for learning generative models from atomistic simulation data. Here, the term 'generative' refers to a model of the joint probability distribution over the behaviors of the constituent atoms. In the context of molecular modeling, generative models reveal the correlation structure between the atoms, and may be used to predict how the system will respond to structural perturbations. We begin by discussing traditional methods, which produce multivariate Gaussian models. We then discuss GAMELAN (Graphical Models of Energy Landscapes), which produces generative models of complex, non-Gaussian conformational dynamics (e.g., allostery, binding, folding, etc.) from long timescale simulation data.
5
Ramanathan A, Savol AJ, Agarwal PK, Chennubhotla CS. Event detection and sub-state discovery from biomolecular simulations using higher-order statistics: application to enzyme adenylate kinase. Proteins 2012; 80:2536-51. [PMID: 22733562 DOI: 10.1002/prot.24135] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 01/24/2012] [Revised: 05/08/2012] [Accepted: 06/10/2012] [Indexed: 12/25/2022]
Abstract
Biomolecular simulations at millisecond and longer time-scales can provide vital insights into functional mechanisms. Because post-simulation analyses of such large trajectory datasets can be a limiting factor in obtaining biological insights, there is an emerging need to identify key dynamical events and relate these events to the biological function online, that is, as simulations are progressing. Recently, we have introduced a novel computational technique, quasi-anharmonic analysis (QAA) (Ramanathan et al., PLoS One 2011;6:e15827), for partitioning the conformational landscape into a hierarchy of functionally relevant sub-states. The unique capabilities of QAA are enabled by exploiting anharmonicity in the form of fourth-order statistics for characterizing atomic fluctuations. In this article, we extend QAA for analyzing long time-scale simulations online. In particular, we present HOST4MD, a higher-order statistical toolbox for molecular dynamics simulations, which (1) identifies key dynamical events as simulations are in progress, (2) explores potential sub-states, and (3) identifies conformational transitions that enable the protein to access those sub-states. We demonstrate HOST4MD on microsecond timescale simulations of the enzyme adenylate kinase in its apo state. HOST4MD identifies several conformational events in these simulations, revealing how the intrinsic coupling between the three subdomains (LID, CORE, and NMP) changes during the simulations. Further, it also identifies an inherent asymmetry in the opening/closing of the two binding sites. We anticipate that HOST4MD will provide a powerful and extensible framework for detecting biophysically relevant conformational coordinates from long time-scale simulations.
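To make "anharmonicity in the form of fourth-order statistics" concrete, the sketch below computes the excess kurtosis of each coordinate's fluctuations, which is zero for a Gaussian (harmonic) distribution and nonzero otherwise. This is only a minimal illustration of the underlying statistic, not QAA or HOST4MD; the function names and the threshold value are hypothetical.

```python
import numpy as np

def excess_kurtosis(x):
    """Per-coordinate excess kurtosis for samples of shape (n_frames, n_coords)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    m2 = np.mean(centered ** 2, axis=0)  # second central moment
    m4 = np.mean(centered ** 4, axis=0)  # fourth central moment
    return m4 / m2 ** 2 - 3.0

def anharmonic_coordinates(x, threshold=1.0):
    """Indices of coordinates whose |excess kurtosis| exceeds an assumed threshold."""
    return np.flatnonzero(np.abs(excess_kurtosis(x)) > threshold)
```

Coordinates flagged this way have long-tailed or otherwise non-Gaussian fluctuations, which is the kind of signal a fourth-order method is designed to pick up.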
Affiliation(s)
- Arvind Ramanathan
- Computational Biology Institute & Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
6
Vitalis A, Caflisch A. Efficient Construction of Mesostate Networks from Molecular Dynamics Trajectories. J Chem Theory Comput 2012; 8:1108-20. [PMID: 26593370 DOI: 10.1021/ct200801b] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Indexed: 12/24/2022]
Abstract
The coarse-graining of data from molecular simulations yields conformational space networks that may be used for predicting the system's long time scale behavior, to discover structural pathways connecting free energy basins in the system, or simply to represent accessible phase space regions of interest and their connectivities in a two-dimensional plot. In this contribution, we present a tree-based algorithm to partition conformations of biomolecules into sets of similar microstates, i.e., to coarse-grain trajectory data into mesostates. On account of utilizing an architecture similar to that of established tree-based algorithms, the proposed scheme operates in near-linear time with data set size. We derive expressions needed for the fast evaluation of mesostate properties and distances when employing typical choices for measures of similarity between microstates. Using both a pedagogically useful and a real-world application, the algorithm is shown to be robust with respect to tree height, which in addition to mesostate threshold size is the main adjustable parameter. It is demonstrated that the derived mesostate networks can preserve information regarding the free energy basins and barriers by which the system is characterized.
Affiliation(s)
- Andreas Vitalis
- Department of Biochemistry, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
- Amedeo Caflisch
- Department of Biochemistry, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland
7
Abstract
We introduce three algorithms for learning generative models of molecular structures from molecular dynamics simulations. The first algorithm learns a Bayesian-optimal undirected probabilistic model over user-specified covariates (e.g., fluctuations, distances, angles, etc.). L1 regularization is used to ensure sparse models and thus reduce the risk of over-fitting the data. The topology of the resulting model reveals important couplings between different parts of the protein, thus aiding in the analysis of molecular motions. The generative nature of the model makes it well-suited to making predictions about the global effects of local structural changes (e.g., the binding of an allosteric regulator). Additionally, the model can be used to sample new conformations. The second algorithm learns a time-varying graphical model where the topology and parameters change smoothly along the trajectory, revealing the conformational sub-states. The last algorithm learns a Markov Chain over undirected graphical models which can be used to study and simulate kinetics. We demonstrate our algorithms on multiple molecular dynamics trajectories.
8
Savol AJ, Burger VM, Agarwal PK, Ramanathan A, Chennubhotla CS. QAARM: quasi-anharmonic autoregressive model reveals molecular recognition pathways in ubiquitin. Bioinformatics 2011; 27:i52-60. [PMID: 21685101 PMCID: PMC3117343 DOI: 10.1093/bioinformatics/btr248] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Molecular dynamics (MD) simulations have dramatically improved the atomistic understanding of protein motions, energetics and function. These growing datasets have necessitated a corresponding emphasis on trajectory analysis methods for characterizing simulation data, particularly since functional protein motions and transitions are often rare and/or intricate events. Observing that such events give rise to long-tailed spatial distributions, we recently developed a higher-order statistics based dimensionality reduction method, called quasi-anharmonic analysis (QAA), for identifying biophysically-relevant reaction coordinates and substates within MD simulations. Further characterization of conformation space should consider the temporal dynamics specific to each identified substate. RESULTS Our model uses hierarchical clustering to learn energetically coherent substates and dynamic modes of motion from a 0.5 μs ubiquitin simulation. Autoregressive (AR) modeling within and between states enables a compact and generative description of the conformational landscape as it relates to functional transitions between binding poses. Lacking a predictive component, QAA is extended here within a general AR model appreciative of the trajectory's temporal dependencies and the specific, local dynamics accessible to a protein within identified energy wells. These metastable states and their transition rates are extracted within a QAA-derived subspace using hierarchical Markov clustering to provide parameter sets for the second-order AR model. We show the learned model can be extrapolated to synthesize trajectories of arbitrary length. CONTACT ramanathana@ornl.gov; chakracs@pitt.edu.
Affiliation(s)
- Andrej J Savol
- Joint Carnegie Mellon University-University of Pittsburgh Ph.D. Program in Computational Biology, Department of Computational and Systems Biology, University of Pittsburgh, PA 15260, USA