1
|
Lin X, Gao Y, Lei F. An application of topological data analysis in predicting sumoylation sites. PeerJ 2023; 11:e16204. [PMID: 37846308 PMCID: PMC10576966 DOI: 10.7717/peerj.16204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2023] [Accepted: 09/08/2023] [Indexed: 10/18/2023] Open
Abstract
Sumoylation is a reversible post-translational modification that regulates certain significant biochemical functions in proteins. The protein alterations caused by sumoylation are associated with the incidence of some human diseases. Therefore, identifying the sites of sumoylation in proteins may provide a direction for mechanistic research and drug development. Here, we propose a new computational approach for identifying sumoylation sites using an encoding method based on topological data analysis. The features of our model captured the key physical and biological properties of proteins at multiple scales. In a 10-fold cross validation, the outcomes of our model showed 96.45% of sensitivity (Sn), 94.65% of accuracy (Acc), 0.8946 of Matthew's correlation coefficient (MCC), and 0.99 of area under curve (AUC). The proposed predictor with only topological features achieves the best MCC and AUC in comparison to the other released methods. Our results suggest that topological information is an additional parameter that can assist in the prediction of sumoylation sites and provide a novel perspective for further research in protein sumoylation.
Collapse
Affiliation(s)
- Xiaoxi Lin
- School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
| | - Yaru Gao
- School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
| | - Fengchun Lei
- School of Mathematical Sciences, Dalian University of Technology, Dalian, Liaoning, China
| |
Collapse
|
2
|
Xia K, Liu X, Wee J. Persistent Homology for RNA Data Analysis. Methods Mol Biol 2023; 2627:211-229. [PMID: 36959450 DOI: 10.1007/978-1-0716-2974-1_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/25/2023]
Abstract
Molecular representations are of great importance for machine learning models in RNA data analysis. Essentially, efficient molecular descriptors or fingerprints that characterize the intrinsic structural and interactional information of RNAs can significantly boost the performance of all learning modeling. In this paper, we introduce two persistent models, including persistent homology and persistent spectral, for RNA structure and interaction representations and their applications in RNA data analysis. Different from traditional geometric and graph representations, persistent homology is built on simplicial complex, which is a generalization of graph models to higher-dimensional situations. Hypergraph is a further generalization of simplicial complexes and hypergraph-based embedded persistent homology has been proposed recently. Moreover, persistent spectral models, which combine filtration process with spectral models, including spectral graph, spectral simplicial complex, and spectral hypergraph, are proposed for molecular representation. The persistent attributes for RNAs can be obtained from these two persistent models and further combined with machine learning models for RNA structure, flexibility, dynamics, and function analysis.
Collapse
Affiliation(s)
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore.
| | - Xiang Liu
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
- Chern Institute of Mathematics and LPMC, Nankai University, Tianjin, China
| | - JunJie Wee
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
| |
Collapse
|
3
|
Liu J, Xia KL, Wu J, Yau SST, Wei GW. Biomolecular Topology: Modelling and Analysis. ACTA MATHEMATICA SINICA, ENGLISH SERIES 2022; 38:1901-1938. [PMID: 36407804 PMCID: PMC9640850 DOI: 10.1007/s10114-022-2326-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Accepted: 07/12/2022] [Indexed: 05/25/2023]
Abstract
With the great advancement of experimental tools, a tremendous amount of biomolecular data has been generated and accumulated in various databases. The high dimensionality, structural complexity, the nonlinearity, and entanglements of biomolecular data, ranging from DNA knots, RNA secondary structures, protein folding configurations, chromosomes, DNA origami, molecular assembly, to others at the macromolecular level, pose a severe challenge in their analysis and characterization. In the past few decades, mathematical concepts, models, algorithms, and tools from algebraic topology, combinatorial topology, computational topology, and topological data analysis, have demonstrated great power and begun to play an essential role in tackling the biomolecular data challenge. In this work, we introduce biomolecular topology, which concerns the topological problems and models originated from the biomolecular systems. More specifically, the biomolecular topology encompasses topological structures, properties and relations that are emerged from biomolecular structures, dynamics, interactions, and functions. We discuss the various types of biomolecular topology from structures (of proteins, DNAs, and RNAs), protein folding, and protein assembly. A brief discussion of databanks (and databases), theoretical models, and computational algorithms, is presented. Further, we systematically review related topological models, including graphs, simplicial complexes, persistent homology, persistent Laplacians, de Rham-Hodge theory, Yau-Hausdorff distance, and the topology-based machine learning models.
Collapse
Affiliation(s)
- Jian Liu
- School of Mathematical Sciences, Hebei Normal University, Shijiazhuang, 050024 P. R. China
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
| | - Ke-Lin Xia
- School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, 639798 Singapore
| | - Jie Wu
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
| | - Stephen Shing-Toung Yau
- Yanqi Lake Beijing Institute of Mathematical Sciences and Applications, Beijing, 101408 P. R. China
- Department of Mathematical Sciences, Tsinghua University, Beijing, 100084 P. R. China
| | - Guo-Wei Wei
- Department of Mathematics & Department of Biochemistry and Molecular Biology & Department of Electrical and Computer Engineering, Michigan State University, Wells Hall 619 Red Cedar Road, East Lansing, MI 48824-1027 USA
| |
Collapse
|
4
|
Pun CS, Lee SX, Xia K. Persistent-homology-based machine learning: a survey and a comparative study. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10146-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
5
|
Ichinomiya T. Topological data analysis gives two folding paths in HP35(nle-nle), double mutant of villin headpiece subdomain. Sci Rep 2022; 12:2719. [PMID: 35177744 PMCID: PMC8854739 DOI: 10.1038/s41598-022-06682-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/11/2021] [Accepted: 02/04/2022] [Indexed: 11/16/2022] Open
Abstract
The folding dynamics of proteins is a primary area of interest in protein science. We carried out topological data analysis (TDA) of the folding process of HP35(nle-nle), a double-mutant of the villin headpiece subdomain. Using persistent homology and non-negative matrix factorization, we reduced the dimension of protein structure and investigated the flow in the reduced space. We found this protein has two folding paths, distinguished by the pairings of inter-helix residues. Our analysis showed the excellent performance of TDA in capturing the formation of tertiary structure.
Collapse
Affiliation(s)
- Takashi Ichinomiya
- Department of Systems Biology, Gifu University School of Medicine, Yanagido 1-1, Gifu, 501-1194, Japan. .,The United Graduate School of Drug Discovery and Medical Information Sciences of Gifu University, Yanagido 1-1, Gifu, 501-1194, Japan.
| |
Collapse
|
6
|
Chen J, Zhao R, Tong Y, Wei GW. EVOLUTIONARY DE RHAM-HODGE METHOD. DISCRETE AND CONTINUOUS DYNAMICAL SYSTEMS. SERIES B 2021; 26:3785-3821. [PMID: 34675756 DOI: 10.3934/dcdsb.2020257] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/29/2022]
Abstract
The de Rham-Hodge theory is a landmark of the 20th Century's mathematics and has had a great impact on mathematics, physics, computer science, and engineering. This work introduces an evolutionary de Rham-Hodge method to provide a unified paradigm for the multiscale geometric and topological analysis of evolving manifolds constructed from a filtration, which induces a family of evolutionary de Rham complexes. While the present method can be easily applied to close manifolds, the emphasis is given to more challenging compact manifolds with 2-manifold boundaries, which require appropriate analysis and treatment of boundary conditions on differential forms to maintain proper topological properties. Three sets of unique evolutionary Hodge Laplacians are proposed to generate three sets of topology-preserving singular spectra, for which the multiplicities of zero eigenvalues correspond to exactly the persistent Betti numbers of dimensions 0, 1 and 2. Additionally, three sets of non-zero eigenvalues further reveal both topological persistence and geometric progression during the manifold evolution. Extensive numerical experiments are carried out via the discrete exterior calculus to demonstrate the potential of the proposed paradigm for data representation and shape analysis of both point cloud data and density maps. To demonstrate the utility of the proposed method, the application is considered to the protein B-factor predictions of a few challenging cases for which existing biophysical models break down.
Collapse
Affiliation(s)
- Jiahui Chen
- Department of Mathematics, Michigan State University, MI 48824, USA
| | - Rundong Zhao
- Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
| | - Yiying Tong
- Department of Computer Science and Engineering, Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA
| |
Collapse
|
7
|
Yen PTW, Xia K, Cheong SA. Understanding Changes in the Topology and Geometry of Financial Market Correlations during a Market Crash. ENTROPY (BASEL, SWITZERLAND) 2021; 23:1211. [PMID: 34573837 PMCID: PMC8467365 DOI: 10.3390/e23091211] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/19/2021] [Revised: 09/05/2021] [Accepted: 09/06/2021] [Indexed: 12/24/2022]
Abstract
In econophysics, the achievements of information filtering methods over the past 20 years, such as the minimal spanning tree (MST) by Mantegna and the planar maximally filtered graph (PMFG) by Tumminello et al., should be celebrated. Here, we show how one can systematically improve upon this paradigm along two separate directions. First, we used topological data analysis (TDA) to extend the notions of nodes and links in networks to faces, tetrahedrons, or k-simplices in simplicial complexes. Second, we used the Ollivier-Ricci curvature (ORC) to acquire geometric information that cannot be provided by simple information filtering. In this sense, MSTs and PMFGs are but first steps to revealing the topological backbones of financial networks. This is something that TDA can elucidate more fully, following which the ORC can help us flesh out the geometry of financial networks. We applied these two approaches to a recent stock market crash in Taiwan and found that, beyond fusions and fissions, other non-fusion/fission processes such as cavitation, annihilation, rupture, healing, and puncture might also be important. We also successfully identified neck regions that emerged during the crash, based on their negative ORCs, and performed a case study on one such neck region.
Collapse
Affiliation(s)
- Peter Tsung-Wen Yen
- Center for Crystal Researches, National Sun Yet-Sen University, No. 70, Lien-hai Rd., Kaohsiung 80424, Taiwan;
| | - Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore 637371, Singapore;
| | - Siew Ann Cheong
- Division of Physics and Applied Physics, School of Physical and Mathematical Sciences, Nanyang Technological University, 21 Nanyang Link, Singapore 637371, Singapore
| |
Collapse
|
8
|
Terebus A, Manuchehrfar F, Cao Y, Liang J. Exact Probability Landscapes of Stochastic Phenotype Switching in Feed-Forward Loops: Phase Diagrams of Multimodality. Front Genet 2021; 12:645640. [PMID: 34306004 PMCID: PMC8297706 DOI: 10.3389/fgene.2021.645640] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2020] [Accepted: 04/26/2021] [Indexed: 11/13/2022] Open
Abstract
Feed-forward loops (FFLs) are among the most ubiquitously found motifs of reaction networks in nature. However, little is known about their stochastic behavior and the variety of network phenotypes they can exhibit. In this study, we provide full characterizations of the properties of stochastic multimodality of FFLs, and how switching between different network phenotypes are controlled. We have computed the exact steady-state probability landscapes of all eight types of coherent and incoherent FFLs using the finite-butter Accurate Chemical Master Equation (ACME) algorithm, and quantified the exact topological features of their high-dimensional probability landscapes using persistent homology. Through analysis of the degree of multimodality for each of a set of 10,812 probability landscapes, where each landscape resides over 105–106 microstates, we have constructed comprehensive phase diagrams of all relevant behavior of FFL multimodality over broad ranges of input and regulation intensities, as well as different regimes of promoter binding dynamics. In addition, we have quantified the topological sensitivity of the multimodality of the landscapes to regulation intensities. Our results show that with slow binding and unbinding dynamics of transcription factor to promoter, FFLs exhibit strong stochastic behavior that is very different from what would be inferred from deterministic models. In addition, input intensity play major roles in the phenotypes of FFLs: At weak input intensity, FFL exhibit monomodality, but strong input intensity may result in up to 6 stable phenotypes. Furthermore, we found that gene duplication can enlarge stable regions of specific multimodalities and enrich the phenotypic diversity of FFL networks, providing means for cells toward better adaptation to changing environment. Our results are directly applicable to analysis of behavior of FFLs in biological processes such as stem cell differentiation and for design of synthetic networks when certain phenotypic behavior is desired.
Collapse
Affiliation(s)
- Anna Terebus
- Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States.,Constellation, Baltimore, MD, United States
| | - Farid Manuchehrfar
- Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
| | - Youfang Cao
- Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States.,Merck & Co., Inc., Kenilworth, NJ, United States
| | - Jie Liang
- Center for Bioinformatics and Quantitative Biology, Richard and Loan Hill Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
| |
Collapse
|
9
|
Manuchehrfar F, Li H, Tian W, Ma A, Liang J. Exact Topology of the Dynamic Probability Surface of an Activated Process by Persistent Homology. J Phys Chem B 2021; 125:4667-4680. [PMID: 33938737 DOI: 10.1021/acs.jpcb.1c00904] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
To gain insight into the reaction mechanism of activated processes, we introduce an exact approach for quantifying the topology of high-dimensional probability surfaces of the underlying dynamic processes. Instead of Morse indexes, we study the homology groups of a sequence of superlevel sets of the probability surface over high-dimensional configuration spaces using persistent homology. For alanine-dipeptide isomerization, a prototype of activated processes, we identify locations of probability peaks and connecting ridges, along with measures of their global prominence. Instead of a saddle point, the transition state ensemble (TSE) of conformations is at the most prominent probability peak after reactants/products, when proper reaction coordinates are included. Intuition-based models, even those exhibiting a double-well, fail to capture the dynamics of the activated process. Peak occurrence, prominence, and locations can be distorted upon subspace projection. While principal component analysis accounts for conformational variance, it inflates the complexity of the surface topology and destroys the dynamic properties of the topological features. In contrast, TSE emerges naturally as the most prominent peak beyond the reactant/product basins, when projected to a subspace of minimum dimension containing the reaction coordinates. Our approach is general and can be applied to investigate the topology of high-dimensional probability surfaces of other activated processes.
Collapse
Affiliation(s)
- Farid Manuchehrfar
- Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
| | - Huiyu Li
- Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
| | - Wei Tian
- Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
| | - Ao Ma
- Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
| | - Jie Liang
- Center for Bioinformatics and Quantiative Biology and Department of Bioengneering, University of Illinois at Chicago, Chicago, Illinois 60607, United States
| |
Collapse
|
10
|
Cang Z, Munch E, Wei GW. Evolutionary homology on coupled dynamical systems with applications to protein flexibility analysis. ACTA ACUST UNITED AC 2020; 4:481-507. [PMID: 34179350 DOI: 10.1007/s41468-020-00057-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
While the spatial topological persistence is naturally constructed from a radius-based filtration, it has hardly been derived from a temporal filtration. Most topological models are designed for the global topology of a given object as a whole. There is no method reported in the literature for the topology of an individual component in an object to the best of our knowledge. For many problems in science and engineering, the topology of an individual component is important for describing its properties. We propose evolutionary homology (EH) constructed via a time evolution-based filtration and topological persistence. Our approach couples a set of dynamical systems or chaotic oscillators by the interactions of a physical system, such as a macromolecule. The interactions are approximated by weighted graph Laplacians. Simplices, simplicial complexes, algebraic groups and topological persistence are defined on the coupled trajectories of the chaotic oscillators. The resulting EH gives rise to time-dependent topological invariants or evolutionary barcodes for an individual component of the physical system, revealing its topology-function relationship. In conjunction with Wasserstein metrics, the proposed EH is applied to protein flexibility analysis, an important problem in computational biophysics. Numerical results for the B-factor prediction of a benchmark set of 364 proteins indicate that the proposed EH outperforms all the other state-of-the-art methods in the field.
Collapse
Affiliation(s)
- Zixuan Cang
- Department of Mathematics, Michigan State University
| | - Elizabeth Munch
- Department of Computational Mathematics, Science and Engineering, Michigan State University
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University
| |
Collapse
|
11
|
Weighted persistent homology for osmolyte molecular aggregation and hydrogen-bonding network analysis. Sci Rep 2020; 10:9685. [PMID: 32546801 PMCID: PMC7297731 DOI: 10.1038/s41598-020-66710-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2019] [Accepted: 05/20/2020] [Indexed: 12/24/2022] Open
Abstract
It has long been observed that trimethylamine N-oxide (TMAO) and urea demonstrate dramatically different properties in a protein folding process. Even with the enormous theoretical and experimental research work on these two osmolytes, various aspects of their underlying mechanisms still remain largely elusive. In this paper, we propose to use the weighted persistent homology to systematically study the osmolytes molecular aggregation and their hydrogen-bonding network from a local topological perspective. We consider two weighted models, i.e., localized persistent homology (LPH) and interactive persistent homology (IPH). Boltzmann persistent entropy (BPE) is proposed to quantitatively characterize the topological features from LPH and IPH, together with persistent Betti number (PBN). More specifically, from the localized persistent homology models, we have found that TMAO and urea have very different local topology. TMAO is found to exhibit a local network structure. With the concentration increase, the circle elements in these networks show a clear increase in their total numbers and a decrease in their relative sizes. In contrast, urea shows two types of local topological patterns, i.e., local clusters around 6 Å and a few global circle elements at around 12 Å. From the interactive persistent homology models, it has been found that our persistent radial distribution function (PRDF) from the global-scale IPH has same physical properties as the traditional radial distribution function. Moreover, PRDFs from the local-scale IPH can also be generated and used to characterize the local interaction information. Other than the clear difference of the first peak value of PRDFs at filtration size 4 Å, TMAO and urea also shows very different behaviors at the second peak region from filtration size 5 Å to 10 Å. These differences are also reflected in the PBNs and BPEs of the local-scale IPH. These localized topological information has never been revealed before. Since graphs can be transferred into simplicial complexes by the clique complex, our weighted persistent homology models can be used in the analysis of various networks and graphs from any molecular structures and aggregation systems.
Collapse
|
12
|
Cang Z, Wei GW. Persistent Cohomology for Data With Multicomponent Heterogeneous Information. SIAM JOURNAL ON MATHEMATICS OF DATA SCIENCE 2020; 2:396-418. [PMID: 34222831 PMCID: PMC8249079 DOI: 10.1137/19m1272226] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Persistent homology is a powerful tool for characterizing the topology of a data set at various geometric scales. When applied to the description of molecular structures, persistent homology can capture the multiscale geometric features and reveal certain interaction patterns in terms of topological invariants. However, in addition to the geometric information, there is a wide variety of nongeometric information of molecular structures, such as element types, atomic partial charges, atomic pairwise interactions, and electrostatic potential functions, that is not described by persistent homology. Although element-specific homology and electrostatic persistent homology can encode some nongeometric information into geometry based topological invariants, it is desirable to have a mathematical paradigm to systematically embed both geometric and nongeometric information, i.e., multicomponent heterogeneous information, into unified topological representations. To this end, we propose a persistent cohomology based framework for the enriched representation of data. In our framework, nongeometric information can either be distributed globally or reside locally on the datasets in the geometric sense and can be properly defined on topological spaces, i.e., simplicial complexes. Using the proposed persistent cohomology based framework, enriched barcodes are extracted from datasets to represent heterogeneous information. We consider a variety of datasets to validate the present formulation and illustrate the usefulness of the proposed method based on persistent cohomology. It is found that the proposed framework outperforms or at least matches the state-of-the-art methods in the protein-ligand binding affinity prediction from massive biomolecular datasets without resorting to any deep learning formulation.
Collapse
Affiliation(s)
- Zixuan Cang
- Department of Mathematics, Michigan State University, East Lansing, MI 48824
| | - Guo-Wei Wei
- Department of Mathematics, Department of Biochemistry and Molecular Biology, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI, 48824
| |
Collapse
|
13
|
Ichinomiya T, Obayashi I, Hiraoka Y. Protein-Folding Analysis Using Features Obtained by Persistent Homology. Biophys J 2020; 118:2926-2937. [PMID: 32428439 PMCID: PMC7300307 DOI: 10.1016/j.bpj.2020.04.032] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2019] [Revised: 02/16/2020] [Accepted: 04/17/2020] [Indexed: 10/25/2022] Open
Abstract
Understanding the protein-folding process is an outstanding issue in biophysics; recent developments in molecular dynamics simulation have provided insights into this phenomenon. However, the large freedom of atomic motion hinders the understanding of this process. In this study, we applied persistent homology, an emerging method to analyze topological features in a data set, to reveal protein-folding dynamics. We developed a new, to our knowledge, method to characterize the protein structure based on persistent homology and applied this method to molecular dynamics simulations of chignolin. Using principle component analysis or nonnegative matrix factorization, our analysis method revealed two stable states and one saddle state, corresponding to the native, misfolded, and transition states, respectively. We also identified an unfolded state with slow dynamics in the reduced space. Our method serves as a promising tool to understand the protein-folding process.
Collapse
Affiliation(s)
- Takashi Ichinomiya
- Gifu University School of Medicine, Gifu, Japan; The United Graduate School of Drug Discovery and Medical Information Sciences of Gifu University, Gifu, Japan.
| | - Ippei Obayashi
- Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
| | - Yasuaki Hiraoka
- WPI-ASHBi, Kyoto University Institute for Advanced Study, Kyoto University, Kyoto, Japan; Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
| |
Collapse
|
14
|
Abstract
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on protein-ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energies, toxicity, partition coefficients, protein folding stability changes upon mutation, etc.
Collapse
Affiliation(s)
- Duc Duy Nguyen
- Department of Mathematics, Michigan State University, MI 48824, USA.
| | - Zixuan Cang
- Department of Mathematics, Michigan State University, MI 48824, USA.
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, MI 48824, USA. and Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA and Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
| |
Collapse
|
15
|
Weighted persistent homology for biomolecular data analysis. Sci Rep 2020; 10:2079. [PMID: 32034168 PMCID: PMC7005716 DOI: 10.1038/s41598-019-55660-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2019] [Accepted: 11/29/2019] [Indexed: 11/08/2022] Open
Abstract
In this paper, we systematically review weighted persistent homology (WPH) models and their applications in biomolecular data analysis. Essentially, the weight value, which reflects physical, chemical and biological properties, can be assigned to vertices (atom centers), edges (bonds), or higher order simplexes (cluster of atoms), depending on the biomolecular structure, function, and dynamics properties. Further, we propose the first localized weighted persistent homology (LWPH). Inspired by the great success of element specific persistent homology (ESPH), we do not treat biomolecules as an inseparable system like all previous weighted models, instead we decompose them into a series of local domains, which may be overlapped with each other. The general persistent homology or weighted persistent homology analysis is then applied on each of these local domains. In this way, functional properties, that are embedded in local structures, can be revealed. Our model has been applied to systematically study DNA structures. It has been found that our LWPH based features can be used to successfully discriminate the A-, B-, and Z-types of DNA. More importantly, our LWPH based principal component analysis (PCA) model can identify two configurational states of DNA structures in ion liquid environment, which can be revealed only by the complicated helical coordinate system. The great consistence with the helical-coordinate model demonstrates that our model captures local structure variations so well that it is comparable with geometric models. Moreover, geometric measurements are usually defined in local regions. For instance, the helical-coordinate system is limited to one or two basepairs. However, our LWPH can quantitatively characterize structure information in regions or domains with arbitrary sizes and shapes, where traditional geometrical measurements fail.
Collapse
|
16
|
Bramer D, Wei GW. Atom-specific persistent homology and its application to protein flexibility analysis. COMPUTATIONAL AND MATHEMATICAL BIOPHYSICS 2020; 8:1-35. [PMID: 34278230 PMCID: PMC8281920 DOI: 10.1515/cmb-2020-0001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/19/2023] Open
Abstract
Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B-factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural network for protein thermal fluctuation analysis and B-factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information in complex macromolecules.
Collapse
Affiliation(s)
- David Bramer
- Department of Mathematics, Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Corresponding Author: Guo-WeiWei: Department of Mathematics, Michigan State University, MI 48824, USA; Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA; Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA,
| |
Collapse
|
17
|
Ulmer M, Ziegelmeier L, Topaz CM. A topological approach to selecting models of biological experiments. PLoS One 2019; 14:e0213679. [PMID: 30875410 PMCID: PMC6420156 DOI: 10.1371/journal.pone.0213679] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2018] [Accepted: 02/26/2019] [Indexed: 11/22/2022] Open
Abstract
We use topological data analysis as a tool to analyze the fit of mathematical models to experimental data. This study is built on data obtained from motion tracking groups of aphids in [Nilsen et al., PLOS One, 2013] and two random walk models that were proposed to describe the data. One model incorporates social interactions between the insects via a functional dependence on an aphid's distance to its nearest neighbor. The second model is a control model that ignores this dependence. We compare data from each model to data from experiment by performing statistical tests based on three different sets of measures. First, we use time series of order parameters commonly used in collective motion studies. These order parameters measure the overall polarization and angular momentum of the group, and do not rely on a priori knowledge of the models that produced the data. Second, we use order parameter time series that do rely on a priori knowledge, namely average distance to nearest neighbor and percentage of aphids moving. Third, we use computational persistent homology to calculate topological signatures of the data. Analysis of the a priori order parameters indicates that the interactive model better describes the experimental data than the control model does. The topological approach performs as well as these a priori order parameters and better than the other order parameters, suggesting the utility of the topological approach in the absence of specific knowledge of mechanisms underlying the data.
Collapse
Affiliation(s)
- M. Ulmer
- Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, Minnesota, United States of America
| | - Lori Ziegelmeier
- Department of Mathematics, Statistics, and Computer Science, Macalester College, Saint Paul, Minnesota, United States of America
| | - Chad M. Topaz
- Department of Mathematics and Statistics, Williams College, Williamstown, Massachusetts, United States of America
| |
Collapse
|
18
|
Grow C, Gao K, Nguyen DD, Wei GW. Generative network complex (GNC) for drug discovery. COMMUNICATIONS IN INFORMATION AND SYSTEMS 2019; 19:241-277. [PMID: 34257523 PMCID: PMC8274326 DOI: 10.4310/cis.2019.v19.n3.a2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
It remains a challenging task to generate a vast variety of novel compounds with desirable pharmacological properties. In this work, a generative network complex (GNC) is proposed as a new platform for designing novel compounds, predicting their physical and chemical properties, and selecting potential drug candidates that fulfill various druggable criteria such as binding affinity, solubility, partition coefficient, etc. We combine a SMILES string generator, which consists of an encoder, a drug-property controlled or regulated latent space, and a decoder, with verification deep neural networks, a target-specific three-dimensional (3D) pose generator, and mathematical deep learning networks to generate new compounds, predict their drug properties, construct 3D poses associated with target proteins, and reevaluate druggability, respectively. New compounds were generated in the latent space by either randomized output, controlled output, or optimized output. In our demonstration, 2.08 million and 2.8 million novel compounds are generated respectively for Cathepsin S and BACE targets. These new compounds are very different from the seeds and cover a larger chemical space. For potentially active compounds, their 3D poses are generated using a state-of-the-art method. The resulting 3D complexes are further evaluated for druggability by a championing deep learning algorithm based on algebraic topology, differential geometry, and algebraic graph theories. Performed on supercomputers, the whole process took less than one week. Therefore, our GNC is an efficient new paradigm for discovering new drug candidates.
Collapse
Affiliation(s)
- Christopher Grow
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Kaifu Gao
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Duc Duy Nguyen
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| |
Collapse
|
19
|
Xia K, Anand DV, Shikhar S, Mu Y. Persistent homology analysis of osmolyte molecular aggregation and their hydrogen-bonding networks. Phys Chem Chem Phys 2019; 21:21038-21048. [DOI: 10.1039/c9cp03009c] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Dramatically different patterns can be observed in the topological fingerprints for hydrogen-bonding networks from two types of osmolyte systems.
Collapse
Affiliation(s)
- Kelin Xia
- Division of Mathematical Sciences
- School of Physical and Mathematical Sciences
- School of Biological Sciences
- Nanyang Technological University
- Singapore
| | - D. Vijay Anand
- Division of Mathematical Sciences
- School of Physical and Mathematical Sciences
- School of Biological Sciences
- Nanyang Technological University
- Singapore
| | - Saxena Shikhar
- School of Biological Sciences
- Nanyang Technological University
- Singapore
| | - Yuguang Mu
- School of Biological Sciences
- Nanyang Technological University
- Singapore
| |
Collapse
|
20
|
Xia K. Persistent homology analysis of ion aggregations and hydrogen-bonding networks. Phys Chem Chem Phys 2018; 20:13448-13460. [PMID: 29722784 DOI: 10.1039/c8cp01552j] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
Despite the great advancement of experimental tools and theoretical models, a quantitative characterization of the microscopic structures of ion aggregates and their associated water hydrogen-bonding networks still remains a challenging problem. In this paper, a newly-invented mathematical method called persistent homology is introduced, for the first time, to quantitatively analyze the intrinsic topological properties of ion aggregation systems and hydrogen-bonding networks. The two most distinguishable properties of persistent homology analysis of assembly systems are as follows. First, it does not require a predefined bond length to construct the ion or hydrogen-bonding network. Persistent homology results are determined by the morphological structure of the data only. Second, it can directly measure the size of circles or holes in ion aggregates and hydrogen-bonding networks. To validate our model, we consider two well-studied systems, i.e., NaCl and KSCN solutions, generated from molecular dynamics simulations. They are believed to represent two morphological types of aggregation, i.e., local clusters and extended ion networks. It has been found that the two aggregation types have distinguishable topological features and can be characterized by our topological model very well. Further, we construct two types of networks, i.e., O-networks and H2O-networks, for analyzing the topological properties of hydrogen-bonding networks. It is found that for both models, KSCN systems demonstrate much more dramatic variations in their local circle structures with a concentration increase. A consistent increase of large-sized local circle structures is observed and the sizes of these circles become more and more diverse. In contrast, NaCl systems show no obvious increase of large-sized circles. Instead a consistent decline of the average size of the circle structures is observed and the sizes of these circles become more and more uniform with a concentration increase. As far as we know, these unique intrinsic topological features in ion aggregation systems have never been pointed out before. More importantly, our models can be directly used to quantitatively analyze the intrinsic topological invariants, including circles, loops, holes, and cavities, of any network-like structures, such as nanomaterials, colloidal systems, biomolecular assemblies, among others. These topological invariants cannot be described by traditional graph and network models.
Collapse
Affiliation(s)
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, School of Biological Sciences, Nanyang Technological University, 637371, Singapore.
| |
Collapse
|
21
|
Xia K. Sequence-based multiscale modeling for high-throughput chromosome conformation capture (Hi-C) data analysis. PLoS One 2018; 13:e0191899. [PMID: 29408904 PMCID: PMC5800693 DOI: 10.1371/journal.pone.0191899] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2017] [Accepted: 01/12/2018] [Indexed: 11/18/2022] Open
Abstract
In this paper, we introduce sequence-based multiscale modeling for biomolecular data analysis. We employ spectral clustering method in our modeling and reveal the difference between sequence-based global scale clustering and local scale clustering. Essentially, two types of distances, i.e., Euclidean (or spatial) distance and genomic (or sequential) distance, can be used in data clustering. Clusters from sequence-based global scale models optimize spatial distances, meaning spatially adjacent loci are more likely to be assigned into the same cluster. Sequence-based local scale models, on the other hand, result in clusters that optimize genomic distances. That is to say, in these models, sequentially adjoining loci tend to be cluster together. We propose two sequence-based multiscale models (SeqMMs) for the study of chromosome hierarchical structures, including genomic compartments and topological associated domains (TADs). We find that genomic compartments are determined only by global scale information in the Hi-C data. The removal of all the local interactions within a band region as large as 10 Mb in genomic distance has almost no significant influence on the final compartment results. Further, in TAD analysis, we find that when the sequential scale is small, a tiny variation of diagonal band region in a contact map will result in a great change in the predicted TAD boundaries. When the scale value is larger than a threshold value, the TAD boundaries become very consistent. This threshold value is highly related to TAD sizes. By the comparison of our results with those previously obtained using a spectral clustering model, we find that our method is more robust and reliable. Finally, we demonstrate that almost all TAD boundaries from both clustering methods are local minimum of a TAD summation function.
Collapse
Affiliation(s)
- Kelin Xia
- Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore
- School of Biological Sciences, Nanyang Technological University, Singapore 637371, Singapore
| |
Collapse
|
22
|
Cang Z, Mu L, Wei GW. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput Biol 2018; 14:e1005929. [PMID: 29309403 PMCID: PMC5774846 DOI: 10.1371/journal.pcbi.1005929] [Citation(s) in RCA: 139] [Impact Index Per Article: 23.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2017] [Revised: 01/19/2018] [Accepted: 12/15/2017] [Indexed: 12/05/2022] Open
Abstract
This work introduces a number of algebraic topology approaches, including multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. In contrast to the conventional persistent homology, multi-component persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for protein-ligand binding analysis and virtual screening of small molecules. Extensive numerical experiments involving 4,414 protein-ligand complexes from the PDBBind database and 128,374 ligand-target and decoy-target pairs in the DUD database are performed to test respectively the scoring power and the discriminatory power of the proposed topological learning strategies. It is demonstrated that the present topological learning outperforms other existing methods in protein-ligand binding affinity prediction and ligand-decoy discrimination.
Collapse
Affiliation(s)
- Zixuan Cang
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
| | - Lin Mu
- Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, United States of America
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
| |
Collapse
|
23
|
Centralities in simplicial complexes. Applications to protein interaction networks. J Theor Biol 2017; 438:46-60. [PMID: 29128505 DOI: 10.1016/j.jtbi.2017.11.003] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2017] [Revised: 11/01/2017] [Accepted: 11/07/2017] [Indexed: 01/01/2023]
Abstract
Complex networks can be used to represent complex systems which originate in the real world. Here we study a transformation of these complex networks into simplicial complexes, where cliques represent the simplices of the complex. We extend the concept of node centrality to that of simplicial centrality and study several mathematical properties of degree, closeness, betweenness, eigenvector, Katz, and subgraph centrality for simplicial complexes. We study the degree distributions of these centralities at the different levels. We also compare and describe the differences between the centralities at the different levels. Using these centralities we study a method for detecting essential proteins in PPI networks of cells and explain the varying abilities of the centrality measures at the different levels in identifying these essential proteins.
Collapse
|
24
|
Multiscale Persistent Functions for Biomolecular Structure Characterization. Bull Math Biol 2017; 80:1-31. [PMID: 29098540 DOI: 10.1007/s11538-017-0362-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2016] [Accepted: 10/19/2017] [Indexed: 10/18/2022]
Abstract
In this paper, we introduce multiscale persistent functions for biomolecular structure characterization. The essential idea is to combine our multiscale rigidity functions (MRFs) with persistent homology analysis, so as to construct a series of multiscale persistent functions, particularly multiscale persistent entropies, for structure characterization. To clarify the fundamental idea of our method, the multiscale persistent entropy (MPE) model is discussed in great detail. Mathematically, unlike the previous persistent entropy (Chintakunta et al. in Pattern Recognit 48(2):391-401, 2015; Merelli et al. in Entropy 17(10):6872-6892, 2015; Rucco et al. in: Proceedings of ECCS 2014, Springer, pp 117-128, 2016), a special resolution parameter is incorporated into our model. Various scales can be achieved by tuning its value. Physically, our MPE can be used in conformational entropy evaluation. More specifically, it is found that our method incorporates in it a natural classification scheme. This is achieved through a density filtration of an MRF built from angular distributions. To further validate our model, a systematical comparison with the traditional entropy evaluation model is done. It is found that our model is able to preserve the intrinsic topological features of biomolecular data much better than traditional approaches, particularly for resolutions in the intermediate range. Moreover, by comparing with traditional entropies from various grid sizes, bond angle-based methods and a persistent homology-based support vector machine method (Cang et al. in Mol Based Math Biol 3:140-162, 2015), we find that our MPE method gives the best results in terms of average true positive rate in a classic protein structure classification test. More interestingly, all-alpha and all-beta protein classes can be clearly separated from each other with zero error only in our model. Finally, a special protein structure index (PSI) is proposed, for the first time, to describe the "regularity" of protein structures. Basically, a protein structure is deemed as regular if it has a consistent and orderly configuration. Our PSI model is tested on a database of 110 proteins; we find that structures with larger portions of loops and intrinsically disorder regions are always associated with larger PSI, meaning an irregular configuration, while proteins with larger portions of secondary structures, i.e., alpha-helix or beta-sheet, have smaller PSI. Essentially, PSI can be used to describe the "regularity" information in any systems.
Collapse
|
25
|
Baldwin PR, Tan YZ, Eng ET, Rice WJ, Noble AJ, Negro CJ, Cianfrocco MA, Potter CS, Carragher B. Big data in cryoEM: automated collection, processing and accessibility of EM data. Curr Opin Microbiol 2017; 43:1-8. [PMID: 29100109 DOI: 10.1016/j.mib.2017.10.005] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2017] [Revised: 09/27/2017] [Accepted: 10/09/2017] [Indexed: 11/24/2022]
Abstract
The scope and complexity of cryogenic electron microscopy (cryoEM) data has greatly increased, and will continue to do so, due to recent and ongoing technical breakthroughs that have led to much improved resolutions for macromolecular structures solved using this method. This big data explosion includes single particle data as well as tomographic tilt series, both generally acquired as direct detector movies of ∼10-100 frames per image or per tilt-series. We provide a brief survey of the developments leading to the current status, and describe existing cryoEM pipelines, with an emphasis on the scope of data acquisition, methods for automation, and use of cloud storage and computing.
Collapse
Affiliation(s)
- Philip R Baldwin
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA
| | - Yong Zi Tan
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA; Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
| | - Edward T Eng
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA
| | - William J Rice
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA
| | - Alex J Noble
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA
| | - Carl J Negro
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA
| | - Michael A Cianfrocco
- Life Sciences Institute and Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| | - Clinton S Potter
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA; Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
| | - Bridget Carragher
- The National Resource for Automated Molecular Microscopy, Simons Electron Microscopy Center, New York Structural Biology Center, 89 Convent Ave, New York, NY 10027, USA; Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA.
| |
Collapse
|
26
|
Cang Z, Wei GW. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Comput Biol 2017; 13:e1005690. [PMID: 28749969 PMCID: PMC5549771 DOI: 10.1371/journal.pcbi.1005690] [Citation(s) in RCA: 155] [Impact Index Per Article: 22.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2017] [Revised: 08/08/2017] [Accepted: 07/18/2017] [Indexed: 11/18/2022] Open
Abstract
Although deep learning approaches have had tremendous success in image, video and audio processing, computer vision, and speech recognition, their applications to three-dimensional (3D) biomolecular structural data sets have been hindered by the geometric and biological complexity. To address this problem we introduce the element-specific persistent homology (ESPH) method. ESPH represents 3D complex geometry by one-dimensional (1D) topological invariants and retains important biological information via a multichannel image-like representation. This representation reveals hidden structure-function relationships in biomolecules. We further integrate ESPH and deep convolutional neural networks to construct a multichannel topological neural network (TopologyNet) for the predictions of protein-ligand binding affinities and protein stability changes upon mutation. To overcome the deep learning limitations from small and noisy training sets, we propose a multi-task multichannel topological convolutional neural network (MM-TCNN). We demonstrate that TopologyNet outperforms the latest methods in the prediction of protein-ligand binding affinities, mutation induced globular protein folding free energy changes, and mutation induced membrane protein folding free energy changes. AVAILABILITY weilab.math.msu.edu/TDL/.
Collapse
Affiliation(s)
- Zixuan Cang
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, MI 48824, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA
| |
Collapse
|
27
|
Xia K, Opron K, Wei GW. Multiscale Gaussian network model (mGNM) and multiscale anisotropic network model (mANM). J Chem Phys 2016; 143:204106. [PMID: 26627949 DOI: 10.1063/1.4936132] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
Gaussian network model (GNM) and anisotropic network model (ANM) are some of the most popular methods for the study of protein flexibility and related functions. In this work, we propose generalized GNM (gGNM) and ANM methods and show that the GNM Kirchhoff matrix can be built from the ideal low-pass filter, which is a special case of a wide class of correlation functions underpinning the linear scaling flexibility-rigidity index (FRI) method. Based on the mathematical structure of correlation functions, we propose a unified framework to construct generalized Kirchhoff matrices whose matrix inverse leads to gGNMs, whereas, the direct inverse of its diagonal elements gives rise to FRI method. With this connection, we further introduce two multiscale elastic network models, namely, multiscale GNM (mGNM) and multiscale ANM (mANM), which are able to incorporate different scales into the generalized Kirchhoff matrices or generalized Hessian matrices. We validate our new multiscale methods with extensive numerical experiments. We illustrate that gGNMs outperform the original GNM method in the B-factor prediction of a set of 364 proteins. We demonstrate that for a given correlation function, FRI and gGNM methods provide essentially identical B-factor predictions when the scale value in the correlation function is sufficiently large. More importantly, we reveal intrinsic multiscale behavior in protein structures. The proposed mGNM and mANM are able to capture this multiscale behavior and thus give rise to a significant improvement of more than 11% in B-factor predictions over the original GNM and ANM methods. We further demonstrate the benefits of our mGNM through the B-factor predictions of many proteins that fail the original GNM method. We show that the proposed mGNM can also be used to analyze protein domain separations. Finally, we showcase the ability of our mANM for the analysis of protein collective motions.
Collapse
Affiliation(s)
- Kelin Xia
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
| | - Kristopher Opron
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan 48824, USA
| | - Guo-Wei Wei
- Mathematical Biosciences Institute, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
28
|
Abstract
Persistent homology provides a new approach for the topological simplification of big data via measuring the life time of intrinsic topological features in a filtration process and has found its success in scientific and engineering applications. However, such a success is essentially limited to qualitative data classification and analysis. Indeed, persistent homology has rarely been employed for quantitative modeling and prediction. Additionally, the present persistent homology is a passive tool, rather than a proactive technique, for classification and analysis. In this work, we outline a general protocol to construct object-oriented persistent homology methods. By means of differential geometry theory of surfaces, we construct an objective functional, namely, a surface free energy defined on the data of interest. The minimization of the objective functional leads to a Laplace-Beltrami operator which generates a multiscale representation of the initial data and offers an objective oriented filtration process. The resulting differential geometry based object-oriented persistent homology is able to preserve desirable geometric features in the evolutionary filtration and enhances the corresponding topological persistence. The cubical complex based homology algorithm is employed in the present work to be compatible with the Cartesian representation of the Laplace-Beltrami flow. The proposed Laplace-Beltrami flow based persistent homology method is extensively validated. The consistence between Laplace-Beltrami flow based filtration and Euclidean distance based filtration is confirmed on the Vietoris-Rips complex for a large amount of numerical tests. The convergence and reliability of the present Laplace-Beltrami flow based cubical complex filtration approach are analyzed over various spatial and temporal mesh sizes. The Laplace-Beltrami flow based persistent homology approach is utilized to study the intrinsic topology of proteins and fullerene molecules. Based on a quantitative model which correlates the topological persistence of fullerene central cavity with the total curvature energy of the fullerene structure, the proposed method is used for the prediction of fullerene isomer stability. The efficiency and robustness of the present method are verified by more than 500 fullerene molecules. It is shown that the proposed persistent homology based quantitative model offers good predictions of total curvature energies for ten types of fullerene isomers. The present work offers the first example to design object-oriented persistent homology to enhance or preserve desirable features in the original data during the filtration process and then automatically detect or extract the corresponding topological traits from the data.
Collapse
Affiliation(s)
- Bao Wang
- Department of Mathematics Michigan State University, MI 48824, USA
| | - Guo-Wei Wei
- Mathematical Biosciences Institute, The Ohio State University, Columbus, Ohio 43210, USA
| |
Collapse
|
29
|
Xia K, Zhao Z, Wei GW. Multiresolution persistent homology for excessively large biomolecular datasets. J Chem Phys 2015; 143:134103. [PMID: 26450288 PMCID: PMC4592433 DOI: 10.1063/1.4931733] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2015] [Accepted: 09/08/2015] [Indexed: 12/21/2022] Open
Abstract
Although persistent homology has emerged as a promising tool for the topological simplification of complex data, it is computationally intractable for large datasets. We introduce multiresolution persistent homology to handle excessively large datasets. We match the resolution with the scale of interest so as to represent large scale datasets with appropriate resolution. We utilize flexibility-rigidity index to access the topological connectivity of the data set and define a rigidity density for the filtration analysis. By appropriately tuning the resolution of the rigidity density, we are able to focus the topological lens on the scale of interest. The proposed multiresolution topological analysis is validated by a hexagonal fractal image which has three distinct scales. We further demonstrate the proposed method for extracting topological fingerprints from DNA molecules. In particular, the topological persistence of a virus capsid with 273 780 atoms is successfully analyzed which would otherwise be inaccessible to the normal point cloud method and unreliable by using coarse-grained multiscale persistent homology. The proposed method has also been successfully applied to the protein domain classification, which is the first time that persistent homology is used for practical protein domain analysis, to our knowledge. The proposed multiresolution topological method has potential applications in arbitrary data sets, such as social networks, biological networks, and graphs.
Collapse
Affiliation(s)
- Kelin Xia
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
| | - Zhixiong Zhao
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan 48824, USA
| |
Collapse
|
30
|
Abstract
Persistent homology has been advocated as a new strategy for the topological simplification of complex data. However, it is computationally intractable for large data sets. In this work, we introduce multiresolution persistent homology for tackling large datasets. Our basic idea is to match the resolution with the scale of interest so as to create a topological microscopy for the underlying data. We adjust the resolution via a rigidity density-based filtration. The proposed multiresolution topological analysis is validated by the study of a complex RNA molecule.
Collapse
Affiliation(s)
- Kelin Xia
- 1 Department of Mathematics, Michigan State University , East Lansing, Michigan
| | - Zhixiong Zhao
- 1 Department of Mathematics, Michigan State University , East Lansing, Michigan
| | - Guo-Wei Wei
- 1 Department of Mathematics, Michigan State University , East Lansing, Michigan.,2 Department of Electrical and Computer Engineering, Michigan State University , East Lansing, Michigan.,3 Department of Biochemistry and Molecular Biology, Michigan State University , East Lansing, Michigan
| |
Collapse
|