1
Kombo DC, LaMarche MJ, Konkankit CC, Rackovsky S. Application of artificial intelligence and machine learning techniques to the analysis of dynamic protein sequences. Proteins 2024. [PMID: 38808365 DOI: 10.1002/prot.26704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 05/07/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
We apply methods of Artificial Intelligence and Machine Learning to protein dynamic bioinformatics. We rewrite the sequences of a large protein data set, containing both folded and intrinsically disordered molecules, using a representation developed previously, which encodes the intrinsic dynamic properties of the naturally occurring amino acids. We Fourier analyze the resulting sequences. It is demonstrated that classification models built using several different supervised learning methods are able to successfully distinguish folded from intrinsically disordered proteins from sequence alone. It is further shown that the most important sequence property for this discrimination is the sequence mobility, which is the sequence averaged value of the residue-specific average alpha carbon B factor. This is in agreement with previous work, in which we have demonstrated the central role played by the sequence mobility in protein dynamic bioinformatics and biophysics. This finding opens a path to the application of dynamic bioinformatics, in combination with machine learning algorithms, to a range of significant biomedical problems.
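The encoding and Fourier step described in this abstract can be illustrated with a minimal sketch. The per-residue values in `HYPOTHETICAL_B` are placeholders invented for the example, not the published residue-specific average B factors, and the 1/N normalization of the Fourier coefficients is an assumption. Under that normalization the zero-wave-number coefficient equals the sequence average, which is how the abstract defines the sequence mobility.

```python
import cmath

# Placeholder per-residue dynamic index values. These are NOT the published
# average B-factor scale; they are hypothetical numbers so the example runs.
HYPOTHETICAL_B = {
    "A": 0.9, "G": 1.2, "L": 0.7, "K": 1.4,
    "E": 1.3, "S": 1.1, "V": 0.6, "P": 1.0,
}

def encode(sequence):
    """Rewrite a sequence as a numerical profile of per-residue values."""
    return [HYPOTHETICAL_B[res] for res in sequence]

def sequence_mobility(values):
    """Sequence average of the per-residue values."""
    return sum(values) / len(values)

def fourier_coefficients(values):
    """Discrete Fourier coefficients c_k = (1/N) * sum_n x_n * exp(-2*pi*i*k*n/N).

    The 1/N normalization is an assumption made for this sketch.
    """
    n = len(values)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * j / n) for j, x in enumerate(values)) / n
        for k in range(n)
    ]

profile = encode("GALKVESP")
mobility = sequence_mobility(profile)
coeffs = fourier_coefficients(profile)
```

In a classification setting, `mobility` and the magnitudes of the remaining coefficients would be candidate features for a supervised learner; the feature set actually used in the paper is not given in the abstract.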
Affiliation(s)
- David C Kombo
- Department of Medicinal Chemistry, Integrated Drug Discovery, Cambridge, Massachusetts, USA
- Matthew J LaMarche
- Department of Medicinal Chemistry, Integrated Drug Discovery, Cambridge, Massachusetts, USA
- Chilaluck C Konkankit
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
- S Rackovsky
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
2
Konkankit C, Rackovsky S. The dynamic basis of structural order in proteins. Proteins 2022; 90:1115-1118. [PMID: 34981860 PMCID: PMC9007817 DOI: 10.1002/prot.26296] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/25/2021] [Revised: 12/26/2021] [Accepted: 12/29/2021] [Indexed: 01/21/2023]
Abstract
We compare the sequences of folded and intrinsically disordered proteins (IDPs), using bioinformatic methods recently developed to study protein dynamic properties. We demonstrate that the two classes of sequences are organized in diametrically opposite ways with respect to long-length-scale dynamic properties. We further demonstrate a statistically significant difference between the amino acid compositions of folded and disordered proteins, which is expressed in dynamic properties. Our results indicate that the long-length-scale properties of sequences are critical in determining whether proteins are able to fold, and, more generally, that they are central to an understanding of protein physics. They further provide a physical basis for the empirically observed differences in amino acid composition between folded and IDPs.
Affiliation(s)
- Chilaluck Konkankit
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
- S Rackovsky
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
- Department of Biochemistry and Biophysics, School of Medicine and Dentistry, University of Rochester, Rochester, New York, USA
3
Scheraga H, Rackovsky S. Dynamic and conformational switching in proteins. Biopolymers 2021; 112:e23411. [PMID: 33270217 PMCID: PMC8172660 DOI: 10.1002/bip.23411] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/24/2020] [Revised: 11/13/2020] [Accepted: 11/18/2020] [Indexed: 11/06/2022]
Abstract
Using bioinformatic methods for treating protein dynamics, developed in earlier work, we study the relationship between sequence mobility and dynamics in proteins. It is shown that sequence mobility drives a transition between two dynamic regimes in proteins, and that the specific details of this transition differ qualitatively between α-helical proteins and those in other structural classes. We examine the possibility that conformational switching is related to dynamic switching, by considering a specific system of sequences which exhibit the switching phenomenon. It is shown that a relationship between dynamic and conformational switching is entirely plausible.
Affiliation(s)
- H.A. Scheraga
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853
- S. Rackovsky
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853
- Department of Biochemistry and Biophysics, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642
4
Abstract
We use a bioinformatic description of amino acid dynamic properties, based on residue-specific average B factors, to construct a dynamics-based, large-scale description of a space of protein sequences. We examine the relationship between that space and an independently constructed, structure-based space comprising the same sequences. It is demonstrated that structure and dynamics are only moderately correlated. It is further shown that helical proteins fall into two classes with very different structure-dynamics relationships. We suggest that dynamics in the two helical classes are dominated by distinctly different modes: pseudo-one-dimensional, localized helical modes in one case, and pseudo-three-dimensional (3D) global modes in the other. Sheet/barrel and mixed-α/β proteins exhibit more conventional structure-dynamics relationships. It is found that the strongest correlation between structure and dynamic properties arises when the latter are represented by the sequence average of the dynamic index, which corresponds physically to the overall mobility of the protein. None of these results are accessible to bioinformatic methods hitherto available.
5
Scheraga HA, Rackovsky S. Sequence-specific dynamic information in proteins. Proteins 2019; 87:799-804. [PMID: 31134683 DOI: 10.1002/prot.25747] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/03/2019] [Revised: 05/10/2019] [Accepted: 05/22/2019] [Indexed: 11/05/2022]
Abstract
We examine the local and global properties of the average B-factor, 〈B〉, as a residue-specific indicator of protein dynamic characteristics. It has been shown that values of 〈B〉 for the 20 amino acids differ in a statistically significant manner, and that, while strongly determined by the static physical properties of amino acids, they also encode averaged information about the influence of global fold on single-residue dynamics. Therefore, complete sequences of amino acids also encode fold-related global dynamic information, in addition to the local information that arises from static physical properties. We show that the relative magnitudes of these two contributions can be determined using Fourier methods, which represent the global properties of the sequences. It has also been shown that the behavior of Fourier components of 〈B〉 differs, with very high statistical significance, between structural groups, and that this information is not available from a comparable analysis of static amino acid properties.
Affiliation(s)
- H A Scheraga
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York
- S Rackovsky
- Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York
6
Beyond Supersecondary Structure: Physics-Based Sequence Alignment. Methods Mol Biol 2019. [PMID: 30945228 DOI: 10.1007/978-1-4939-9161-7_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Traditional approaches to sequence alignment are based on evolutionary ideas. As a result, they are prebiased toward results which are in accord with initial expectations. We present here a method of sequence alignment which is based entirely on the physical properties of the amino acids. This approach has no inherent bias, eliminates much of the computational complexity associated with methods currently in use, and has been shown to give good results for structures which were poorly predicted by traditional methods in recent CASP competitions and to identify sequence differences which correlate with structural and dynamic differences not detectable by traditional methods.
7
Koehl P, Orland H, Delarue M. Numerical Encodings of Amino Acids in Multivariate Gaussian Modeling of Protein Multiple Sequence Alignments. Molecules 2018; 24:E104. [PMID: 30597916 PMCID: PMC6337344 DOI: 10.3390/molecules24010104] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/02/2018] [Revised: 12/21/2018] [Accepted: 12/24/2018] [Indexed: 11/17/2022]
Abstract
Residues in proteins that are in close spatial proximity are more prone to covary, as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty-dimensional encoding with each amino acid represented by a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is lower than that obtained with the 20-dimensional binary encoding. We also highlight the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment.
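The 20-dimensional binary encoding mentioned in this abstract can be sketched as follows. The residue ordering in `AMINO_ACIDS` and the function names are illustrative assumptions, and the sketch ignores alignment gaps, which a real multiple sequence alignment would need to handle.

```python
# The 20 standard amino acids in one-letter code. The ordering is arbitrary
# and chosen here only for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(residue):
    """Return a 20-dimensional vector: 1 along the residue's own dimension, 0 elsewhere."""
    vec = [0] * len(AMINO_ACIDS)
    vec[INDEX[residue]] = 1
    return vec

def encode_sequence(sequence):
    """Encode a gap-free sequence as a list of one-hot vectors."""
    return [one_hot(res) for res in sequence]

encoded = encode_sequence("MKV")
```

Each row of `encoded` has exactly one nonzero entry, so a sequence of length L becomes an L x 20 binary array.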
Affiliation(s)
- Patrice Koehl
- Department of Computer Science, University of California, Davis, CA 95211, USA.
- Henri Orland
- Institut de Physique Théorique, CEA Saclay, 91191 Gif-sur-Yvette CEDEX, France.
- Marc Delarue
- Department of Structural Biology and Chemistry and UMR 3528 du CNRS, Institut Pasteur, 75015 Paris, France.
8
Rackovsky S. Nonlinearities in protein space limit the utility of informatics in protein biophysics. Proteins 2015; 83:1923-8. [PMID: 26315852 DOI: 10.1002/prot.24916] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 05/26/2015] [Revised: 08/12/2015] [Accepted: 08/20/2015] [Indexed: 11/08/2022]
Abstract
We examine the utility of informatic-based methods in computational protein biophysics. To do so, we use newly developed metric functions to define completely independent sequence and structure spaces for a large database of proteins. By investigating the relationship between these spaces, we demonstrate quantitatively the limits of knowledge-based correlation between the sequences and structures of proteins. It is shown that there are well-defined, nonlinear regions of protein space in which dissimilar structures map onto similar sequences (the conformational switch), and dissimilar sequences map onto similar structures (remote homology). These nonlinearities are shown to be quite common: almost half the proteins in our database fall into one or the other of these two regions. They are not anomalies, but rather intrinsic properties of structural encoding in amino acid sequences. It follows that extreme care must be exercised in using bioinformatic data as a basis for computational structure prediction. The implications of these results for protein evolution are examined.
Affiliation(s)
- S Rackovsky
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, 14853
- Department of Pharmacology and Systems Therapeutics, Icahn School of Medicine at Mount Sinai, New York, New York, 10029
9
Li J, Koehl P. 3D representations of amino acids-applications to protein sequence comparison and classification. Comput Struct Biotechnol J 2014; 11:47-58. [PMID: 25379143 PMCID: PMC4212284 DOI: 10.1016/j.csbj.2014.09.001] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
Abstract
The amino acid sequence of a protein is the key to understanding its structure and ultimately its function in the cell. This paper addresses the fundamental issue of encoding amino acids in ways that the representation of such a protein sequence facilitates the decoding of its information content. We show that a feature-based representation in a three-dimensional (3D) space derived from amino acid substitution matrices provides an adequate representation that can be used for direct comparison of protein sequences based on geometry. We measure the performance of such a representation in the context of the protein structural fold prediction problem. We compare the results of classifying different sets of proteins belonging to distinct structural folds against classifications of the same proteins obtained from sequence alone or directly from structural information. We find that sequence alone performs poorly as a structure classifier. We show in contrast that the use of the three dimensional representation of the sequences significantly improves the classification accuracy. We conclude with a discussion of the current limitations of such a representation and with a description of potential improvements.
Affiliation(s)
- Jie Li
- Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, United States
- Patrice Koehl
- Department of Computer Science and Genome Center, University of California, Davis, One Shields Ave, Davis, CA 95616, United States
10
Shi G, Vogel T, Wüst T, Li YW, Landau DP. Effect of single-site mutations on hydrophobic-polar lattice proteins. Phys Rev E Stat Nonlin Soft Matter Phys 2014; 90:033307. [PMID: 25314564 DOI: 10.1103/physreve.90.033307] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/10/2014] [Indexed: 06/04/2023]
Abstract
We developed a heuristic method for determining the ground-state degeneracy of hydrophobic-polar (HP) lattice proteins, based on Wang-Landau and multicanonical sampling. It is applied during comprehensive studies of single-site mutations in specific HP proteins with different sequences. The effects in which we are interested include structural changes in ground states, changes of ground-state energy, degeneracy, and thermodynamic properties of the system. With respect to mutations, both extremely sensitive and insensitive positions in the HP sequence have been found. That is, ground-state energies and degeneracies, as well as other thermodynamic and structural quantities, may be either largely unaffected or may change significantly due to mutation.
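For readers unfamiliar with the HP model, the sketch below evaluates the standard HP contact energy (minus one per pair of H residues that are lattice neighbors but not adjacent along the chain) for a fixed conformation, before and after a single-site mutation. It is only an illustration under stated assumptions: it uses a 2D square lattice, the function names are hypothetical, and it does not perform the Wang-Landau or multicanonical sampling the paper relies on to find ground states and their degeneracies.

```python
def hp_energy(sequence, coords):
    """Standard HP energy: -1 for each non-bonded H-H pair on adjacent lattice sites.

    `sequence` is a string of 'H' and 'P'; `coords` is a list of (x, y) lattice
    positions, one per residue, in chain order (2D lattice assumed here).
    """
    if len(sequence) != len(coords):
        raise ValueError("sequence and coords must have the same length")
    if len(set(coords)) != len(coords):
        raise ValueError("conformation must be self-avoiding")
    contacts = 0
    for i in range(len(sequence)):
        # Start at i + 2 so residues bonded along the chain are not counted.
        for j in range(i + 2, len(sequence)):
            if sequence[i] == "H" and sequence[j] == "H":
                (xi, yi), (xj, yj) = coords[i], coords[j]
                if abs(xi - xj) + abs(yi - yj) == 1:
                    contacts += 1
    return -contacts

def mutate(sequence, position):
    """Single-site mutation: flip H to P or P to H at the given index."""
    flipped = "P" if sequence[position] == "H" else "H"
    return sequence[:position] + flipped + sequence[position + 1:]

# A four-residue chain folded around a unit square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
wild_type = "HPPH"
mutant = mutate(wild_type, 3)

e_wild = hp_energy(wild_type, square)
e_mutant = hp_energy(mutant, square)
```

Here the mutation removes the only H-H contact, so the energy of this particular conformation rises; whether the ground state itself changes would require sampling over all conformations, which this sketch does not attempt.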
Affiliation(s)
- Guangjie Shi
- Center for Simulational Physics, The University of Georgia, Athens, Georgia 30602, USA
- Thomas Vogel
- Theoretical Division (T-1), Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Thomas Wüst
- Scientific IT Services, ETH Zürich IT Services, 8092 Zürich, Switzerland
- Ying Wai Li
- National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, USA
- David P Landau
- Center for Simulational Physics, The University of Georgia, Athens, Georgia 30602, USA
11
Homolog detection using global sequence properties suggests an alternate view of structural encoding in protein sequences. Proc Natl Acad Sci U S A 2014; 111:5225-9. [PMID: 24706836 DOI: 10.1073/pnas.1403599111] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Indexed: 11/18/2022]
Abstract
We show that a Fourier-based sequence distance function is able to identify structural homologs of target sequences with high accuracy. It is shown that Fourier distances correlate very strongly with independently determined structural distances between molecules, a property of the method that is not attainable using conventional representations. It is further shown that the ability of the Fourier approach to identify protein folds is statistically far in excess of random expectation. It is then shown that, in actual searches for structural homologs of selected target sequences, the Fourier approach gives excellent results. On the basis of these results, we suggest that the global information detected by the Fourier representation is an essential feature of structure encoding in protein sequences and a key to structural homology detection.
12
Skubatz H, Howald WN. Two global conformation states of a novel NAD(P) reductase like protein of the thermogenic appendix of the Sauromatum guttatum inflorescence. Protein J 2014; 32:399-410. [PMID: 23794126 DOI: 10.1007/s10930-013-9497-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Abstract
A novel NAD(P) reductase like protein (RL) belonging to a class of reductases involved in phenylpropanoid synthesis was previously purified to homogeneity from the Sauromatum guttatum appendix. The Sauromatum appendix raises its temperature above ambient temperature to ~30 °C on the day of inflorescence opening (D-day). Changes in the charge state distribution of the protein in electrospray ionization-mass spectrometry spectra were observed during the development of the appendix. RL adopted two conformations, state A (an extended state) that appeared before heat-production (D - 4 to D - 2), and state B (a compact state) that began appearing on D - 1 and reached a maximum on D-day. RL in healthy leaves of Arabidopsis is present in state A, whereas in thermogenic sporophylls of male cones of Encephalartos ferox it is present in state B. These conformational changes strongly suggest an involvement of RL in heat-production. The biophysical properties of this protein are remarkable. It self-assembles in aqueous solutions into micrometer-sized organized morphologies. The assembly produces a broad range of cyclic and linear morphologies that resemble micelles, rods, lamellar micelles, as well as vesicles. The assemblies can also form network structures. RL molecules entangle with each other and form branched, interconnected networks. These unusual assemblies suggest that RL is an oligomer, and its oligomerization can provide additional information needed for thermoregulation. We hypothesize that state A controls the plant basal temperature and state B allows a shift in the temperature set point to above ambient temperature.
13
Rackovsky S. Sequence determinants of protein architecture. Proteins 2013; 81:1681-5. [PMID: 23720385 DOI: 10.1002/prot.24328] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 02/11/2013] [Revised: 04/28/2013] [Accepted: 05/09/2013] [Indexed: 11/07/2022]
Abstract
Delineation of the relationship between sequence and structure in proteins has proven elusive. Most studies of this problem use alignment methods and other approaches based on the characteristics of individual residues. It is demonstrated herein that the sequence-structure relationship is determined in significant part by global characteristics of sequence organization. Information encoded in complete sequences is required to distinguish proteins in different architectural groups. It is found that the statistically significant differences between sequences encoding different architectures are encoded in a surprisingly small set of low-wave-number sequence periodicities. It would therefore appear that unexpected simplicity in an appropriately defined Fourier space may be an inherent characteristic of the sequences of folded proteins.
Affiliation(s)
- S Rackovsky
- Department of Chemistry and Chemical Biology, Cornell University, Ithaca, New York, 14853
14
Hansen N, Allison JR, Hodel FH, van Gunsteren WF. Relative free enthalpies for point mutations in two proteins with highly similar sequences but different folds. Biochemistry 2013; 52:4962-70. [PMID: 23802564 DOI: 10.1021/bi400272q] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
Abstract
Enveloping distribution sampling was used to calculate free-enthalpy changes associated with single amino acid mutations for a pair of proteins, GA95 and GB95, that show 95% sequence identity yet fold into topologically different structures. Of the L → A, I → F, and L → Y mutations at positions 20, 30, and 45, respectively, of the 56-residue sequence, the first and the last contribute the most to the free-enthalpy difference between the native and non-native sequence-structure combinations, in agreement with the experimental findings for this protein pair. The individual free-enthalpy changes are almost sequence-independent in the four-strand/one-helix structure, the stable form of GB95, while in the three-helix bundle structure, the stable form of GA95, an interplay between residues 20 and 45 is observed.
Affiliation(s)
- Niels Hansen
- Laboratory of Physical Chemistry, Swiss Federal Institute of Technology, ETH, CH-8093 Zürich, Switzerland
15
Abstract
Analysis of the global properties of protein sequences, rather than single-site or local properties, has been shown to lead to new understanding of folding and function. Here we describe the use of software which can describe sequences numerically in an orthonormal fashion, Fourier-analyze those sequences, and verify the statistical significance of the resulting Fourier coefficients. The resulting parameters can be used to study problems involving sequences from a unique perspective.
16
Gridnev DK, Ojeda-May P, Garcia ME. Selecting fast-folding proteins by their rate of convergence. Phys Rev E Stat Nonlin Soft Matter Phys 2013; 87:012714. [PMID: 23410366 DOI: 10.1103/physreve.87.012714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2012] [Revised: 11/26/2012] [Indexed: 06/01/2023]
Abstract
We propose a general method for predicting potentially good folders from a given number of amino acid sequences. Our approach is based on the calculation of the rate of convergence of each amino acid chain towards the native structure using only the very initial parts of the dynamical trajectories. It does not require any preliminary knowledge of the native state and can be applied to different kinds of models, including atomistic descriptions. We tested the method within both the lattice and off-lattice model frameworks and obtained several previously unknown good folders. The unbiased algorithm also allows one to determine the optimal folding temperature and takes at least 3-4 orders of magnitude fewer time steps than those needed to compute folding times.
17
Holzgräfe C, Irbäck A, Troein C. Mutation-induced fold switching among lattice proteins. J Chem Phys 2011; 135:195101. [DOI: 10.1063/1.3660691] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Indexed: 11/14/2022]