1
Kombo DC, LaMarche MJ, Konkankit CC, Rackovsky S. Application of artificial intelligence and machine learning techniques to the analysis of dynamic protein sequences. Proteins 2024. [PMID: 38808365 DOI: 10.1002/prot.26704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 05/07/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
We apply methods of Artificial Intelligence and Machine Learning to protein dynamic bioinformatics. We rewrite the sequences of a large protein data set, containing both folded and intrinsically disordered molecules, using a representation developed previously, which encodes the intrinsic dynamic properties of the naturally occurring amino acids. We Fourier analyze the resulting sequences. It is demonstrated that classification models built using several different supervised learning methods are able to successfully distinguish folded from intrinsically disordered proteins from sequence alone. It is further shown that the most important sequence property for this discrimination is the sequence mobility, which is the sequence averaged value of the residue-specific average alpha carbon B factor. This is in agreement with previous work, in which we have demonstrated the central role played by the sequence mobility in protein dynamic bioinformatics and biophysics. This finding opens a path to the application of dynamic bioinformatics, in combination with machine learning algorithms, to a range of significant biomedical problems.
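The two sequence-level quantities named in this abstract, the sequence mobility and the Fourier analysis of a numerically encoded sequence, can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the per-residue scale `PLACEHOLDER_MOBILITY` holds arbitrary numbers, not the residue-specific average alpha carbon B factors used by the authors, and the function names are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# PLACEHOLDER scale: arbitrary values (0, 1, 2, ...), NOT real B-factor data.
# A real analysis would substitute the residue-specific average
# alpha-carbon B factors referred to in the abstract.
PLACEHOLDER_MOBILITY = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, scale=PLACEHOLDER_MOBILITY):
    """Rewrite an amino acid string as a numeric sequence."""
    return np.array([scale[aa] for aa in sequence], dtype=float)


def sequence_mobility(sequence, scale=PLACEHOLDER_MOBILITY):
    """Sequence-averaged value of the per-residue scale."""
    return float(encode(sequence, scale).mean())


def power_spectrum(sequence, scale=PLACEHOLDER_MOBILITY):
    """Squared magnitude of the normalized discrete Fourier coefficients."""
    x = encode(sequence, scale)
    coeffs = np.fft.rfft(x) / len(x)
    return np.abs(coeffs) ** 2
```

With this normalization the zero-frequency coefficient equals the sequence average, so `power_spectrum(seq)[0]` is the square of `sequence_mobility(seq)`. Features of this kind could then be passed to any supervised classifier; the specific models used in the paper are not reproduced here.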
Affiliation(s)
- David C Kombo
  - Department of Medicinal Chemistry, Integrated Drug Discovery, Cambridge, Massachusetts, USA
- Matthew J LaMarche
  - Department of Medicinal Chemistry, Integrated Drug Discovery, Cambridge, Massachusetts, USA
- Chilaluck C Konkankit
  - Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
- S Rackovsky
  - Department of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, New York, USA
2
Ruan B, He Y, Chen Y, Choi EJ, Chen Y, Motabar D, Solomon T, Simmerman R, Kauffman T, Gallagher DT, Orban J, Bryan PN. Design and characterization of a protein fold switching network. Nat Commun 2023; 14:431. [PMID: 36702827 PMCID: PMC9879998 DOI: 10.1038/s41467-023-36065-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 07/14/2022] [Accepted: 01/13/2023] [Indexed: 01/27/2023] Open
Abstract
To better understand how amino acid sequence encodes protein structure, we engineered mutational pathways that connect three common folds (3α, β-grasp, and α/β-plait). The structures of proteins at high sequence-identity intersections in the pathways (nodes) were determined using NMR spectroscopy and analyzed for stability and function. To generate nodes, the amino acid sequence encoding a smaller fold is embedded in the structure of an ~50% larger fold and a new sequence compatible with two sets of native interactions is designed. This generates protein pairs with a 3α or β-grasp fold in the smaller form but an α/β-plait fold in the larger form. Further, embedding smaller antagonistic folds creates critical states in the larger folds such that single amino acid substitutions can switch both their fold and function. The results help explain the underlying ambiguity in the protein folding code and show that new protein structures can evolve via abrupt fold switching.
Affiliation(s)
- Biao Ruan
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
- Yanan He
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
- Yingwei Chen
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
- Eun Jung Choi
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
- Yihong Chen
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
- Dana Motabar
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
  - Department of Bioengineering, University of Maryland, College Park, MD, 20742, USA
- Tsega Solomon
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
  - Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, 20742, USA
- Richard Simmerman
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
- Thomas Kauffman
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
  - Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, 20742, USA
- D Travis Gallagher
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
  - National Institute of Standards and Technology and the University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
- John Orban
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
  - Department of Chemistry and Biochemistry, University of Maryland, College Park, MD, 20742, USA
- Philip N Bryan
  - Potomac Affinity Proteins, 11305 Dunleith Pl, North Potomac, MD, 20878, USA
  - Institute for Bioscience and Biotechnology Research, University of Maryland, 9600 Gudelsky Drive, Rockville, MD, 20850, USA
3
Solis AD. Reduced alphabet of prebiotic amino acids optimally encodes the conformational space of diverse extant protein folds. BMC Evol Biol 2019; 19:158. [PMID: 31362700 PMCID: PMC6668081 DOI: 10.1186/s12862-019-1464-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/12/2018] [Accepted: 06/19/2019] [Indexed: 11/10/2022] Open
Abstract
Background: There is wide agreement that only a subset of the twenty standard amino acids existed prebiotically in sufficient concentrations to form functional polypeptides. We ask how this subset, postulated as {A,D,E,G,I,L,P,S,T,V}, could have formed structures stable enough to found metabolic pathways. Inspired by alphabet reduction experiments, we undertook a computational analysis to measure the structural coding behavior of sequences simplified by reduced alphabets. We sought to discern characteristics of the prebiotic set that would endow it with unique properties relevant to structure, stability, and folding.
Results: Drawing on a large dataset of single-domain proteins, we employed an information-theoretic measure to assess how well the prebiotic amino acid set preserves fold information against all other possible ten-amino acid sets. An extensive virtual mutagenesis procedure revealed that the prebiotic set excellently preserves sequence-dependent information regarding both backbone conformation and tertiary contact matrix of proteins. We observed that information retention is fold-class dependent: the prebiotic set sufficiently encodes the structure space of α/β and α + β folds, and to a lesser extent, of all-α and all-β folds. The prebiotic set appeared insufficient to encode the small proteins. Assessing how well the prebiotic set discriminates native vs. incorrect sequence-structure matches, we found that α/β and α + β folds exhibit more pronounced energy gaps with the prebiotic set than with nearly all alternatives.
Conclusions: The prebiotic set optimally encodes local backbone structures that appear in the folded environment and near-optimally encodes the tertiary contact matrix of extant proteins. The fold-class-specific patterns observed from our structural analysis confirm the postulated timeline of fold appearance in proteogenesis derived from proteomic sequence analyses. Polypeptides arising in a prebiotic environment will likely form α/β and α + β-like folds if any at all. We infer that the progressive expansion of the alphabet allowed the increased conformational stability and functional specificity of later folds, including all-α, all-β, and small proteins. Our results suggest that prebiotic sequences are amenable to mutations that significantly lower native conformational energies and increase discrimination amidst incorrect folds. This property may have assisted the genesis of functional proto-enzymes prior to the expansion of the full amino acid alphabet.
Affiliation(s)
- Armando D Solis
  - Biological Sciences Department, New York City College of Technology (City Tech), The City University of New York (CUNY), 285 Jay Street, Brooklyn, NY, 11201, USA
4
Beyond Supersecondary Structure: Physics-Based Sequence Alignment. Methods Mol Biol 2019. [PMID: 30945228 DOI: 10.1007/978-1-4939-9161-7_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Traditional approaches to sequence alignment are based on evolutionary ideas. As a result, they are prebiased toward results which are in accord with initial expectations. We present here a method of sequence alignment which is based entirely on the physical properties of the amino acids. This approach has no inherent bias, eliminates much of the computational complexity associated with methods currently in use, and has been shown to give good results for structures which were poorly predicted by traditional methods in recent CASP competitions and to identify sequence differences which correlate with structural and dynamic differences not detectable by traditional methods.
5
Herman JL. Enhancing Statistical Multiple Sequence Alignment and Tree Inference Using Structural Information. Methods Mol Biol 2019; 1851:183-214. [PMID: 30298398 DOI: 10.1007/978-1-4939-8736-8_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/08/2023]
Abstract
For highly divergent sequences, there is often insufficient information to reliably construct alignments and phylogenetic trees. Since protein structure may be strongly conserved despite large divergences in sequence, structural information can be used to help identify homology in such cases. While there exist well-studied models of sequence evolution, structurally informed alignment methods have typically made use of geometric measures of deviation that do not take into account the underlying mutational processes. In order to integrate structural information into sequence-based evolutionary models, we recently developed a stochastic model of structural evolution on a phylogenetic tree and implemented this as the StructAlign plugin for the StatAlign statistical alignment package. In this chapter, we will outline the types of analyses that can be carried out using StructAlign, illustrating how the inclusion of structural information can be used to inform joint estimation of alignments and trees. StructAlign can also be used to infer branch-specific rates of structural evolution, and analysis of an example globin dataset highlights strong variation in the inferred rate across the tree. While structure is more highly conserved within clades, the rate of structural divergence as a function of sequence variation is larger between functionally divergent proteins. Allowing for the rate of structural divergence to vary over the tree results in an improved fit to the empirically observed pairwise RMSD values.
Affiliation(s)
- Joseph L Herman
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
6
Chu H, Liu H. TetraBASE: A Side Chain-Independent Statistical Energy for Designing Realistically Packed Protein Backbones. J Chem Inf Model 2018; 58:430-442. [PMID: 29314837 DOI: 10.1021/acs.jcim.7b00677] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
To construct backbone structures of high designability is a primary aspect of computational protein design. We report here a side chain-independent statistical energy that aims at realistic modeling of through-space packing of polypeptide backbones. To mitigate the lack of explicit amino acid side chains, the model treats the interbackbone site packing as being dependent on peptide local conformation. In addition, new variables suitable for statistical analysis, one for relative orientation and another for distance, have been introduced to represent the intersite geometry based on the asymmetrical tetrahedron organization of distinct chemical groups surrounding the Cα-carbon atoms. The resulting tetrahedron-based backbone statistical energy (tetraBASE) model has been used to optimize the tertiary organizations of secondary structure elements (SSEs) of designated types with Monte Carlo simulated annealing, starting from artificial initial configurations. The tetraBASE minimum energy structures can reproduce SSE packing frequently observed in native proteins with atomic root-mean-square deviations of 1-2 Å. The model has also been tested by examining the stability of native SSE arrangements under tetraBASE. The results suggest that the tetraBASE model can be used to effectively represent interbackbone packing when designing backbone structures without explicitly knowing side chain types.
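The optimization technique mentioned in this abstract, Monte Carlo simulated annealing, can be sketched generically as follows. This is a minimal illustration under stated assumptions, not the tetraBASE protocol: the energy function is a toy one-dimensional quadratic rather than a backbone statistical energy, and the geometric cooling schedule, the move set, and all names (`simulated_annealing`, `toy_energy`, `toy_propose`) are hypothetical choices.

```python
import math
import random


def simulated_annealing(energy, propose, state,
                        t_start=1.0, t_end=1e-3, steps=5000, seed=0):
    """Metropolis Monte Carlo with a geometric cooling schedule (assumed)."""
    rng = random.Random(seed)
    current, e_current = state, energy(state)
    best, e_best = current, e_current
    for k in range(steps):
        # Temperature decays geometrically from t_start to t_end.
        t = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        candidate = propose(current, rng)
        e_candidate = energy(candidate)
        delta = e_candidate - e_current
        # Metropolis criterion: always accept downhill moves, accept
        # uphill moves with probability exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
    return best, e_best


# Toy stand-ins for an energy model and a move set (illustrative only).
def toy_energy(x):
    return (x - 2.0) ** 2


def toy_propose(x, rng):
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = simulated_annealing(toy_energy, toy_propose, state=10.0)
```

In a backbone design setting the state would be an arrangement of secondary structure elements and the proposal a rigid-body perturbation, but those details are not given in the abstract and are not modeled here.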
Affiliation(s)
- Huanyu Chu
  - School of Life Sciences, University of Science and Technology of China, 230027 Hefei, Anhui, China
  - Hefei National Laboratory for Physical Sciences at the Microscales, 230027 Hefei, Anhui, China
- Haiyan Liu
  - School of Life Sciences, University of Science and Technology of China, 230027 Hefei, Anhui, China
  - Hefei National Laboratory for Physical Sciences at the Microscales, 230027 Hefei, Anhui, China
  - Collaborative Innovation Center of Chemistry for Life Sciences, 230027 Hefei, Anhui, China
7
Monzon AM, Zea DJ, Marino-Buslje C, Parisi G. Homology modeling in a dynamical world. Protein Sci 2017; 26:2195-2206. [PMID: 28815769 DOI: 10.1002/pro.3274] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 06/06/2017] [Revised: 08/09/2017] [Accepted: 08/09/2017] [Indexed: 12/31/2022]
Abstract
A key concept in template-based modeling (TBM) is the high correlation between sequence and structural divergence, with the practical consequence that homologous proteins that are similar at the sequence level will also be similar at the structural level. However, conformational diversity of the native state will reduce the correlation between structural and sequence divergence, because structural variation can appear without sequence diversity. In this work, we explore the impact that conformational diversity has on the relationship between structural and sequence divergence. We find that the extent of conformational diversity can be as high as the maximum structural divergence among families. Also, as expected, conformational diversity impairs the well-established correlation between sequence and structural divergence, which is noisier than previously suggested. However, we found that this noise can be resolved using a priori information coming from the structure-function relationship. We show that protein families with low conformational diversity show a well-correlated relationship between sequence and structural divergence, which is severely reduced in proteins with larger conformational diversity. This lack of correlation could impair TBM results in highly dynamical proteins. Finally, we also find that the presence of order/disorder can provide useful prior information for better TBM performance.
Affiliation(s)
- Alexander Miguel Monzon
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, B1876BXD, Bernal, Argentina
- Diego Javier Zea
  - Structural Bioinformatics Unit, Fundación Instituto Leloir, CONICET, C1405BWE Ciudad Autónoma de Buenos Aires, Argentina
- Cristina Marino-Buslje
  - Structural Bioinformatics Unit, Fundación Instituto Leloir, CONICET, C1405BWE Ciudad Autónoma de Buenos Aires, Argentina
- Gustavo Parisi
  - Departamento de Ciencia y Tecnología, Universidad Nacional de Quilmes, CONICET, B1876BXD, Bernal, Argentina
8
Global informatics and physical property selection in protein sequences. Proc Natl Acad Sci U S A 2016; 113:1808-10. [PMID: 26831093 DOI: 10.1073/pnas.1525745113] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022] Open
Abstract
The degree of informatic independence between the physical properties of amino acids as encoded in actual protein sequences is calculated. It is shown that no physical property can be identified that carries significantly less information than others and that the information overlap between different properties and different length scales along the sequence is essentially zero. These observations suggest that bioinformatic models based on arbitrarily selected sets of physical properties are inherently deficient.
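One common way to quantify the information overlap between two numerically encoded properties is the mutual information of their joint distribution. The sketch below is a plug-in histogram estimate, offered only as an illustration of the general idea; the abstract does not state which estimator the authors used, and the function name and the binning choice are assumptions.

```python
import numpy as np


def mutual_information_bits(x, y, bins=4):
    """Plug-in estimate of I(X;Y) in bits from paired samples.

    x and y are equal-length numeric sequences, for example two different
    property encodings evaluated along the same protein sequences.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    independent = p_x @ p_y                  # product of marginals
    nonzero = p_xy > 0                       # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[nonzero]
                        * np.log2(p_xy[nonzero] / independent[nonzero])))
```

A value near zero indicates that knowing one property says little about the other, which is the sense of "essentially zero" overlap in the abstract. Histogram estimates are biased upward for small samples, so real data would call for a bias correction or a shuffled-sequence baseline.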