1
|
Anton B, Besalú M, Fornes O, Bonet J, Molina A, Molina-Fernandez R, De Las Cuevas G, Fernandez-Fuentes N, Oliva B. On the use of direct-coupling analysis with a reduced alphabet of amino acids combined with super-secondary structure motifs for protein fold prediction. NAR Genom Bioinform 2021; 3:lqab027. [PMID: 33937764 PMCID: PMC8061457 DOI: 10.1093/nargab/lqab027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2020] [Revised: 02/27/2021] [Accepted: 03/26/2021] [Indexed: 11/12/2022] Open
Abstract
Direct-coupling analysis (DCA) for studying the coevolution of residues in proteins has been widely used to predict the three-dimensional structure of a protein from its sequence. We present RADI/raDIMod, a variation of the original DCA algorithm that groups chemically equivalent residues combined with super-secondary structure motifs to model protein structures. Interestingly, the simplification produced by grouping amino acids into only two groups (polar and non-polar) is still representative of the physicochemical nature that characterizes the protein structure and it is in line with the role of hydrophobic forces in protein-folding funneling. As a result of a compressed alphabet, the number of sequences required for the multiple sequence alignment is reduced. The number of long-range contacts predicted is limited; therefore, our approach requires the use of neighboring sequence-positions. We use the prediction of secondary structure and motifs of super-secondary structures to predict local contacts. We use RADI and raDIMod, a fragment-based protein structure modelling, achieving near native conformations when the number of super-secondary motifs covers >30-50% of the sequence. Interestingly, although different contacts are predicted with different alphabets, they produce similar structures.
Collapse
Affiliation(s)
- Bernat Anton
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Mireia Besalú
- Departament de Genètica, Microbiologia i Estadística, Universitat de Barcelona, Barcelona 08028, Catalonia, Spain
| | - Oriol Fornes
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Jaume Bonet
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Alexis Molina
- Electronic and Atomic Protein Modeling, Life Sciences, Barcelona Supercomputing Center, Barcelona 08034, Catalonia, Spain
| | - Ruben Molina-Fernandez
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| | - Gemma De Las Cuevas
- Institut für Theoritische Physik, School of Mathematics, Computer Science and Physics, Universität Innsbruck. A-6020 Innsbruck, Austria
| | - Narcis Fernandez-Fuentes
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, SY233EB Aberystwyth, United Kingdom
| | - Baldo Oliva
- Structural Bioinformatics Lab (GRIB-IMIM), Department of Experimental and Health Science, University Pompeu Fabra, Barcelona 08005, Catalonia, Spain
| |
Collapse
|
2
|
Shrestha R, Fajardo E, Gil N, Fidelis K, Kryshtafovych A, Monastyrskyy B, Fiser A. Assessing the accuracy of contact predictions in CASP13. Proteins 2019; 87:1058-1068. [PMID: 31587357 PMCID: PMC6851495 DOI: 10.1002/prot.25819] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2019] [Revised: 09/17/2019] [Accepted: 09/17/2019] [Indexed: 01/07/2023]
Abstract
The accuracy of sequence-based tertiary contact predictions was assessed in a blind prediction experiment at the CASP13 meeting. After 4 years of significant improvements in prediction accuracy, another dramatic advance has taken place since CASP12 was held 2 years ago. The precision of predicting the top L/5 contacts in the free modeling category, where L is the corresponding length of the protein in residues, has exceeded 70%. As a comparison, the best-performing group at CASP12 with a 47% precision would have finished below the top 1/3 of the CASP13 groups. Extensively trained deep neural network approaches dominate the top performing algorithms, which appear to efficiently integrate information on coevolving residues and interacting fragments or possibly utilize memories of sequence similarities and sometimes can deliver accurate results even in the absence of virtually any target specific evolutionary information. If the current performance is evaluated by F-score on L contacts, it stands around 24% right now, which, despite the tremendous impact and advance in improving its utility for structure modeling, also suggests that there is much room left for further improvement.
Collapse
Affiliation(s)
- Rojan Shrestha
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Eduardo Fajardo
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Nelson Gil
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| | - Krzysztof Fidelis
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Andriy Kryshtafovych
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Bohdan Monastyrskyy
- Genome Center, University of California, Davis, 451 Health Sciences Dr., Davis CA 95616-8816, USA
| | - Andras Fiser
- Department of Systems and Computational Biology, and Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Avenue, Bronx, NY 10461, USA
| |
Collapse
|
3
|
Mackenzie CO, Grigoryan G. Protein structural motifs in prediction and design. Curr Opin Struct Biol 2017; 44:161-167. [PMID: 28460216 PMCID: PMC5513761 DOI: 10.1016/j.sbi.2017.03.012] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2016] [Revised: 03/18/2017] [Accepted: 03/28/2017] [Indexed: 01/11/2023]
Abstract
The Protein Data Bank (PDB) has been an integral resource for shaping our fundamental understanding of protein structure and for the advancement of such applications as protein design and structure prediction. Over the years, information from the PDB has been used to generate models ranging from specific structural mechanisms to general statistical potentials. With accumulating structural data, it has become possible to mine for more complete and complex structural observations, deducing more accurate generalizations. Motif libraries, which capture recurring structural features along with their sequence preferences, have exposed modularity in the structural universe and found successful application in various problems of structural biology. Here we summarize recent achievements in this arena, focusing on subdomain level structural patterns and their applications to protein design and structure prediction, and suggest promising future directions as the structural database continues to grow.
Collapse
Affiliation(s)
- Craig O Mackenzie
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, United States
| | - Gevorg Grigoryan
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, United States; Department of Computer Science, Dartmouth College, Hanover, NH 03755, United States.
| |
Collapse
|
4
|
Tang K, Zhang J, Liang J. Distance-Guided Forward and Backward Chain-Growth Monte Carlo Method for Conformational Sampling and Structural Prediction of Antibody CDR-H3 Loops. J Chem Theory Comput 2016; 13:380-388. [PMID: 27996262 DOI: 10.1021/acs.jctc.6b00845] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Antibodies recognize antigens through the complementary determining regions (CDR) formed by six-loop hypervariable regions crucial for the diversity of antigen specificities. Among the six CDR loops, the H3 loop is the most challenging to predict because of its much higher variation in sequence length and identity, resulting in much larger and complex structural space, compared to the other five loops. We developed a novel method based on a chain-growth sequential Monte Carlo method, called distance-guided sequential chain-growth Monte Carlo for H3 loops (DiSGro-H3). The new method samples protein chains in both forward and backward directions. It can efficiently generate low energy, near-native H3 loop structures using the conformation types predicted from the sequences of H3 loops. DiSGro-H3 performs significantly better than another ab initio method, RosettaAntibody, in both sampling and prediction, while taking less computational time. It performs comparably to template-based methods. As an ab initio method, DiSGro-H3 offers satisfactory accuracy while being able to predict any H3 loops without templates.
Collapse
Affiliation(s)
- Ke Tang
- Department of Bioengineering, University of Illinois at Chicago , Chicago, Illinois 60607, United States
| | - Jinfeng Zhang
- Department of Statistics, Florida State University , Tallahassee, Florida 32306, United States
| | - Jie Liang
- Department of Bioengineering, University of Illinois at Chicago , Chicago, Illinois 60607, United States
| |
Collapse
|
5
|
Berezovsky IN, Guarnera E, Zheng Z, Eisenhaber B, Eisenhaber F. Protein function machinery: from basic structural units to modulation of activity. Curr Opin Struct Biol 2016; 42:67-74. [PMID: 27865209 DOI: 10.1016/j.sbi.2016.10.021] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2016] [Revised: 09/26/2016] [Accepted: 10/31/2016] [Indexed: 11/29/2022]
Abstract
Contemporary protein structure is a result of the trade off between the laws of physics and the evolutionary selection. The polymer nature of proteins played a decisive role in establishing the basic structural and functional units of soluble proteins. We discuss how these elementary building blocks work in the hierarchy of protein domain structure, co-translational folding, as well as in enzymatic activity and molecular interactions. Next, we consider modulators of the protein function, such as intermolecular interactions, disorder-to-order transitions, and allosteric signaling, acting via interference with the protein's structural dynamics. We also discuss the post-translational modifications, which is a complementary intricate mechanism evolved for regulation of protein functions and interactions. In conclusion, we assess an anticipated contribution of discussed topics to the future advancements in the field.
Collapse
Affiliation(s)
- Igor N Berezovsky
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, Singapore 117579, Singapore.
| | - Enrico Guarnera
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
| | - Zejun Zheng
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
| | - Birgit Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore
| | - Frank Eisenhaber
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, Singapore 138671, Singapore; School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, Singapore 637553, Singapore
| |
Collapse
|
6
|
Abstract
Here, we systematically decompose the known protein structural universe into its basic elements, which we dub tertiary structural motifs (TERMs). A TERM is a compact backbone fragment that captures the secondary, tertiary, and quaternary environments around a given residue, comprising one or more disjoint segments (three on average). We seek the set of universal TERMs that capture all structure in the Protein Data Bank (PDB), finding remarkable degeneracy. Only ∼600 TERMs are sufficient to describe 50% of the PDB at sub-Angstrom resolution. However, more rare geometries also exist, and the overall structural coverage grows logarithmically with the number of TERMs. We go on to show that universal TERMs provide an effective mapping between sequence and structure. We demonstrate that TERM-based statistics alone are sufficient to recapitulate close-to-native sequences given either NMR or X-ray backbones. Furthermore, sequence variability predicted from TERM data agrees closely with evolutionary variation. Finally, locations of TERMs in protein chains can be predicted from sequence alone based on sequence signatures emergent from TERM instances in the PDB. For multisegment motifs, this method identifies spatially adjacent fragments that are not contiguous in sequence-a major bottleneck in structure prediction. Although all TERMs recur in diverse proteins, some appear specialized for certain functions, such as interface formation, metal coordination, or even water binding. Structural biology has benefited greatly from previously observed degeneracies in structure. The decomposition of the known structural universe into a finite set of compact TERMs offers exciting opportunities toward better understanding, design, and prediction of protein structure.
Collapse
|
7
|
Berezovsky IN, Guarnera E, Zheng Z. Basic units of protein structure, folding, and function. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2016; 128:85-99. [PMID: 27697476 DOI: 10.1016/j.pbiomolbio.2016.09.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/29/2016] [Revised: 09/05/2016] [Accepted: 09/26/2016] [Indexed: 10/20/2022]
Abstract
Study of the hierarchy of domain structure with alternative sets of domains and analysis of discontinuous domains, consisting of remote segments of the polypeptide chain, raised a question about the minimal structural unit of the protein domain. The hypothesis on the decisive role of the polypeptide backbone in determining the elementary units of globular proteins have led to the discovery of closed loops. It is reviewed here how closed loops form the loop-n-lock structure of proteins, providing the foundation for stability and designability of protein folds/domain and underlying their co-translational folding. Simplified protein sequences are considered here with the aim to explore the basic principles that presumably dominated the folding and stability of proteins in the early stages of structural evolution. Elementary functional loops (EFLs), closed loops with one or few catalytic residues, are, in turn, units of the protein function. They are apparent descendants of the prebiotic ring-like peptides, which gave rise to the first functional folds/domains being fused in the beginning of the evolution of protein structure. It is also shown how evolutionary relations between protein functional superfamilies and folds delineated with the help of EFLs can contribute to establishing the rules for design of desired enzymatic functions. Generalized descriptors of the elementary functions are proposed to be used as basic units in the future computational design.
Collapse
Affiliation(s)
- Igor N Berezovsky
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore; Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive, 117579, Singapore.
| | - Enrico Guarnera
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
| | - Zejun Zheng
- Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
| |
Collapse
|
8
|
Vallat B, Madrid-Aliste C, Fiser A. Modularity of Protein Folds as a Tool for Template-Free Modeling of Structures. PLoS Comput Biol 2015; 11:e1004419. [PMID: 26252221 PMCID: PMC4529212 DOI: 10.1371/journal.pcbi.1004419] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2015] [Accepted: 06/30/2015] [Indexed: 12/25/2022] Open
Abstract
Predicting the three-dimensional structure of proteins from their amino acid sequences remains a challenging problem in molecular biology. While the current structural coverage of proteins is almost exclusively provided by template-based techniques, the modeling of the rest of the protein sequences increasingly require template-free methods. However, template-free modeling methods are much less reliable and are usually applicable for smaller proteins, leaving much space for improvement. We present here a novel computational method that uses a library of supersecondary structure fragments, known as Smotifs, to model protein structures. The library of Smotifs has saturated over time, providing a theoretical foundation for efficient modeling. The method relies on weak sequence signals from remotely related protein structures to create a library of Smotif fragments specific to the target protein sequence. This Smotif library is exploited in a fragment assembly protocol to sample decoys, which are assessed by a composite scoring function. Since the Smotif fragments are larger in size compared to the ones used in other fragment-based methods, the proposed modeling algorithm, SmotifTF, can employ an exhaustive sampling during decoy assembly. SmotifTF successfully predicts the overall fold of the target proteins in about 50% of the test cases and performs competitively when compared to other state of the art prediction methods, especially when sequence signal to remote homologs is diminishing. Smotif-based modeling is complementary to current prediction methods and provides a promising direction in addressing the structure prediction problem, especially when targeting larger proteins for modeling. Each protein folds into a unique three-dimensional structure that enables it to carry out its biological function. Knowledge of the atomic details of protein structures is therefore a key to understanding their function. Advances in high throughput experimental technologies have lead to an exponential increase in the availability of known protein sequences. Although strong progress has been made in experimental protein structure determination, it remains a fact that more than 99% of structural information is provided by computational modeling methods. We describe here a novel structure prediction method, SmotifTF, which uses a unique library of known protein fragments to assemble the three-dimensional structure of a sequence. The fragment library has saturated over time and therefore provides a complete set of building blocks required for model building. The method performs competitively compared to existing methods of structure prediction.
Collapse
Affiliation(s)
- Brinda Vallat
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, New York, United States of America
| | - Carlos Madrid-Aliste
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, New York, United States of America
| | - Andras Fiser
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, New York, United States of America
| |
Collapse
|
9
|
Tang K, Zhang J, Liang J. Fast protein loop sampling and structure prediction using distance-guided sequential chain-growth Monte Carlo method. PLoS Comput Biol 2014; 10:e1003539. [PMID: 24763317 PMCID: PMC3998890 DOI: 10.1371/journal.pcbi.1003539] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2013] [Accepted: 02/01/2014] [Indexed: 11/18/2022] Open
Abstract
Loops in proteins are flexible regions connecting regular secondary structures. They are often involved in protein functions through interacting with other molecules. The irregularity and flexibility of loops make their structures difficult to determine experimentally and challenging to model computationally. Conformation sampling and energy evaluation are the two key components in loop modeling. We have developed a new method for loop conformation sampling and prediction based on a chain growth sequential Monte Carlo sampling strategy, called Distance-guided Sequential chain-Growth Monte Carlo (DISGRO). With an energy function designed specifically for loops, our method can efficiently generate high quality loop conformations with low energy that are enriched with near-native loop structures. The average minimum global backbone RMSD for 1,000 conformations of 12-residue loops is 1:53 A° , with a lowest energy RMSD of 2:99 A° , and an average ensembleRMSD of 5:23 A° . A novel geometric criterion is applied to speed up calculations. The computational cost of generating 1,000 conformations for each of the x loops in a benchmark dataset is only about 10 cpu minutes for 12-residue loops, compared to ca 180 cpu minutes using the FALCm method. Test results on benchmark datasets show that DISGRO performs comparably or better than previous successful methods, while requiring far less computing time. DISGRO is especially effective in modeling longer loops (10-17 residues).
Collapse
Affiliation(s)
- Ke Tang
- Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, United States of America
| | - Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, Florida, United States of America
- * E-mail: (JZ); (JL)
| | - Jie Liang
- Department of Bioengineering, University of Illinois at Chicago, Chicago, Illinois, United States of America
- * E-mail: (JZ); (JL)
| |
Collapse
|
10
|
Dasgupta B, Dey S, Chakrabarti P. Water and side-chain embedded π-turns. Biopolymers 2014; 101:441-53. [PMID: 23996674 DOI: 10.1002/bip.22401] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2013] [Revised: 08/24/2013] [Accepted: 08/26/2013] [Indexed: 11/08/2022]
Affiliation(s)
- Bhaskar Dasgupta
- Department of Biochemistry; Bose Institute; P-1/12 CIT Scheme VIIM Kolkata West Bengal 700 054 India
| | - Sucharita Dey
- Bioinformatics Centre; Bose Institute; P-1/12 CIT Scheme VIIM Kolkata West Bengal 700 054 India
| | - Pinak Chakrabarti
- Department of Biochemistry; Bose Institute; P-1/12 CIT Scheme VIIM Kolkata West Bengal 700 054 India
- Bioinformatics Centre; Bose Institute; P-1/12 CIT Scheme VIIM Kolkata West Bengal 700 054 India
| |
Collapse
|