1
|
An interactive visualization tool for educational outreach in protein contact map overlap analysis. Frontiers in Bioinformatics 2024; 4:1358550. [PMID: 38562910 PMCID: PMC10982686 DOI: 10.3389/fbinf.2024.1358550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2023] [Accepted: 03/04/2024] [Indexed: 04/04/2024] Open
Abstract
Recent advancements in contact map-based protein three-dimensional (3D) structure prediction have been driven by the evolution of deep learning algorithms. However, the gap in accessible software tools for novices in this domain remains a significant challenge. This study introduces GoFold, a novel, standalone graphical user interface (GUI) designed for beginners to solve contact map overlap (CMO) problems for better template selection. Unlike existing tools that cater more to research needs or assume foundational knowledge, GoFold offers an intuitive, user-friendly platform with comprehensive tutorials. It stands out in its ability to visually represent the CMO problem and accepts protein input in various formats. The educational value of GoFold is demonstrated through benchmarking against the state-of-the-art contact map overlap method, map_align, using two datasets: PSICOV and CAMEO. GoFold exhibits superior performance in terms of TM-score and Z-score metrics across diverse qualities of contact maps and target difficulties. Notably, GoFold runs efficiently on personal computers without any third-party dependencies, thereby making it accessible to the general public for promoting citizen science. The tool is freely available for download for macOS, Linux, and Windows.
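To make the CMO objective concrete, here is a minimal Python sketch. It is not GoFold's or map_align's implementation: the function names, the 8 Å cutoff, and the minimum sequence separation of 3 are illustrative assumptions. It builds a contact map from per-residue coordinates and scores one fixed alignment by counting the contacts that the two maps share. The CMO problem itself asks for the alignment that maximizes this count, and that search, which is the hard part, is not attempted here.

```python
from math import dist


def contact_map(coords, cutoff=8.0, min_sep=3):
    """Return the set of residue pairs (i, j), i < j, closer than `cutoff`.

    `coords` is a list of (x, y, z) tuples, one per residue (for example the
    C-alpha atoms). The cutoff and minimum sequence separation are assumed
    defaults, not values taken from the abstract.
    """
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + min_sep, n)
        if dist(coords[i], coords[j]) < cutoff
    }


def overlap(contacts_a, contacts_b, alignment):
    """Count contacts of A that land on contacts of B under `alignment`.

    `alignment` maps residue indices of A to residue indices of B and must be
    order-preserving, so that i < j implies alignment[i] < alignment[j].
    """
    shared = 0
    for i, j in contacts_a:
        if i in alignment and j in alignment:
            if (alignment[i], alignment[j]) in contacts_b:
                shared += 1
    return shared
```

For example, with `contacts_a = {(0, 3), (1, 4), (0, 5)}`, `contacts_b = {(0, 3), (2, 5)}`, and the alignment `{0: 0, 1: 2, 3: 3, 4: 5}`, two contacts of A map onto contacts of B; the third involves an unaligned residue and is not counted.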
|
2
|
Recent Progress of Protein Tertiary Structure Prediction. Molecules 2024; 29:832. [PMID: 38398585 PMCID: PMC10893003 DOI: 10.3390/molecules29040832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 02/06/2024] [Accepted: 02/08/2024] [Indexed: 02/25/2024] Open
Abstract
The prediction of three-dimensional (3D) protein structure from amino acid sequences has stood as a significant challenge in computational and structural bioinformatics for decades. Recently, the widespread integration of artificial intelligence (AI) algorithms has substantially expedited advancements in protein structure prediction, yielding numerous significant milestones. In particular, the end-to-end deep learning method AlphaFold2 has raised structure prediction performance to new heights, producing models regularly competitive with experimental structures in the 14th Critical Assessment of Protein Structure Prediction (CASP14). To give researchers a comprehensive overview and to guide future research, this review describes various methodologies, assessments, and databases in protein structure prediction, including traditionally used methods, such as template-based modeling (TBM) and template-free modeling (FM) approaches; recently developed deep learning-based methods, such as contact/distance-guided methods, end-to-end folding methods, and protein language model (PLM)-based methods; multi-domain protein structure prediction methods; the CASP experiments and related assessments; and the recently released AlphaFold Protein Structure Database (AlphaFold DB). We discuss their advantages, disadvantages, and application scopes, aiming to help researchers understand the limitations and contexts of protein structure prediction methods and select them effectively in protein-related fields.
|
3
|
General features of transmembrane beta barrels from a large database. Proc Natl Acad Sci U S A 2023; 120:e2220762120. [PMID: 37432995 PMCID: PMC10629564 DOI: 10.1073/pnas.2220762120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 06/03/2023] [Indexed: 07/13/2023] Open
Abstract
Large datasets contribute new insights to subjects formerly investigated by exemplars. We used coevolution data to create a large, high-quality database of transmembrane β-barrels (TMBB). By applying simple feature detection on generated evolutionary contact maps, our method (IsItABarrel) achieves 95.88% balanced accuracy when discriminating among protein classes. Moreover, comparison with IsItABarrel revealed a high rate of false positives in previous TMBB algorithms. In addition to being more accurate than previous datasets, our database (available online) contains 1,938,936 bacterial TMBB proteins from 38 phyla, respectively, 17 and 2.2 times larger than the previous sets TMBB-DB and OMPdb. We anticipate that due to its quality and size, the database will serve as a useful resource where high-quality TMBB sequence data are required. We found that TMBBs can be divided into 11 types, three of which have not been previously reported. We find tremendous variance in proteome percentage among TMBB-containing organisms with some using 6.79% of their proteome for TMBBs and others using as little as 0.27% of their proteome. The distribution of the lengths of the TMBBs is suggestive of previously hypothesized duplication events. In addition, we find that the C-terminal β-signal varies among different classes of bacteria though its consensus sequence is LGLGYRF. However, this β-signal is only characteristic of prototypical TMBBs. The ten non-prototypical barrel types have other C-terminal motifs, and it remains to be determined if these alternative motifs facilitate TMBB insertion or perform any other signaling function.
|
4
|
Folding non-homologous proteins by coupling deep-learning contact maps with I-TASSER assembly simulations. Cell Reports Methods 2021; 1:100014. [PMID: 34355210 PMCID: PMC8336924 DOI: 10.1016/j.crmeth.2021.100014] [Citation(s) in RCA: 215] [Impact Index Per Article: 71.7] [Received: 01/25/2021] [Revised: 04/22/2021] [Accepted: 05/03/2021] [Indexed: 12/23/2022]
Abstract
Structure prediction for proteins lacking homologous templates in the Protein Data Bank (PDB) remains a significant unsolved problem. We developed a protocol, C-I-TASSER, to integrate interresidue contact maps from deep neural-network learning with the cutting-edge I-TASSER fragment assembly simulations. Large-scale benchmark tests showed that C-I-TASSER can fold more than twice as many non-homologous proteins as I-TASSER, which does not use contacts. When applied to a folding experiment on 8,266 unsolved Pfam families, C-I-TASSER successfully folded 4,162 domain families, including 504 folds that are not found in the PDB. Furthermore, it created correct folds for 85% of proteins in the SARS-CoV-2 genome, despite the high mutation rate of the virus and sparse sequence profiles. The results demonstrated the critical importance of coupling whole-genome and metagenome-based evolutionary information with optimal structure assembly simulations for solving the problem of non-homologous protein structure prediction.
|
5
|
Toward the solution of the protein structure prediction problem. J Biol Chem 2021; 297:100870. [PMID: 34119522 PMCID: PMC8254035 DOI: 10.1016/j.jbc.2021.100870] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Received: 04/20/2021] [Revised: 06/07/2021] [Accepted: 06/09/2021] [Indexed: 11/20/2022] Open
Abstract
Since Anfinsen demonstrated in 1973 that the information encoded in a protein's amino acid sequence determines its structure, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction approaches is to utilize computational modeling to determine the spatial location of every atom in a protein molecule starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have been historically categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM has been the most reliable approach to predicting protein structures, and in the absence of reliable templates, the modeling accuracy sharply declines. Nevertheless, the results of the most recent community-wide Critical Assessment of Protein Structure Prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, where correct folds could be built for nearly all single-domain proteins without using the PDB templates. Critically, the model quality exhibited little correlation with the quality of available template structures or with the number of sequence homologs detected for a given target protein. Thus, the implementation of deep-learning techniques has essentially broken through the 50-year-old modeling border between TBM and FM approaches and has made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.
|
6
|
Structure elements can be predicted using the contact volume among protein residues. Biophys Physicobiol 2021; 18:50-59. [PMID: 33954082 PMCID: PMC8049775 DOI: 10.2142/biophysico.bppb-v18.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2020] [Accepted: 02/15/2021] [Indexed: 12/01/2022] Open
Abstract
Previously, the structure elements of dihydrofolate reductase (DHFR), which are assumed to be a kind of protein “building block,” were determined using comprehensive Ala-insertion mutation analysis. It is hypothesized that our comprehension of the structure elements could lead to understanding how an amino acid sequence dictates its tertiary structure. However, the comprehensive Ala-insertion mutation analysis is a time-consuming and costly process, and only a set of the DHFR structure elements has been reported so far. Therefore, developing a computational method to predict structure elements is an urgent necessity. We focused on intramolecular residue–residue contacts to predict the structure elements. We introduced a simple and effective parameter, the overlapped contact volume (CV) among the residues, and calculated the CV along the DHFR sequence using the crystal structure. Our results indicate that the CV profile can recapitulate its precipitate ratio profile, which was used to define the structure elements in the Ala-insertion mutation analysis. The CV profile allowed us to predict structure elements similar to the experimentally determined ones. The strong correlation between the CV and precipitate ratio profiles indicates the importance of the intramolecular residue–residue contact in maintaining the tertiary structure. Additionally, the CVs between the structure elements are considerably larger than those between a structure element and a linker or between two linkers, indicating that the structure elements play a fundamental role in increasing the intramolecular adhesion. Thus, we propose that the structure elements can be considered a type of “building block” that maintains and dictates the tertiary structures of proteins.
|
7
|
Dimer Interface Organization is a Main Determinant of Intermonomeric Interactions and Correlates with Evolutionary Relationships of Retroviral and Retroviral-Like Ddi1 and Ddi2 Proteases. Int J Mol Sci 2020; 21:1352. [PMID: 32079302 PMCID: PMC7072860 DOI: 10.3390/ijms21041352] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/13/2020] [Revised: 02/11/2020] [Accepted: 02/14/2020] [Indexed: 02/07/2023] Open
Abstract
The life cycles of retroviruses rely on the limited proteolysis catalyzed by the viral protease. Numerous eukaryotic organisms also endogenously express such proteases, which originate from retrotransposons or retroviruses, including DNA damage-inducible 1 and 2 (Ddi1 and Ddi2, respectively) proteins. In this study, we performed a comparative analysis based on the structural data currently available in the Protein Data Bank (PDB) and Structural summaries of PDB entries (PDBsum) databases, with a special emphasis on the regions involved in dimerization of retroviral and retroviral-like Ddi proteases. In addition to Ddi1 and Ddi2, at least one member of all seven genera of the Retroviridae family was included in this comparison. We found that the studied retroviral and non-viral proteases show differences in the mode of dimerization and density of intermonomeric contacts, and that the distribution of the structural characteristics is in agreement with their evolutionary relationships. Multiple sequence and structure alignments revealed that the interactions between the subunits depend mainly on the overall organization of the dimer interface. We think that a better understanding of the general and specific features of proteases may support the characterization of retroviral-like proteases.
|
8
|
SPOT-Fold: Fragment-Free Protein Structure Prediction Guided by Predicted Backbone Structure and Contact Map. J Comput Chem 2019; 41:745-750. [PMID: 31845383 DOI: 10.1002/jcc.26132] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 03/22/2019] [Revised: 10/07/2019] [Accepted: 12/01/2019] [Indexed: 02/01/2023]
Abstract
Protein structure determination has been one of the most challenging problems in molecular biology for the past 60 years. Here we present an ab initio protein tertiary-structure prediction method assisted by predicted contact maps from SPOT-Contact and predicted dihedral angles from SPIDER 3. These predicted properties were then fed to the crystallography and NMR system (CNS) for restrained structure modeling. The resulting structures are first evaluated by the potential energy calculated by CNS, followed by the dDFIRE energy function for model selection. The method, called SPOT-Fold, has been tested on 241 CASP targets between 67 and 670 amino acid residues and on 60 randomly selected globular proteins under 100 amino acids. The method has an accuracy comparable to that of other contact-map-based modeling techniques. © 2019 Wiley Periodicals, Inc.
|
9
|
Folding with a protein's native shortcut network. Proteins 2019; 86:924-934. [PMID: 29790602 DOI: 10.1002/prot.25524] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/30/2017] [Revised: 04/13/2018] [Accepted: 05/14/2018] [Indexed: 11/09/2022]
Abstract
A complex network approach to protein folding is proposed, wherein a protein's contact map is reconceptualized as a network of shortcut edges, and folding is steered by a structural characteristic of this network. Shortcut networks are generated by a known message passing algorithm operating on protein residue networks. It is found that the shortcut networks of native structures (SCN0s) are relevant graph objects with which to study protein folding at a formal level. The logarithmic form of their contact order (SCN0_lnCO) correlates significantly with the folding rate of two-state and non-two-state proteins. The clustering coefficient of SCN0s (CSCN0) correlates significantly with folding rate, transition-state placement and stability of two-state folders. Reasonable folding pathways for several model proteins are produced when CSCN0 is used to combine protein segments incrementally to form the native structure. The folding bias captured by CSCN0 is detectable in non-native structures, as evidenced by configurations generated by Molecular Dynamics simulation for the fast-folding Villin-headpiece peptide. These results support the use of shortcut networks to investigate the role protein geometry plays in the folding of both small and large globular proteins, and have implications for the design of multibody interaction schemes in folding models. One facet of this geometry is the set of native shortcut triangles, whose attributes are found to be well-suited to identify dehydrated intraprotein areas in tight turns, or at the interface of different secondary structure elements.
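The two graph statistics named above can be sketched for any contact graph given as a set of residue pairs. The abstract does not give the article's exact definitions or its shortcut-edge construction, so the following Python sketch rests on assumptions: it uses the commonly used relative contact order (mean sequence separation of contacts divided by chain length) and the average local clustering coefficient over residues that have at least one contact. The function names are illustrative.

```python
def relative_contact_order(contacts, n_residues):
    """Mean sequence separation |j - i| of the contacts, divided by chain length."""
    contacts = list(contacts)
    if not contacts:
        raise ValueError("at least one contact is required")
    total_separation = sum(abs(j - i) for i, j in contacts)
    return total_separation / (n_residues * len(contacts))


def clustering_coefficient(contacts):
    """Average local clustering coefficient of the undirected contact graph.

    Only residues that appear in at least one contact are averaged over;
    residues with fewer than two neighbours contribute 0.
    """
    neighbours = {}
    for i, j in contacts:
        neighbours.setdefault(i, set()).add(j)
        neighbours.setdefault(j, set()).add(i)
    coefficients = []
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            coefficients.append(0.0)
            continue
        # Count edges that connect two neighbours of the current residue.
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in neighbours[a])
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)
```

Taking `math.log` of the first value gives a logarithmic contact order in the spirit of SCN0_lnCO, provided the input is restricted to the shortcut edges, which this sketch does not compute.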
|
10
|
Juicebox.js Provides a Cloud-Based Visualization System for Hi-C Data. Cell Syst 2018; 6:256-258.e1. [PMID: 29428417 PMCID: PMC6047755 DOI: 10.1016/j.cels.2018.01.001] [Citation(s) in RCA: 201] [Impact Index Per Article: 33.5] [Received: 10/18/2017] [Revised: 11/22/2017] [Accepted: 12/30/2017] [Indexed: 11/22/2022]
Abstract
Contact mapping experiments such as Hi-C explore how genomes fold in 3D. Here, we introduce Juicebox.js, a cloud-based web application for exploring the resulting datasets. Like the original Juicebox application, Juicebox.js allows users to zoom in and out of such datasets using an interface similar to Google Earth. Juicebox.js also has many features designed to facilitate data reproducibility and sharing. Furthermore, Juicebox.js encodes the exact state of the browser in a shareable URL. Creating a public browser for a new Hi-C dataset does not require coding and can be accomplished in under a minute. The web app also makes it possible to create interactive figures online that can complement or replace ordinary journal figures. When combined with Juicer, this makes the entire process of data analysis transparent, insofar as every step from raw reads to published figure is publicly available as open source code.
|
11
|
Mechanisms for the inhibition of amyloid aggregation by small ligands. Biosci Rep 2016; 36:BSR20160101. [PMID: 27512096 PMCID: PMC5041158 DOI: 10.1042/bsr20160101] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 04/06/2016] [Accepted: 08/10/2016] [Indexed: 12/14/2022] Open
Abstract
This work investigates, by biochemical, biophysical and MD techniques, the opposite anti-amyloid properties of resveratrol and rosmarinic acid on the aggregation of hen egg white lysozyme (HEWL). Differences in association energy and contact maps were found that explain the different behaviours. The formation of amyloid aggregates is the hallmark of systemic and neurodegenerative disorders, also known as amyloidoses. Many proteins have been found to aggregate into amyloid-like fibrils, and this process is recognized as a general tendency of polypeptides. Lysozyme, an antibacterial protein, is a well-studied model since it is associated with systemic amyloidosis in humans and is widely available from chicken eggs (HEWL). In the present study we investigated the mechanism of interaction of aggregating HEWL with rosmarinic acid and resveratrol, which we verified to be effective and ineffective, respectively, in inhibiting aggregate formation. We used a multidisciplinary strategy to characterize such effects, combining biochemical and biophysical methods with molecular dynamics (MD) simulations on the HEWL peptide 49–64 to gain insights into the mechanisms and energy variations associated with amyloid formation and inhibition. MD revealed that neither resveratrol nor rosmarinic acid was able to compete with the initial formation of the β-sheet structure. We then tested the association of two β-sheets, representing the model of an amyloid core structure. MD showed that rosmarinic acid displayed an interaction energy and a contact map comparable to those of sheet pairings. In contrast, the association energy of resveratrol was found to be much lower and its contact map largely different from that of sheet pairings. The overall characterization elucidated a possible mechanism explaining why, in this model, resveratrol is inactive in blocking fibril formation, whereas rosmarinic acid is a powerful inhibitor.
|
12
|
Abstract
The contact map of a protein fold in the two-dimensional (2D) square lattice has arc length at least 3, and each internal vertex has degree at most 2, whereas the two terminal vertices have degree at most 3. Recently, Chen, Guo, Sun, and Wang studied the enumeration of m-regular linear stacks, where each arc has length at least m and the degree of each vertex is bounded by 2. Since the two terminal points in a protein fold in the 2D square lattice may form contacts with at most three adjacent lattice points, we are led to the study of extended m-regular linear stacks, in which the degree of each terminal point is bounded by 3. This model is close to real protein contact maps. Denote the generating functions of the m-regular linear stacks and the extended m-regular linear stacks by [Formula: see text] and [Formula: see text], respectively. We show that [Formula: see text] can be written as a rational function of [Formula: see text]. For a certain [Formula: see text], by eliminating [Formula: see text], we obtain an equation satisfied by [Formula: see text] and derive the asymptotic formula of the numbers of m-regular linear stacks of length [Formula: see text].
|
13
|
Fast assessment of structural models of ion channels based on their predicted current-voltage characteristics. Proteins 2015; 84:217-31. [PMID: 26650347 DOI: 10.1002/prot.24967] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/03/2015] [Revised: 11/19/2015] [Accepted: 11/29/2015] [Indexed: 11/11/2022]
Abstract
Computational prediction of protein structures is a difficult task, which involves fast and accurate evaluation of candidate model structures. We propose to enhance single-model quality assessment with a functionality evaluation phase for proteins whose quantitative functional characteristics are known. In particular, this idea can be applied to evaluation of structural models of ion channels, whose main function, conducting ions, can be quantitatively measured with the patch-clamp technique, which provides the current-voltage characteristics. The study was performed on a set of KcsA channel models obtained from complete and incomplete contact maps. A fast continuous electrodiffusion model was used for calculating the current-voltage characteristics of structural models. We found that the computed charge selectivity and total current were sensitive to the structural and electrostatic quality of models. In practical terms, we show that evaluating predicted conductance values is an appropriate method to eliminate models with an occluded pore or with multiple erroneously created pores. Moreover, filtering models on the basis of their predicted charge selectivity results in a substantial enrichment of the candidate set in highly accurate models. Tests on three other ion channels indicate that, in addition to being a proof of concept, our function-oriented single-model quality assessment method can be directly applied to evaluation of structural models of some classes of protein channels. Finally, our work raises the important question of whether a computational validation of functionality should be included in the evaluation process of structural models whenever possible.
|
14
|
AlloRep: A Repository of Sequence, Structural and Mutagenesis Data for the LacI/GalR Transcription Regulators. J Mol Biol 2015; 428:671-678. [PMID: 26410588 DOI: 10.1016/j.jmb.2015.09.015] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 06/30/2015] [Revised: 09/04/2015] [Accepted: 09/17/2015] [Indexed: 11/20/2022]
Abstract
Protein families evolve functional variation by accumulating point mutations at functionally important amino acid positions. Homologs in the LacI/GalR family of transcription regulators have evolved to bind diverse DNA sequences and allosteric regulatory molecules. In addition to playing key roles in bacterial metabolism, these proteins have been widely used as a model family for benchmarking structural and functional prediction algorithms. We have collected manually curated sequence alignments for >3000 sequences, in vivo phenotypic and biochemical data for >5750 LacI/GalR mutational variants, and noncovalent residue contact networks for 65 LacI/GalR homolog structures. Using this rich data resource, we compared the noncovalent residue contact networks of the LacI/GalR subfamilies to design and experimentally validate an allosteric mutant of a synthetic LacI/GalR repressor for use in biotechnology. The AlloRep database (freely available at www.AlloRep.org) is a key resource for future evolutionary studies of LacI/GalR homologs and for benchmarking computational predictions of functional change.
|
15
|
Abstract
The contact map of a protein fold is a graph that represents the patterns of contacts in the fold. It is known that the contact map can be decomposed into stacks and queues. RNA secondary structures are special stacks in which the degree of each vertex is at most one and each arc has length of at least two. Waterman and Smith derived a formula for the number of RNA secondary structures of length n with exactly k arcs. Höner zu Siederdissen et al. developed a folding algorithm for extended RNA secondary structures in which each vertex has maximum degree two. An equation for the generating function of extended RNA secondary structures was obtained by Müller and Nebel by using a context-free grammar approach, which leads to an asymptotic formula. In this article, we consider m-regular linear stacks, where each arc has length at least m and the degree of each vertex is bounded by two. Extended RNA secondary structures are exactly 2-regular linear stacks. For any m ≥ 2, we obtain an equation for the generating function of the m-regular linear stacks. For given m, we deduce a recurrence relation and an asymptotic formula for the number of m-regular linear stacks on n vertices. To establish the equation, we use the reduction operation of Chen, Deng, and Du to transform an m-regular linear stack to an m-reduced zigzag (or alternating) stack. Then we find an equation for m-reduced zigzag stacks leading to an equation for m-regular linear stacks.
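As a small illustration of this kind of enumeration, the Python sketch below counts the simpler objects described at the start of the abstract: noncrossing structures in which every vertex has degree at most one and every arc has length at least `min_arc` (RNA secondary structures when `min_arc` is 2). It uses a standard decomposition on the last vertex, which is either unpaired or paired with an earlier vertex that splits the rest into an inside and an outside part. This is an assumption-labelled sketch, not the article's generating-function method, and it does not count m-regular linear stacks, where vertices may have degree two. The function name and parameter are illustrative.

```python
from functools import lru_cache


def count_structures(n, min_arc=2):
    """Count noncrossing structures on n vertices with vertex degree <= 1
    and every arc (i, j) satisfying j - i >= min_arc."""

    @lru_cache(maxsize=None)
    def s(k):
        if k == 0:
            return 1  # the empty structure
        total = s(k - 1)  # vertex k is unpaired
        # Vertex k is paired with vertex i, leaving i - 1 vertices outside
        # the arc and k - 1 - i vertices inside it.
        for i in range(1, k - min_arc + 1):
            total += s(i - 1) * s(k - 1 - i)
        return total

    return s(n)
```

With `min_arc=2`, the counts for n = 0 through 7 are 1, 1, 1, 2, 4, 8, 17, 37.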
|