1
Li Z, Fan H, Ding W. Solving protein structures by combining structure prediction, molecular replacement and direct-methods-aided model completion. IUCrJ 2024; 11:152-167. [PMID: 38214490 PMCID: PMC10916285 DOI: 10.1107/s2052252523010291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2023] [Accepted: 11/29/2023] [Indexed: 01/13/2024]
Abstract
Highly accurate protein structure prediction can generate accurate models of protein and protein-protein complexes in X-ray crystallography. However, the question of how to make more effective use of predicted models for completing structure analysis, and which strategies should be employed for the more challenging cases such as multi-helical structures, multimeric structures and extremely large structures, both in the model preparation and in the completion steps, remains open for discussion. In this paper, a new strategy is proposed based on the framework of direct methods and dual-space iteration, which can greatly simplify the pre-processing steps of predicted models both in normal and in challenging cases. Following this strategy, full-length models or the conservative structural domains could be used directly as the starting model, and the phase error and the model bias between the starting model and the real structure would be modified in the direct-methods-based dual-space iteration. Many challenging cases (from CASP14) have been tested for the general applicability of this constructive strategy, and almost complete models have been generated with reasonable statistics. The hybrid strategy therefore provides a meaningful scheme for X-ray structure determination using a predicted model as the starting point.
Affiliation(s)
- Zengru Li
  - Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
  - School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, People’s Republic of China
- Haifu Fan
  - Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
- Wei Ding
  - Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, Chinese Academy of Sciences, Beijing 100190, People’s Republic of China
2
Kryshtafovych A, Rigden DJ. To split or not to split: CASP15 targets and their processing into tertiary structure evaluation units. Proteins 2023; 91:1558-1570. [PMID: 37254889 PMCID: PMC10687315 DOI: 10.1002/prot.26533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Revised: 05/02/2023] [Accepted: 05/18/2023] [Indexed: 06/01/2023]
Abstract
Processing of CASP15 targets into evaluation units (EUs) and assigning them to evolutionary-based prediction classes is presented in this study. The targets were first split into structural domains based on compactness and similarity to other proteins. Models were then evaluated against these domains and their combinations. The domains were joined into larger EUs if predictors' performance on the combined units was similar to that on individual domains. Alternatively, if most predictors performed better on the individual domains, then they were retained as EUs. As a result, 112 evaluation units were created from 77 tertiary structure prediction targets. The EUs were assigned to four prediction classes roughly corresponding to target difficulty categories in previous CASPs: TBM (template-based modeling, easy or hard), FM (free modeling), and the TBM/FM overlap category. More than a third of CASP15 EUs were attributed to the historically most challenging FM class, where homology or structural analogy to proteins of known fold cannot be detected.
Affiliation(s)
- Daniel J. Rigden
  - Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
3
Liu N, Yang Z, Liu Y, Dang X, Zhang Q, Wang J, Liu X, Zhang J, Pan X. Identification of a Putative SARS-CoV-2 Main Protease Inhibitor through In Silico Screening of Self-Designed Molecular Library. Int J Mol Sci 2023; 24:11390. [PMID: 37511149 PMCID: PMC10379331 DOI: 10.3390/ijms241411390] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 05/24/2023] [Revised: 07/06/2023] [Accepted: 07/11/2023] [Indexed: 07/30/2023]
Abstract
There have been outbreaks of SARS-CoV-2 around the world for over three years, and its variants continue to evolve. This has become a major global health threat. The main protease (Mpro, also called 3CLpro) plays a key role in viral replication and proliferation, making it an attractive drug target. Here, we have identified a novel potential inhibitor of Mpro, by applying the virtual screening of hundreds of nilotinib-structure-like compounds that we designed and synthesized. The screened compounds were assessed using SP docking, XP docking, MM-GBSA analysis, IFD docking, MD simulation, ADME/T prediction, and then an enzymatic assay in vitro. We finally identified the compound V291 as a potential SARS-CoV-2 Mpro inhibitor, with a high docking affinity and enzyme inhibitory activity. Moreover, the docking results indicate that His41 is a favorable amino acid for pi-pi interactions, while Glu166 can participate in salt-bridge formation with the protonated primary or secondary amines in the screened molecules. Thus, the compounds reported here are capable of engaging the key amino acids His41 and Glu166 in ligand-receptor interactions. A pharmacophore analysis further validates this assertion.
Affiliation(s)
- Nanxin Liu
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Zeyu Yang
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Yuying Liu
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Xintao Dang
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Qingqing Zhang
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Jin Wang
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Xueying Liu
  - School of Pharmacy, The Fourth Military Medical University, Xi'an 710032, China
- Jie Zhang
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
- Xiaoyan Pan
  - School of Pharmacy, Health Science Center, Xi'an Jiaotong University, Xi'an 710061, China
4
Yu ZZ, Peng CX, Liu J, Zhang B, Zhou XG, Zhang GJ. DomBpred: Protein Domain Boundary Prediction Based on Domain-Residue Clustering Using Inter-Residue Distance. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:912-922. [PMID: 35594218 DOI: 10.1109/tcbb.2022.3175905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Domain boundary prediction is one of the most important problems in the study of protein structure and function, especially for large proteins. At present, most domain boundary prediction methods have low accuracy and limitations in dealing with multi-domain proteins. In this study, we develop a sequence-based protein domain boundary prediction method, named DomBpred. In DomBpred, the input sequence is first classified as either a single-domain protein or a multi-domain protein through a designed effective sequence metric based on a constructed single-domain sequence library. For a multi-domain protein, a domain-residue clustering algorithm inspired by the Ising model is proposed to cluster the spatially close residues according to inter-residue distance. The unclassified residues and the residues at the edge of each cluster are then tuned by the secondary structure to form potential cut points. Finally, a domain boundary scoring function is proposed to recursively evaluate the potential cut points to generate the domain boundary. DomBpred is tested on a large-scale test set of FUpred comprising 2549 proteins. Experimental results show that DomBpred performs better than the state-of-the-art methods in classifying whether protein sequences are composed of single or multiple domains, and the Matthews correlation coefficient is 0.882. Moreover, on 849 multi-domain proteins, the domain boundary distance and normalised domain overlap scores of DomBpred are 0.523 and 0.824, respectively, which are 5.0% and 4.2% higher than those of the best comparison method. Comparison with other methods on the given test set shows that DomBpred outperforms most state-of-the-art sequence-based methods and even achieves better results than the top-level template-based method. The executable program is freely available at https://github.com/iobio-zjut/DomBpred and the online server at http://zhanglab-bioinf.com/DomBpred/.
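The correlation coefficient quoted above for the single-domain versus multi-domain classification is computed from a binary confusion matrix. The following is a minimal sketch of the standard formula; the function name and the example counts are illustrative assumptions and are not taken from the DomBpred paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Here a "positive" could mean "multi-domain protein" (an assumption made
    only for this illustration).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, chosen only to exercise the function.
print(round(matthews_corrcoef(tp=40, tn=50, fp=5, fn=5), 3))
```

The coefficient ranges from -1 to 1 and stays informative when the two classes are imbalanced, which is one common reason to prefer it over plain accuracy.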
5
Zhu K, Su H, Peng Z, Yang J. A unified approach to protein domain parsing with inter-residue distance matrix. Bioinformatics 2023; 39:7025502. [PMID: 36734597 PMCID: PMC9919455 DOI: 10.1093/bioinformatics/btad070] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/15/2022] [Revised: 01/02/2023] [Accepted: 02/01/2023] [Indexed: 02/04/2023]
Abstract
MOTIVATION: It is fundamental to cut multi-domain proteins into individual domains, for precise domain-based structural and functional studies. In the past, sequence-based and structure-based domain parsing was carried out independently with different methodologies. The recent progress in deep learning-based protein structure prediction provides the opportunity to unify sequence-based and structure-based domain parsing. RESULTS: Based on the inter-residue distance matrix, which can be either derived from the input structure or predicted by trRosettaX, we can decode the domain boundaries under a unified framework. We name the proposed method UniDoc. The principle of UniDoc is based on the well-accepted physical concept of maximizing intra-domain interaction while minimizing inter-domain interaction. Comprehensive tests on five benchmark datasets indicate that UniDoc outperforms other state-of-the-art methods in terms of both accuracy and speed, for both sequence-based and structure-based domain parsing. The major contribution of UniDoc is providing a unified framework for structure-based and sequence-based domain parsing. We hope that UniDoc would be a convenient tool for protein domain analysis. AVAILABILITY AND IMPLEMENTATION: https://yanglab.nankai.edu.cn/UniDoc/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
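The principle of maximizing intra-domain interaction while minimizing inter-domain interaction can be illustrated on a toy example. The sketch below is not the UniDoc algorithm: it only scans single cut points along the chain and scores each with a normalized-cut-like ratio. The 8 Å contact cutoff, the scoring function and all names are assumptions made for illustration.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map from 3D coordinates (the cutoff is an assumed value)."""
    n = len(coords)
    return [
        [1 if i != j and math.dist(coords[i], coords[j]) <= cutoff else 0
         for j in range(n)]
        for i in range(n)
    ]


def best_single_cut(contacts, min_len=2):
    """Return (k, score) for the split [0, k) | [k, n) with the lowest score.

    The score is inter/intra_a + inter/intra_b, so it is small when few
    contacts cross the cut and many contacts lie inside each part.
    """
    n = len(contacts)
    best_k, best_score = None, float("inf")
    for k in range(min_len, n - min_len + 1):
        inter = sum(contacts[i][j] for i in range(k) for j in range(k, n))
        intra_a = sum(contacts[i][j] for i in range(k) for j in range(i + 1, k))
        intra_b = sum(contacts[i][j] for i in range(k, n) for j in range(i + 1, n))
        if intra_a == 0 or intra_b == 0:
            continue  # skip splits that leave a part without internal contacts
        score = inter / intra_a + inter / intra_b
        if score < best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Toy chain: two compact groups of five residues separated by a 16 Å gap.
toy_coords = [(float(x), 0.0, 0.0) for x in range(5)]
toy_coords += [(20.0 + x, 0.0, 0.0) for x in range(5)]
print(best_single_cut(contact_map(toy_coords)))
```

A real parser would also need to handle discontinuous domains and more than two domains, which a single linear cut cannot represent.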
Affiliation(s)
- Kun Zhu
  - School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Hong Su
  - School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Zhenling Peng
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Jianyi Yang
  - Ministry of Education Frontiers Science Center for Nonlinear Expectations, Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
6
Taheri-Ledari M, Zandieh A, Shariatpanahi SP, Eslahchi C. Assignment of structural domains in proteins using diffusion kernels on graphs. BMC Bioinformatics 2022; 23:369. [PMID: 36076174 PMCID: PMC9461149 DOI: 10.1186/s12859-022-04902-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2021] [Accepted: 08/23/2022] [Indexed: 11/10/2022]
Abstract
Though proposing algorithmic approaches for protein domain decomposition has been of high interest, the inherent ambiguity of the problem keeps it an active area of research. Besides, accurate automated methods are in high demand as the number of solved structures for complex proteins is on the rise. While the majority of previous efforts to decompose 3D structures have centered on developing clustering algorithms, employing enhanced measures of proximity between amino acids has remained rather uncharted. If there exists a kernel function in whose reproducing kernel Hilbert space the structural domains of proteins become well separated, then protein structures can be parsed into domains without the need to use a complex clustering algorithm. Inspired by this idea, we developed a protein domain decomposition method based on diffusion kernels on protein graphs. We examined all combinations of four graph node kernels and two clustering algorithms to investigate their capability to decompose protein structures. The proposed method is tested on five of the most commonly used benchmark datasets for protein domain assignment plus a comprehensive non-redundant dataset. The results show a competitive performance of the method utilizing one of the diffusion kernels compared to four of the best automatic methods. Our method is also able to offer alternative partitionings for the same structure, which is in line with the subjective definition of a protein domain. With a competitive accuracy and balanced performance for simple and complex structures, despite relying on a relatively naive criterion to choose the optimal decomposition, the proposed method revealed that diffusion kernels on graphs in particular, and kernel functions in general, are promising measures to facilitate parsing proteins into domains and performing different structural analyses on proteins.
The size and interconnectedness of the protein graphs make them promising targets for diffusion kernels as measures of affinity between amino acids. The versatility of our method allows the implementation of future kernels with higher performance. The source code of the proposed method is accessible at https://github.com/taherimo/kludo. Also, the proposed method is available as a web application from https://cbph.ir/tools/kludo.
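To make the idea of a diffusion kernel as an affinity measure between residues concrete, the sketch below computes the standard graph diffusion (heat) kernel K = exp(-βL), where L is the graph Laplacian, using a truncated Taylor series. It is only an illustration under stated assumptions: the abstract does not specify the four kernels or their parameters, and the toy graph, the value of β and all function names are hypothetical.

```python
def graph_laplacian(adj):
    """L = D - A for a symmetric adjacency matrix with a zero diagonal."""
    n = len(adj)
    return [
        [(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Plain-Python matrix product, adequate for small toy graphs."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def diffusion_kernel(adj, beta=0.5, terms=40):
    """Approximate exp(-beta * L) by summing the first `terms` Taylor terms."""
    n = len(adj)
    neg = [[-beta * x for x in row] for row in graph_laplacian(adj)]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in kernel]  # current term, starting at the identity
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, neg)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel


# Toy "protein graph": two triangles (nodes 0-2 and 3-5) joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adj[i][j] = adj[j][i] = 1

K = diffusion_kernel(adj)
print(K[0][1] > K[0][4])  # affinity within a triangle vs. across the bridge
```

The kernel matrix could then be passed to any clustering method; for larger graphs an eigendecomposition of L is the usual way to evaluate the matrix exponential.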
Affiliation(s)
- Mohammad Taheri-Ledari
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Amirali Zandieh
  - Department of Biophysics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Seyed Peyman Shariatpanahi
  - Department of Biophysics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
- Changiz Eslahchi
  - Department of Computer and Data Sciences, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
  - School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
7
Wang L, Zhong H, Xue Z, Wang Y. Res-Dom: predicting protein domain boundary from sequence using deep residual network and Bi-LSTM. Bioinformatics Advances 2022; 2:vbac060. [PMID: 36699417 PMCID: PMC9710680 DOI: 10.1093/bioadv/vbac060] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/22/2022] [Revised: 07/01/2022] [Accepted: 08/30/2022] [Indexed: 01/28/2023]
Abstract
Motivation: Protein domains are the basic units of proteins that can fold, function and evolve independently. Protein domain boundary partition plays an important role in protein structure prediction, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Although many methods have been developed to predict domain boundaries from protein sequence over the past two decades, there is still much room for improvement. Results: In this article, a novel domain boundary prediction tool called Res-Dom was developed, which is based on a deep residual network, bidirectional long short-term memory (Bi-LSTM) and transfer learning. We used deep residual neural networks to extract higher-order residue-related information. In addition, we used a pre-trained protein language model called ESM to extract sequence embedded features, which can summarize sequence context information more abundantly. To improve the global representation of these deep residual networks, a Bi-LSTM network was also designed to consider long-range interactions between residues. Res-Dom was then tested on an independent test set including 342 proteins and generated correct single-domain and multi-domain classifications with a Matthews correlation coefficient of 0.668, which was 17.6% higher than the second-best compared method. For domain boundaries, the normalized domain overlapping score of Res-Dom was 0.849, which was 5% higher than the second-best compared method. Furthermore, Res-Dom required significantly less time than most of the recently developed state-of-the-art domain prediction methods. Availability and implementation: All source code, datasets and models are available at http://isyslab.info/Res-Dom/.
Affiliation(s)
- Lei Wang
  - Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
  - School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Haolin Zhong
  - School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Zhidong Xue
  - Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
  - School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Yan Wang
  - Institute of Medical Artificial Intelligence, Binzhou Medical University, Yantai, Shandong 264003, China
  - School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
8
Cretin G, Galochkina T, Vander Meersche Y, de Brevern AG, Postic G, Gelly JC. SWORD2: hierarchical analysis of protein 3D structures. Nucleic Acids Res 2022; 50:W732-W738. [PMID: 35580056 PMCID: PMC9252838 DOI: 10.1093/nar/gkac370] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 03/25/2022] [Revised: 04/19/2022] [Accepted: 04/29/2022] [Indexed: 11/27/2022]
Abstract
Understanding the functions and origins of proteins requires splitting these macromolecules into fragments that could be independent in terms of folding, activity, or evolution. For that purpose, structural domains are the typical level of analysis, but shorter segments, such as subdomains and supersecondary structures, are insightful as well. Here, we propose SWORD2, a web server for exploring how an input protein structure may be decomposed into ‘Protein Units’ that can be hierarchically assembled to delimit structural domains. For each partitioning solution, the relevance of the identified substructures is estimated through different measures. This multilevel analysis is achieved by integrating our previous work on domain delineation, ‘protein peeling’ and model quality assessment. We hope that SWORD2 will be useful to biologists searching for key regions in their proteins of interest and to bioinformaticians building datasets of protein structures. The web server is freely available online: https://www.dsimb.inserm.fr/SWORD2.
Affiliation(s)
- Gabriel Cretin
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
- Tatiana Galochkina
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
- Yann Vander Meersche
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
- Alexandre G de Brevern
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
- Guillaume Postic
  - Université Paris-Saclay, Univ Evry, IBISC, 91020 Evry-Courcouronnes, France
- Jean-Christophe Gelly
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
  - Laboratoire d'Excellence GR-Ex, 75015 Paris, France
9
Zhou K, Litfin T, Solayman M, Zhao H, Zhou Y, Zhan J. High-throughput split-protein profiling by combining transposon mutagenesis and regulated protein-protein interactions with deep sequencing. Int J Biol Macromol 2022; 203:543-552. [PMID: 35120933 DOI: 10.1016/j.ijbiomac.2022.01.173] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2021] [Revised: 01/27/2022] [Accepted: 01/28/2022] [Indexed: 11/05/2022]
Abstract
Splitting a protein at a position may lead to self- or assisted-complementary fragments depending on whether two resulting fragments can reconstitute to maintain the native function spontaneously or require assistance from two interacting molecules. Assisted complementary fragments with high contrast are an important tool for probing biological interactions. However, only a small number of assisted-complementary split-variants have been identified due to manual, labour-intensive optimization of a candidate gene. Here, we introduce a technique for high-throughput split-protein profiling (HiTS) that allows fast identification of self- and assisted complementary positions by transposon mutagenesis, a rapamycin-regulated FRB-FKBP protein interaction pair, and deep sequencing. We test this technique by profiling three antibiotic-resistant genes (fosfomycin-resistant gene, fosA3, erythromycin-resistant gene, ermB, and chloramphenicol-resistant gene, catI). Self- and assisted complementary fragments discovered by the high-throughput technique were subsequently confirmed by low-throughput testing of individual split positions. Thus, the HiTS technique provides a quicker alternative for discovering the proteins with suitable self- and assisted-complementary split positions when combining with a readout such as fluorescence, bioluminescence, cell survival, gene transcription or genome editing.
Affiliation(s)
- Kai Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Thomas Litfin
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
- Md Solayman
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
- Huijun Zhao
  - Centre for Clean Environment and Energy, Griffith University, Gold Coast Campus, Queensland 4222, Australia
- Yaoqi Zhou
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
- Jian Zhan
  - Institute for Glycomics and School of Information and Communication Technology, Griffith University, Parklands Dr Southport, QLD 4222, Australia
  - Institute for Systems and Physical Biology, Shenzhen Bay Laboratory, Shenzhen 518055, China
10
Mulnaes D, Schott-Verdugo S, Koenig F, Gohlke H. TopProperty: Robust Metaprediction of Transmembrane and Globular Protein Features Using Deep Neural Networks. J Chem Theory Comput 2021; 17:7281-7289. [PMID: 34663069 DOI: 10.1021/acs.jctc.1c00685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Transmembrane proteins (TMPs) are critical components of cellular life. However, due to experimental challenges, the number of experimentally resolved TMP structures is severely underrepresented in databases compared to their cellular abundance. Prediction of (per-residue) features such as transmembrane topology, membrane exposure, secondary structure, and solvent accessibility can be a useful starting point for experimental design or protein structure prediction but often requires different computational tools for different features or types of proteins. We present TopProperty, a metapredictor that predicts all of these features for TMPs or globular proteins. TopProperty is trained on datasets without bias toward a high number of sequence homologs, and the predictions are significantly better than the evaluated state-of-the-art primary predictors on all quality metrics. TopProperty eliminates the need for protein type- or feature-tailored tools, specifically for TMPs. TopProperty is freely available as a web server and standalone at https://cpclab.uni-duesseldorf.de/topsuite/.
Affiliation(s)
- Daniel Mulnaes
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Düsseldorf 40225, Germany
- Stephan Schott-Verdugo
  - John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Bioinformatics), and Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, Wilhelm-Johnen-Str., Jülich 52425, Germany
- Filip Koenig
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Düsseldorf 40225, Germany
- Holger Gohlke
  - Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Düsseldorf 40225, Germany
  - John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Bioinformatics), and Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, Wilhelm-Johnen-Str., Jülich 52425, Germany
11
Kinch LN, Schaeffer RD, Kryshtafovych A, Grishin NV. Target classification in the 14th round of the critical assessment of protein structure prediction (CASP14). Proteins 2021; 89:1618-1632. [PMID: 34350630 DOI: 10.1002/prot.26202] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 04/28/2021] [Revised: 06/21/2021] [Accepted: 07/11/2021] [Indexed: 12/14/2022]
Abstract
An evolutionary-based definition and classification of target evaluation units (EUs) is presented for the 14th round of the critical assessment of structure prediction (CASP14). CASP14 targets included 84 experimental models submitted by various structural groups (designated T1024-T1101). Targets were split into EUs based on the domain organization of available templates and performance of server groups. Several targets required splitting (19 out of 25 multidomain targets) due in part to observed conformation changes. All in all, 96 CASP14 EUs were defined and assigned to tertiary structure assessment categories (Topology-based FM or High Accuracy-based TBM-easy and TBM-hard) considering their evolutionary relationship to existing ECOD fold space: 24 family level, 50 distant homologs (H-group), 12 analogs (X-group), and 10 new folds. Principal component analysis and heatmap visualization of sequence and structure similarity to known templates as well as performance of servers highlighted trends in CASP14 target difficulty. The assigned evolutionary levels (i.e., H-groups) and assessment classes (i.e., FM) displayed overlapping clusters of EUs. Many viral targets diverged considerably from their template homologs and thus were more difficult for prediction than other homology-related targets. On the other hand, some targets did not have sequence-identifiable templates, but were predicted better than expected due to relatively simple arrangements of secondary structural elements. An apparent improvement in overall server performance in CASP14 further complicated traditional classification, which ultimately assigned EUs into high-accuracy modeling (27 TBM-easy and 31 TBM-hard), topology (23 FM), or both (15 FM/TBM).
Affiliation(s)
- Lisa N Kinch
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- R Dustin Schaeffer
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Nick V Grishin
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
  - Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
12
Mulnaes D, Golchin P, Koenig F, Gohlke H. TopDomain: Exhaustive Protein Domain Boundary Metaprediction Combining Multisource Information and Deep Learning. J Chem Theory Comput 2021; 17:4599-4613. [PMID: 34161735 DOI: 10.1021/acs.jctc.1c00129] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
Protein domains are independent, functional, and stable structural units of proteins. Accurate protein domain boundary prediction plays an important role in understanding protein structure and evolution, as well as for protein structure prediction. Current domain boundary prediction methods differ in terms of boundary definition, methodology, and training databases resulting in disparate performance for different proteins. We developed TopDomain, an exhaustive metapredictor, that uses deep neural networks to combine multisource information from sequence- and homology-based features of over 50 primary predictors. For this purpose, we developed a new domain boundary data set termed the TopDomain data set, in which the true annotations are informed by SCOPe annotations, structural domain parsers, human inspection, and deep learning. We benchmark TopDomain against 2484 targets with 3354 boundaries from the TopDomain test set and achieve F1 scores of 78.4% and 73.8% for multidomain boundary prediction within ±20 residues and ±10 residues of the true boundary, respectively. When examined on targets from CASP11-13 competitions, TopDomain achieves F1 scores of 47.5% and 42.8% for multidomain proteins. TopDomain significantly outperforms 15 widely used, state-of-the-art ab initio and homology-based domain boundary predictors. Finally, we implemented TopDomainTMC, which accurately predicts whether domain parsing is necessary for the target protein.
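The F1 scores above count a predicted boundary as correct when it falls within a tolerance window (±20 or ±10 residues) of a true boundary. A minimal sketch of such a tolerance-based score follows; the greedy one-to-one matching and the function name are assumptions for illustration, since the abstract does not describe TopDomain's exact matching procedure.

```python
def boundary_f1(predicted, true, tolerance=20):
    """Precision, recall and F1 for boundary positions matched within a window.

    Each true boundary can be matched by at most one prediction; predictions
    are processed in sequence order and take the first unmatched true boundary
    that lies within `tolerance` residues (a simple greedy scheme).
    """
    unmatched = sorted(true)
    true_positives = 0
    for position in sorted(predicted):
        hit = next((t for t in unmatched if abs(position - t) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical residue positions, used only to exercise the function.
print(boundary_f1(predicted=[100, 250, 400], true=[110, 390], tolerance=20))
print(boundary_f1(predicted=[100, 250, 400], true=[110, 390], tolerance=5))
```

Tightening the window can only remove matches, which is consistent with the lower scores reported for ±10 residues than for ±20 residues.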
Affiliation(s)
- Daniel Mulnaes
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Pegah Golchin
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Filip Koenig
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany
- Holger Gohlke
- Institut für Pharmazeutische und Medizinische Chemie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, 40225 Düsseldorf, Germany; John von Neumann Institute for Computing (NIC), Jülich Supercomputing Centre (JSC), Institute of Biological Information Processing (IBI-7: Structural Biochemistry) & Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425 Jülich, Germany

13
Laine E, Grudinin S. HOPMA: Boosting Protein Functional Dynamics with Colored Contact Maps. J Phys Chem B 2021; 125:2577-2588. [PMID: 33687221 DOI: 10.1021/acs.jpcb.0c11633] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/20/2023]
Abstract
In light of the recent very rapid progress in protein structure prediction, accessing the multitude of functional protein states is becoming more central than ever before. Indeed, proteins are flexible macromolecules, and they often perform their function by switching between different conformations. However, high-resolution experimental techniques such as X-ray crystallography and cryogenic electron microscopy can catch relatively few protein functional states. Many others are only accessible under physiological conditions in solution. Therefore, there is a pressing need to fill this gap with computational approaches. We present HOPMA, a novel method to predict protein functional states and transitions by using a modified elastic network model. The method exploits patterns in a protein contact map, taking its 3D structure as input, and excludes some disconnected patches from the elastic network. Combined with nonlinear normal mode analysis, this strategy boosts the protein conformational space exploration, especially when the input structure is highly constrained, as we demonstrate on a set of more than 400 transitions. Our results let us envision the discovery of new functional conformations, which were unreachable previously, starting from the experimentally known protein structures. The method is computationally efficient and available at https://github.com/elolaine/HOPMA and https://team.inria.fr/nano-d/software/nolb-normal-modes.
Affiliation(s)
- Elodie Laine
- CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Sorbonne Université, 75005 Paris, France
- Sergei Grudinin
- CNRS, Inria, Grenoble INP, LJK, Univ. Grenoble Alpes, 38000 Grenoble, France

14
Wang Y, Zhang H, Zhong H, Xue Z. Protein domain identification methods and online resources. Comput Struct Biotechnol J 2021; 19:1145-1153. [PMID: 33680357 PMCID: PMC7895673 DOI: 10.1016/j.csbj.2021.01.041] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 11/08/2020] [Revised: 01/25/2021] [Accepted: 01/26/2021] [Indexed: 01/03/2023]
Abstract
Protein domains are the basic units of proteins that can fold, function, and evolve independently. Knowledge of protein domains is critical for protein classification, understanding their biological functions, annotating their evolutionary mechanisms and protein design. Thus, over the past two decades, a number of protein domain identification approaches have been developed, and a variety of protein domain databases have also been constructed. This review divides protein domain prediction methods into two categories, namely sequence-based and structure-based. These methods are introduced in detail, and their advantages and limitations are compared. Furthermore, this review also provides a comprehensive overview of popular online protein domain sequence and structure databases. Finally, we discuss potential improvements of these prediction methods.
Affiliation(s)
- Yan Wang
- Institute of Medical Artificial Intelligence, Binzhou Medical College, Yantai, Shandong 264003, China
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Hang Zhang
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Haolin Zhong
- School of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Zhidong Xue
- School of Software Engineering, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

15
Zheng W, Zhou X, Wuyun Q, Pearce R, Li Y, Zhang Y. FUpred: detecting protein domains through deep-learning-based contact map prediction. Bioinformatics 2020; 36:3749-3757. [PMID: 32227201 DOI: 10.1093/bioinformatics/btaa217] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 11/06/2019] [Revised: 02/27/2020] [Accepted: 03/25/2020] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Protein domains are subunits that can fold and function independently. Correct domain boundary assignment is thus a critical step toward accurate protein structure and function analyses. There is, however, no efficient algorithm available for accurate domain prediction from sequence. The problem is particularly challenging for proteins with discontinuous domains, which consist of domain segments that are separated along the sequence. RESULTS: We developed a new algorithm, FUpred, which predicts protein domain boundaries utilizing contact maps created by deep residual neural networks coupled with coevolutionary precision matrices. The core idea of the algorithm is to retrieve domain boundary locations by maximizing the number of intra-domain contacts, while minimizing the number of inter-domain contacts from the contact maps. FUpred was tested on a large-scale dataset consisting of 2549 proteins and generated correct single- and multi-domain classifications with a Matthews correlation coefficient of 0.799, which was 19.1% (or 5.3%) higher than the best machine learning (or threading)-based method. For proteins with discontinuous domains, the domain boundary detection and normalized domain overlapping scores of FUpred were 0.788 and 0.521, respectively, which were 17.3% and 23.8% higher than the best control method. The results demonstrate a new avenue to accurately detect domain composition from sequence alone, especially for discontinuous, multi-domain proteins. AVAILABILITY AND IMPLEMENTATION: https://zhanglab.ccmb.med.umich.edu/FUpred. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
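The core idea stated in the abstract above (choose boundaries that keep contacts inside domains and few contacts between them) can be illustrated with a toy search for a single cut point. This is only a sketch under stated assumptions: the function `best_two_domain_cut`, the score `inter / (intra_a * intra_b)`, and the `min_len` parameter are hypothetical choices for illustration, not FUpred's published scoring function, and the sketch handles only one continuous two-domain split.

```python
def best_two_domain_cut(contacts, min_len=2):
    """Return (cut, score) for the best split of residues into [0, cut) and [cut, n).

    `contacts` is a symmetric n x n matrix (nested lists) of 0/1 contact flags.
    The score (an assumption of this sketch) is the number of inter-domain
    contacts divided by the product of the two intra-domain contact counts,
    so lower is better.
    """
    n = len(contacts)
    best_cut, best_score = None, float("inf")
    for cut in range(min_len, n - min_len + 1):
        intra_a = intra_b = inter = 0
        for i in range(n):
            for j in range(i + 1, n):  # each residue pair counted once
                if not contacts[i][j]:
                    continue
                if j < cut:        # both residues in the first domain
                    intra_a += 1
                elif i >= cut:     # both residues in the second domain
                    intra_b += 1
                else:              # pair straddles the cut
                    inter += 1
        if intra_a == 0 or intra_b == 0:
            continue  # skip splits that leave a domain without contacts
        score = inter / (intra_a * intra_b)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score


# Toy map: two fully connected 4-residue blocks joined by one contact (3-4)
toy = [[0] * 8 for _ in range(8)]
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                toy[i][j] = 1
toy[3][4] = toy[4][3] = 1

cut, score = best_two_domain_cut(toy)
print(cut, score)
```

Dividing by the product of the intra-domain counts is one simple way to stop the search from trivially placing the cut at a chain end, where the inter-domain count is zero.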
Affiliation(s)
- Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
- Xiaogen Zhou
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
- Qiqige Wuyun
- Computer Science and Engineering Department, Michigan State University, East Lansing, MI 48824, USA
- Robin Pearce
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109
- Yang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA

16
Farrell DP, Anishchenko I, Shakeel S, Lauko A, Passmore LA, Baker D, DiMaio F. Deep learning enables the atomic structure determination of the Fanconi Anemia core complex from cryoEM. IUCrJ 2020; 7:881-892. [PMID: 32939280 PMCID: PMC7467173 DOI: 10.1107/s2052252520009306] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/07/2020] [Accepted: 07/07/2020] [Indexed: 06/11/2023]
Abstract
Cryo-electron microscopy of protein complexes often leads to moderate resolution maps (4-8 Å), with visible secondary-structure elements but poorly resolved loops, making model building challenging. In the absence of high-resolution structures of homologues, only coarse-grained structural features are typically inferred from these maps, and it is often impossible to assign specific regions of density to individual protein subunits. This paper describes a new method for overcoming these difficulties that integrates predicted residue distance distributions from a deep-learned convolutional neural network, computational protein folding using Rosetta, and automated EM-map-guided complex assembly. We apply this method to a 4.6 Å resolution cryoEM map of Fanconi Anemia core complex (FAcc), an E3 ubiquitin ligase required for DNA interstrand crosslink repair, which was previously challenging to interpret as it comprises 6557 residues, only 1897 of which are covered by homology models. In the published model built from this map, only 387 residues could be assigned to the specific subunits with confidence. By building and placing into density 42 deep-learning-guided models containing 4795 residues not included in the previously published structure, we are able to determine an almost-complete atomic model of FAcc, in which 5182 of the 6557 residues were placed. The resulting model is consistent with previously published biochemical data, and facilitates interpretation of disease-related mutational data. We anticipate that our approach will be broadly useful for cryoEM structure determination of large complexes containing many subunits for which there are no homologues of known structure.
Affiliation(s)
- Daniel P. Farrell
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Shabih Shakeel
- MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
- Anna Lauko
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Frank DiMaio
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA

17
Shi Q, Chen W, Huang S, Jin F, Dong Y, Wang Y, Xue Z. DNN-Dom: predicting protein domain boundary from sequence alone by deep neural network. Bioinformatics 2020; 35:5128-5136. [PMID: 31197306 DOI: 10.1093/bioinformatics/btz464] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 02/18/2019] [Revised: 05/07/2019] [Accepted: 06/05/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Accurate delineation of protein domain boundaries plays an important role in protein engineering and structure prediction. Although machine-learning methods are widely used to predict domain boundaries, these approaches often ignore long-range interactions among residues, which have been proven to improve the prediction performance. However, how to simultaneously model the local and global interactions to further improve domain boundary prediction is still a challenging problem. RESULTS: This article employs a hybrid deep learning method that combines convolutional neural network and gated recurrent unit models for domain boundary prediction. It not only captures the local and non-local interactions, but also fuses these features for prediction. Additionally, we adopt balanced Random Forest for classification to deal with the high imbalance of samples and the high dimensionality of deep features. Experimental results show that our proposed approach (DNN-Dom) outperforms existing machine-learning-based methods for boundary prediction. We expect that DNN-Dom can be useful for assisting protein structure and function prediction. AVAILABILITY AND IMPLEMENTATION: The method is available as DNN-Dom Server at http://isyslab.info/DNN-Dom/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Qiang Shi
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Weiya Chen
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Siqi Huang
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Fanglin Jin
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yinghao Dong
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yan Wang
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Zhidong Xue
- School of Software Engineering and College of Life Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China

18
Hong SH, Joo K, Lee J. ConDo: protein domain boundary prediction using coevolutionary information. Bioinformatics 2020; 35:2411-2417. [PMID: 30500873 DOI: 10.1093/bioinformatics/bty973] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/16/2018] [Revised: 11/15/2018] [Accepted: 11/29/2018] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Domain boundary prediction is one of the most important problems in the study of protein structure and function. Many sequence-based domain boundary prediction methods are either template-based or machine learning (ML) based. ML-based methods often perform poorly due to their use of only local (i.e. short-range) features. These conventional features such as sequence profiles, secondary structures and solvent accessibilities are typically restricted to be within 20 residues of the domain boundary candidate. RESULTS: To address the performance of ML-based methods, we developed a new protein domain boundary prediction method (ConDo) that utilizes novel long-range features such as coevolutionary information in addition to the aforementioned local window features as inputs for ML. Toward this purpose, two types of coevolutionary information were extracted from multiple sequence alignment using direct coupling analysis: (i) partially aligned sequences, and (ii) correlated mutation information. Both the partially aligned sequence information and the modularity of residue-residue couplings possess long-range correlation information. AVAILABILITY AND IMPLEMENTATION: https://github.com/gicsaw/ConDo.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Keehyoung Joo
- Center for Advanced Computation, Korea Institute for Advanced Study, Korea
- Jooyoung Lee
- School of Computational Sciences; Center for Advanced Computation, Korea Institute for Advanced Study, Korea

19
Sagar A, Xue B. Recent Advances in Machine Learning Based Prediction of RNA-protein Interactions. Protein Pept Lett 2019; 26:601-619. [PMID: 31215361 DOI: 10.2174/0929866526666190619103853] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/16/2018] [Revised: 04/04/2019] [Accepted: 06/01/2019] [Indexed: 12/18/2022]
Abstract
The interactions between RNAs and proteins play critical roles in many biological processes. Therefore, characterizing these interactions is critical for mechanistic, biomedical, and clinical studies. Many experimental methods can be used to determine different aspects of RNA-protein interactions. However, because RNA-protein interactions are tissue-specific and condition-specific, and because these interactions are weak and frequently compete with each other, those experimental techniques cannot be fully exploited to discover the complete spectrum of RNA-protein interactions. To mitigate these issues, continuous efforts have been devoted to developing high-quality computational techniques to study the interactions between RNAs and proteins. Much important progress has been achieved with the application of novel techniques and strategies, such as machine learning techniques. In particular, with the development and application of CLIP techniques, more and more experimental data on RNA-protein interaction under specific biological conditions are available. These CLIP data altogether provide a rich source for developing advanced machine learning predictors. In this review, recent progress on computational predictors for RNA-protein interaction is summarized in the following aspects: dataset, prediction strategies, and input features. Possible future developments are also discussed at the end of the review.
Affiliation(s)
- Amit Sagar
- Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States
- Bin Xue
- Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States

20
Kinch LN, Kryshtafovych A, Monastyrskyy B, Grishin NV. CASP13 target classification into tertiary structure prediction categories. Proteins 2019; 87:1021-1036. [PMID: 31294862 DOI: 10.1002/prot.25775] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 04/15/2019] [Revised: 06/24/2019] [Accepted: 07/06/2019] [Indexed: 12/30/2022]
Abstract
Protein target structures for the Critical Assessment of Structure Prediction round 13 (CASP13) were split into evaluation units (EUs) based on their structural domains, the domain organization of available templates, and the performance of servers on whole targets compared to split target domains. Eighty targets were split into 112 EUs. The EUs were classified into categories suitable for assessment of high accuracy modeling (or template-based modeling [TBM]) and topology (or free modeling [FM]) based on target difficulty. Assignment into assessment categories considered the following criteria: (a) the evolutionary relationship of target domains to existing fold space as defined by the Evolutionary Classification of Protein Domains (ECOD) database; (b) the clustering of target domains using eight objective sequence, structure, and performance measures; and (c) the placement of target domains in a scatter plot of target difficulty against server performance used in the previous CASP. Generally, target domains with good server predictions had close template homologs and were classified as TBM. Alternately, targets with poor server predictions represent a mixture of fast evolving homologs, structure analogs, and new folds, and were classified as FM or FM/TBM overlap.
Affiliation(s)
- Lisa N Kinch
- Departments of Biophysics and Biochemistry, Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas
- Nick V Grishin
- Departments of Biophysics and Biochemistry, Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas

21
Litfin T, Yang Y, Zhou Y. SPOT-Peptide: Template-Based Prediction of Peptide-Binding Proteins and Peptide-Binding Sites. J Chem Inf Model 2019; 59:924-930. [DOI: 10.1021/acs.jcim.8b00777] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 01/01/2023]
Affiliation(s)
- Thomas Litfin
- School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Yuedong Yang
- School of Data and Computer Science, Sun-Yat Sen University, Guangzhou, Guangdong 510006, China
- Yaoqi Zhou
- School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia
- Institute for Glycomics, Griffith University, Southport, QLD 4222, Australia

22
Magyar C, Mentes A, Fichó E, Cserző M, Simon I. Physical Background of the Disordered Nature of "Mutual Synergetic Folding" Proteins. Int J Mol Sci 2018; 19:3340. [PMID: 30373142 PMCID: PMC6274838 DOI: 10.3390/ijms19113340] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/28/2018] [Revised: 10/17/2018] [Accepted: 10/21/2018] [Indexed: 01/16/2023]
Abstract
Intrinsically disordered proteins (IDPs) lack a well-defined 3D structure. Their disordered nature enables them to interact with several other proteins and to fulfil their vital biological roles, in most cases after coupled folding and binding. In this paper, we analyze IDPs involved in a new mechanism, mutual synergistic folding (MSF). These proteins define a new subset of IDPs. Recently we collected information on these complexes and created the Mutual Folding Induced by Binding (MFIB) database. These protein complexes exhibit considerable structural variation, and almost half of them are homodimers, but there is a significant amount of heterodimers and various kinds of oligomers. In order to understand the basic background of the disordered character of the monomers found in MSF complexes, the simplest part of the MFIB database, the homodimers are analyzed here. We conclude that MFIB homodimeric proteins have a larger solvent-accessible main-chain surface area on the contact surface of the subunits, when compared to globular homodimeric proteins. The main driving force of the dimerization is the mutual shielding of the water-accessible backbones and the formation of extra intermolecular interactions.
Affiliation(s)
- Csaba Magyar
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary.
- Anikó Mentes
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary.
- Erzsébet Fichó
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary.
- Miklós Cserző
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary.
- Department of Physiology, Faculty of Medicine, Semmelweis University, Tűzoltó u. 37-47, H-1094 Budapest, Hungary.
- István Simon
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Magyar Tudósok krt. 2, H-1117 Budapest, Hungary.

23
Navigating Among Known Structures in Protein Space. Methods Mol Biol 2018. [PMID: 30298400 DOI: 10.1007/978-1-4939-8736-8_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Present-day protein space is the result of 3.7 billion years of evolution, constrained by the underlying physicochemical qualities of the proteins. It is difficult to differentiate between evolutionary traces and effects of physicochemical constraints. Nonetheless, as a rule of thumb, instances of structural reuse, or focusing on structural similarity, are likely attributable to physicochemical constraints, whereas sequence reuse, or focusing on sequence similarity, may be more indicative of evolutionary relationships. Both types of relationships have been studied and can provide meaningful insights to protein biophysics and evolution, which in turn can lead to better algorithms for protein search, annotation, and maybe even design. In broad strokes, studies of protein space vary in the entities they represent, the similarity measure comparing these entities, and the representation used. The entities can be, for example, protein chains, domains, supra-domains, or smaller protein sub-parts denoted themes. The measures of similarity between the entities can be based on sequence, structure, function, or any combination of these. The representation can be global, encompassing the whole space, or local, focusing on a particular region surrounding protein(s) of interest. Global representations include lists of grouped proteins, protein networks, and maps. Networks are the abstraction that is derived most directly from the similarity data: each node is the protein entity (e.g., a domain), and edges connect similar domains. Selecting the entities, the similarity measure, and the abstraction are three intertwined decisions: the similarity measures allow us to identify the entities, and the selection of entities influences what is a meaningful similarity measure. Similarly, we seek entities that are related to each other in a way, for which a simple representation describes their relationships succinctly and accurately.
This chapter will cover studies that rely on different entities, similarity measures, and a range of representations to better understand protein structure space. Scholars may use publicly available navigators offering a global representation, and in particular the hierarchical classifications SCOP, CATH, and ECOD, or a local representation, which encompass structural alignment algorithms. Alternatively, scholars can configure their own navigator using existing tools. To demonstrate this DIY (do it yourself) approach for navigating in protein space, we investigate substrate-binding proteins. By presenting sequence similarities among this large and diverse protein family as a network, we can infer that one member (pdb ID 4ntl; of yet unknown function) may bind methionine and suggest a putative binding mechanism.

24
Kryshtafovych A, Adams PD, Lawson CL, Chiu W. Evaluation system and web infrastructure for the second cryo-EM model challenge. J Struct Biol 2018; 204:96-108. [PMID: 30017700 DOI: 10.1016/j.jsb.2018.07.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 04/21/2018] [Revised: 07/06/2018] [Accepted: 07/10/2018] [Indexed: 01/01/2023]
Abstract
An evaluation system and a web infrastructure were developed for the second cryo-EM model challenge. The evaluation system includes tools to validate stereo-chemical plausibility of submitted models, check their fit to the corresponding density maps, estimate their overall and per-residue accuracy, and assess their similarity to reference cryo-EM or X-ray structures as well as other models submitted in this challenge. The web infrastructure provides a convenient interface for analyzing models at different levels of detail. It includes interactively sortable tables of evaluation scores for different subsets of models and different sublevels of structure organization, and a suite of visualization tools facilitating model analysis. The results are publicly accessible at http://model-compare.emdatabank.org.
Affiliation(s)
- Andriy Kryshtafovych
- Genome Center, University of California, Davis, 451 Health Sciences Drive, Davis, CA 95616, USA.
- Paul D Adams
- Molecular Biophysics & Integrated Bioimaging, LBNL, CA 94720, USA; Department of Bioengineering, University of California Berkeley, CA 94720, USA
- Catherine L Lawson
- Institute for Quantitative Biomedicine and Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, 174 Frelinghuysen Road, Piscataway, NJ 08854, USA
- Wah Chiu
- Departments of Bioengineering and Microbiology & Immunology, Stanford University, Stanford, CA 94305-5447, USA; Division of CryoEM and Bioimaging, SLAC National Accelerator Laboratory, Menlo Park, CA 94025, USA

25
Abriata LA, Kinch LN, Tamò GE, Monastyrskyy B, Kryshtafovych A, Dal Peraro M. Definition and classification of evaluation units for tertiary structure prediction in CASP12 facilitated through semi-automated metrics. Proteins 2017; 86 Suppl 1:16-26. [DOI: 10.1002/prot.25403] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 07/11/2017] [Revised: 10/03/2017] [Accepted: 10/11/2017] [Indexed: 01/31/2023]
Affiliation(s)
- Luciano A. Abriata
- Institute of Bioengineering, School of Life Sciences; École Polytechnique Fédérale de Lausanne (EPFL); Lausanne Switzerland
- Lisa N. Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas; Dallas Texas
- Giorgio E. Tamò
- Institute of Bioengineering, School of Life Sciences; École Polytechnique Fédérale de Lausanne (EPFL); Lausanne Switzerland
- Matteo Dal Peraro
- Institute of Bioengineering, School of Life Sciences; École Polytechnique Fédérale de Lausanne (EPFL); Lausanne Switzerland

26
Postic G, Ghouzam Y, Chebrek R, Gelly JC. An ambiguity principle for assigning protein structural domains. Science Advances 2017; 3:e1600552. [PMID: 28097215 PMCID: PMC5235333 DOI: 10.1126/sciadv.1600552] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 03/18/2016] [Accepted: 11/28/2016] [Indexed: 05/20/2023]
Abstract
Ambiguity is the quality of being open to several interpretations. For an image, it arises when the contained elements can be delimited in two or more distinct ways, which may cause confusion. We postulate that it also applies to the analysis of protein three-dimensional structure, which consists in dividing the molecule into subunits called domains. Because different definitions of what constitutes a domain can be used to partition a given structure, the same protein may have different but equally valid domain annotations. However, knowledge and experience generally displace our ability to accept more than one way to decompose the structure of an object (in this case, a protein). This human bias in structure analysis is particularly harmful because it leads to ignoring potential avenues of research. We present an automated method capable of producing multiple alternative decompositions of protein structure (web server and source code available at www.dsimb.inserm.fr/sword/). Our innovative algorithm assigns structural domains through the hierarchical merging of protein units, which are evolutionarily preserved substructures that describe protein architecture at an intermediate level, between domain and secondary structure. To validate the use of these protein units for decomposing protein structures into domains, we set up an extensive benchmark made of expert annotations of structural domains and including state-of-the-art domain parsing algorithms. The relevance of our "multipartitioning" approach is shown through numerous examples of applications covering protein function, evolution, folding, and structure prediction. Finally, we introduce a measure for the structural ambiguity of protein molecules.
Affiliation(s)
- Guillaume Postic
- INSERM U1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
- Institut National de la Transfusion Sanguine, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
- Corresponding author. (G.P.); (J.-C.G.)
| | - Yassine Ghouzam
- INSERM U1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
- Institut National de la Transfusion Sanguine, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
| | - Romain Chebrek
- INSERM U1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
- Institut National de la Transfusion Sanguine, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
| | - Jean-Christophe Gelly
- INSERM U1134, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France
- Institut National de la Transfusion Sanguine, Paris, France
- Laboratory of Excellence GR-Ex, Paris, France
- Corresponding author. (G.P.); (J.-C.G.)
|
27
|
Kryshtafovych A, Monastyrskyy B, Fidelis K. CASP11 statistics and the prediction center evaluation system. Proteins 2016; 84 Suppl 1:15-9. [PMID: 26857434 PMCID: PMC5479680 DOI: 10.1002/prot.25005] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 11/19/2015] [Revised: 01/18/2016] [Accepted: 02/04/2016] [Indexed: 01/10/2023]
Abstract
We outline the role of the Protein Structure Prediction Center (predictioncenter.org) in conducting the CASP11 and CASP ROLL experiments, discuss the experiment statistics, and provide an overview of the present CASP infrastructure. The biggest changes compared to the previous CASPs are the implementation of the evaluation system incorporating practically all evaluation measures, statistical tests, and visualization tools historically used by the CASP assessors, the expansion of the infrastructure to incorporate new categories of contact-assisted and multimeric predictions, and the redesign of the assessors' web-workspace enabling assessments based on multiple measures for different group categories and target sets.
Affiliation(s)
- Andriy Kryshtafovych
- Protein Structure Prediction Center, Genome and Biomedical Sciences Facilities, University of California, Davis, California, 95616
| | - Bohdan Monastyrskyy
- Protein Structure Prediction Center, Genome and Biomedical Sciences Facilities, University of California, Davis, California, 95616
| | - Krzysztof Fidelis
- Protein Structure Prediction Center, Genome and Biomedical Sciences Facilities, University of California, Davis, California, 95616.
|
28
|
Wieninger SA, Ullmann GM. CoMoDo: Identifying Dynamic Protein Domains Based on Covariances of Motion. J Chem Theory Comput 2015; 11:2841-54. [DOI: 10.1021/acs.jctc.5b00150] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 12/20/2022]
Affiliation(s)
- Silke A. Wieninger
- Structural Biology/Bioinformatics, University of Bayreuth, Universitätsstrasse 30, BGI, 95447 Bayreuth, Germany
| | - G. Matthias Ullmann
- Structural Biology/Bioinformatics, University of Bayreuth, Universitätsstrasse 30, BGI, 95447 Bayreuth, Germany
|
29
|
Zhao H, Yang Y, von Itzstein M, Zhou Y. Carbohydrate-binding protein identification by coupling structural similarity searching with binding affinity prediction. J Comput Chem 2014; 35:2177-83. [PMID: 25220682 DOI: 10.1002/jcc.23730] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 02/27/2014] [Revised: 05/27/2014] [Accepted: 08/25/2014] [Indexed: 02/03/2023]
Abstract
Carbohydrate-binding proteins (CBPs) are potential biomarkers and drug targets. However, the interactions between carbohydrates and proteins are challenging to study experimentally and computationally because of their low binding affinity, high flexibility, and the lack of a linear sequence in carbohydrates as exists in RNA, DNA, and proteins. Here, we describe a structure-based function-prediction technique called SPOT-Struc that identifies carbohydrate-recognizing proteins and their binding amino acid residues using the structural alignment program SPalign and binding affinity scoring according to a knowledge-based statistical potential based on the distance-scaled finite-ideal gas reference state (DFIRE). The leave-one-out cross-validation of the method on 113 carbohydrate-binding domains and 3442 noncarbohydrate binding proteins yields a Matthews correlation coefficient of 0.56 for SPalign alone and 0.63 for SPOT-Struc (SPalign + binding affinity scoring) for CBP prediction. SPOT-Struc is a technique with high positive predictive value (79% correct predictions in all positive CBP predictions) and reasonable sensitivity (52% positive predictions in all CBPs). The sensitivity of the method changed only slightly when applied to 31 APO (unbound) structures found in the Protein Data Bank (14/31 for APO versus 15/31 for HOLO). The result of SPOT-Struc would not change significantly if highly homologous templates were used. SPOT-Struc predicted 19 out of 2076 structural genome targets as CBPs. In particular, one uncharacterized protein in Bacillus subtilis (1oq1A) was matched to galectin-9 from Mus musculus. Thus, SPOT-Struc is useful for uncovering novel carbohydrate-binding proteins. SPOT-Struc is available at http://sparks-lab.org.
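The three figures quoted in this abstract (Matthews correlation coefficient, positive predictive value, sensitivity) all derive from a single confusion matrix. The sketch below shows the standard definitions in Python. The function name `binary_metrics` and the example counts are hypothetical: the counts were chosen only so that the outputs land near the reported values, and they are not the paper's actual confusion matrix.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, ppv, mcc) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # fraction of true CBPs that were predicted
    ppv = tp / (tp + fp)           # fraction of positive predictions that are correct
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, ppv, mcc


# Hypothetical counts for 113 positives and 3442 negatives (illustration only).
sens, ppv, mcc = binary_metrics(tp=59, fp=16, tn=3426, fn=54)
print(f"sensitivity={sens:.2f} ppv={ppv:.2f} mcc={mcc:.2f}")
```

Because MCC uses all four cells of the matrix, it remains informative on a heavily imbalanced set such as this one, where accuracy alone would be dominated by the negatives.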
Affiliation(s)
- Huiying Zhao
- Indiana University School of Informatics, Indiana University Purdue University, Indianapolis, 719 Indiana Ave, Suite 319, Indianapolis, Indiana, 46202; Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, Indiana, 46202
|
30
|
Ansari ES, Eslahchi C, Pezeshk H, Sadeghi M. ProDomAs, protein domain assignment algorithm using center-based clustering and independent dominating set. Proteins 2014; 82:1937-46. [PMID: 24596179 DOI: 10.1002/prot.24547] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/05/2013] [Revised: 02/09/2014] [Accepted: 02/20/2014] [Indexed: 11/07/2022]
Abstract
Decomposition of structural domains is an essential task in classifying protein structures, predicting protein function, and many other proteomics problems. As the number of known protein structures in PDB grows exponentially, the need for accurate automatic domain decomposition methods becomes more pressing. In this article, we introduce a bottom-up algorithm for assigning protein domains using a graph theoretical approach. This algorithm is based on a center-based clustering approach. For constructing initial clusters, members of an independent dominating set for the graph representation of a protein are considered as the centers. A distance matrix is then defined for these clusters. To obtain final domains, these clusters are merged using the compactness principle of domains and a method similar to the neighbor-joining algorithm, subject to some thresholds. The thresholds are computed using a training set consisting of 50 protein chains. The algorithm is implemented in C++ and is named ProDomAs. To assess the performance of ProDomAs, its results are compared with seven automatic methods, against five publicly available benchmarks. The results show that ProDomAs outperforms the other methods on these benchmarks. The performance of ProDomAs is also evaluated against 6342 chains obtained from ASTRAL SCOP 1.71. ProDomAs is freely available at http://www.bioinf.cs.ipm.ir/software/prodomas.
Affiliation(s)
- Elnaz Saberi Ansari
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
|
31
|
Taylor TJ, Tai CH, Huang YJ, Block J, Bai H, Kryshtafovych A, Montelione GT, Lee B. Definition and classification of evaluation units for CASP10. Proteins 2013; 82 Suppl 2:14-25. [PMID: 24123179 DOI: 10.1002/prot.24434] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 06/11/2013] [Revised: 08/23/2013] [Accepted: 09/19/2013] [Indexed: 11/10/2022]
Abstract
For the 10th experiment on Critical Assessment of the techniques of protein Structure Prediction (CASP), the prediction target proteins were broken into independent evaluation units (EUs), which were then classified into template-based modeling (TBM) or free modeling (FM) categories. We describe here how the EUs were defined and classified, what issues arose in the process, and how we resolved them. EUs are frequently not the whole target proteins but the constituting structural domains. However, the assessors from CASP7 on combined more than one domain into 1 EU for some targets, which implied that the assessment also included evaluation of the prediction of the relative position and orientation of these domains. In CASP10, we followed and expanded this notion by defining multidomain EUs for a number of targets. These included 3 EUs, each made of two domains of familiar fold but arranged in a novel manner and for which the focus of evaluation was the interdomain arrangement. An EU was classified to the TBM category if a template could be found by sequence similarity searches and to FM if a structural template could not be found by structural similarity searches. The EUs that did not fall cleanly in either of these cases were classified case-by-case, often including consideration of the overall quality and characteristics of the predictions.
Affiliation(s)
- Todd J Taylor
- Laboratory of Molecular Biology, Center for Cancer Research National Cancer Institute National Institutes of Health, Bethesda, Maryland, 20892-4264
|
32
|
Song Y, DiMaio F, Wang RYR, Kim D, Miles C, Brunette T, Thompson J, Baker D. High-resolution comparative modeling with RosettaCM. Structure 2013; 21:1735-42. [PMID: 24035711 DOI: 10.1016/j.str.2013.08.005] [Citation(s) in RCA: 813] [Impact Index Per Article: 73.9] [Received: 03/15/2013] [Revised: 07/28/2013] [Accepted: 08/02/2013] [Indexed: 10/26/2022]
Abstract
We describe an improved method for comparative modeling, RosettaCM, which optimizes a physically realistic all-atom energy function over the conformational space defined by homologous structures. Given a set of sequence alignments, RosettaCM assembles topologies by recombining aligned segments in Cartesian space and building unaligned regions de novo in torsion space. The junctions between segments are regularized using a loop closure method combining fragment superposition with gradient-based minimization. The energies of the resulting models are optimized by all-atom refinement, and the most representative low-energy model is selected. The CASP10 experiment suggests that RosettaCM yields models with more accurate side-chain and backbone conformations than other methods when the sequence identity to the templates is greater than ∼15%.
Affiliation(s)
- Yifan Song
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
|
33
|
Arab SS, Gharamaleki MP, Pashandi Z, Mobasseri R. Putracer: a novel method for identification of continuous-domains in multi-domain proteins. J Bioinform Comput Biol 2013; 11:1340012. [PMID: 23427994 DOI: 10.1142/s021972001340012x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Computer-assisted assignment of protein domains is considered an important issue in structural bioinformatics. The exponential increase in the number of known three-dimensional protein structures and the significant role of proteins in biology, medicine and pharmacology illustrate the necessity of a reliable method to automatically detect structural domains as protein units. To this end, we have developed a program based on the accessible surface area (ASA) and the hydrogen-bond energy of the protein backbone (HBE). PUTracer (Protein Unit Tracer) is built on a fast top-down approach that cuts a chain into its domains (contiguous domains) with minimal change in ASA as well as HBE. Performance of the program was assessed on a comprehensive benchmark dataset of 124 protein chains, which is based on agreement among experts (e.g. CATH, SCOP) and was expanded to include structures with different types of domain combinations. An equal number of domains and at least 90% agreement in critical boundary accuracy were considered as correct assignment conditions. PUTracer assigned domains correctly in 81.45% of protein chains. Although low critical boundary accuracy in 18.55% of protein chains leads to incorrect assignments, adjusting the scales improves the performance up to 89.5%. We discuss here the success or failure of adjusting the scales, with supporting evidence. AVAILABILITY: PUTracer is available at http://bioinf.modares.ac.ir/software/PUTracer/
Affiliation(s)
- Seyed Shahriar Arab
- Department of Biophysics, Faculty of biological Sciences, Tarbiat Modares University-TMU, Tehran, Iran.
|
34
|
Ebina T, Umezawa Y, Kuroda Y. IS-Dom: a dataset of independent structural domains automatically delineated from protein structures. J Comput Aided Mol Des 2013; 27:419-26. [PMID: 23715893 DOI: 10.1007/s10822-013-9654-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/27/2012] [Accepted: 05/07/2013] [Indexed: 11/25/2022]
Abstract
Protein domains that can fold in isolation are significant targets in diverse areas of proteomics research as they are often readily analyzed by high-throughput methods. Here, we report IS-Dom, a dataset of Independent Structural Domains (ISDs) that are most likely to fold in isolation. IS-Dom was constructed by filtering domains from SCOP, CATH, and DomainParser using quantitative structural measures, which were calculated by estimating inter-domain hydrophobic clusters and hydrogen bonds from the full-length protein's atomic coordinates. The ISD detection protocol is fully automated, and all of the computed interactions are stored on the server, which enables rapid update of IS-Dom. We also prepared a standard IS-Dom using parameters optimized by maximizing Youden's index. The standard IS-Dom contained 54,860 ISDs, of which 25.5% had high sequence identity and termini overlap with a Protein Data Bank (PDB) cataloged sequence and are thus experimentally shown to fold in isolation [coined autonomously folded domains (AFDs)]. Furthermore, our ISD detection protocol missed less than 10% of the AFDs, which corroborated our protocol's ability to define structural domains that are able to fold independently. IS-Dom is available through the web server (http://domserv.lab.tuat.ac.jp/IS-Dom.html), and users can download the standard IS-Dom dataset, construct their own IS-Dom by interactively varying the parameters, or assess the structural independence of newly defined putative domains.
Affiliation(s)
- Teppei Ebina
- Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 12-24-16 Nakamachi, Koganei-shi, Tokyo 184-8588, Japan.
|
35
|
Affiliation(s)
- Rachel Kolodny
- Department of Computer Science, University of Haifa, Haifa 31905, Israel
| | - Leonid Pereyaslavets
- Department of Structural Biology, Stanford University, Stanford, California 94305
| | | | - Michael Levitt
- Department of Structural Biology, Stanford University, Stanford, California 94305
|
36
|
Sadowski MI. Prediction of protein domain boundaries from inverse covariances. Proteins 2013; 81:253-60. [PMID: 22987736 PMCID: PMC3563215 DOI: 10.1002/prot.24181] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 05/23/2012] [Revised: 08/10/2012] [Accepted: 09/04/2012] [Indexed: 01/04/2023]
Abstract
It has been known even since relatively few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and play the role of reusable functional modules found in many contexts. In many structural biology tasks, in particular structure prediction, it is of great use to be able to identify domains within the structure and analyze these regions separately. However, when using sequence data alone this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information including a kernel smoothing-based approach and methods based on building alpha-carbon models and compare performance with a length-based predictor, a homology search method and four published sequence-based predictors: DOMCUT, DomPRO, DLP-SVM, and SCOOBY-DOmain. We show that the kernel-smoothing method is significantly better than the other ab initio predictors when both single-domain and multidomain targets are considered and is not significantly different to the homology-based method. Considering only multidomain targets the kernel-smoothing method outperforms all of the published methods except DLP-SVM. The kernel smoothing method therefore represents a potentially useful improvement to ab initio domain prediction.
Affiliation(s)
- Michael I Sadowski
- MRC National Institute for Medical Research, The Ridgeway, Mill Hill, London, United Kingdom.
|
37
|
Mohazab AR, Plotkin SS. Polymer uncrossing and knotting in protein folding, and their role in minimal folding pathways. PLoS One 2013; 8:e53642. [PMID: 23365638 PMCID: PMC3554774 DOI: 10.1371/journal.pone.0053642] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/22/2012] [Accepted: 11/30/2012] [Indexed: 11/19/2022] Open
Abstract
We introduce a method for calculating the extent to which chain non-crossing is important in the most efficient, optimal trajectories or pathways for a protein to fold. This involves recording all unphysical crossing events of a ghost chain, and calculating the minimal uncrossing cost that would have been required to avoid such events. A depth-first tree search algorithm is applied to find minimal transformations to fold [Formula: see text], [Formula: see text], [Formula: see text], and knotted proteins. In all cases, the extra uncrossing/non-crossing distance is a small fraction of the total distance travelled by a ghost chain. Different structural classes may be distinguished by the amount of extra uncrossing distance, and the effectiveness of such discrimination is compared with other order parameters. It was seen that non-crossing distance over chain length provided the best discrimination between structural and kinetic classes. The scaling of non-crossing distance with chain length implies an inevitable crossover to entanglement-dominated folding mechanisms for sufficiently long chains. We further quantify the minimal folding pathways by collecting the sequence of uncrossing moves, which generally involve leg, loop, and elbow-like uncrossing moves, and rendering the collection of these moves over the unfolded ensemble as a multiple-transformation "alignment". The consensus minimal pathway is constructed and shown schematically for representative cases of an [Formula: see text], [Formula: see text], and knotted protein. An overlap parameter is defined between pathways; we find that [Formula: see text] proteins have minimal overlap indicating diverse folding pathways, knotted proteins are highly constrained to follow a dominant pathway, and [Formula: see text] proteins are somewhere in between. Thus we have shown how topological chain constraints can induce dominant pathway mechanisms in protein folding.
Affiliation(s)
- Ali R. Mohazab
- Department of Physics and Astronomy, University of British Columbia, Vancouver, B.C, Canada
| | - Steven S. Plotkin
- Department of Physics and Astronomy, University of British Columbia, Vancouver, B.C, Canada
|
38
|
Feldman HJ. Identifying structural domains of proteins using clustering. BMC Bioinformatics 2012; 13:286. [PMID: 23116496 PMCID: PMC3534501 DOI: 10.1186/1471-2105-13-286] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 01/26/2012] [Accepted: 10/29/2012] [Indexed: 11/16/2022] Open
Abstract
Background: Protein structures are comprised of modular elements known as domains. These units are used and re-used over and over in nature, and usually serve some particular function in the structure. Thus it is useful to be able to break up a protein of interest into its component domains, prior to similarity searching for example. Numerous computational methods exist for doing so, but most operate only on a single protein chain and many are limited to making a series of cuts to the sequence, while domains can and do span multiple chains. Results: This study presents a novel clustering-based approach to domain identification, which works equally well on individual chains or entire complexes. The method is simple and fast, taking only a few milliseconds to run, and works by clustering either vectors representing secondary structure elements, or buried alpha-carbon positions, using average-linkage clustering. Each resulting cluster corresponds to a domain of the structure. The method is competitive with others, achieving 70% agreement with SCOP on a large non-redundant data set, and 80% on a set more heavily weighted in multi-domain proteins on which both SCOP and CATH agree. Conclusions: It is encouraging that a basic method such as this performs nearly as well or better than some far more complex approaches. This suggests that protein domains are indeed for the most part simply compact regions of structure with a higher density of buried contacts within themselves than between each other. By representing the structure as a set of points or vectors in space, it allows us to break free of any artificial limitations that other approaches may depend upon.
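The abstract names average-linkage clustering of points in space as the core of the method. A minimal pure-Python sketch of that general technique follows. It is not the author's implementation: the function name `average_linkage_domains`, the stopping rule (a fixed distance `cutoff`), and the toy coordinates are all assumptions made for illustration, and the paper's own point selection and stopping criteria may differ.

```python
from itertools import combinations
from math import dist


def average_linkage_domains(points, cutoff):
    """Agglomerate 3D points; stop when the closest clusters exceed `cutoff`.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        # Average-linkage distance: mean of all pairwise point distances.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        (x, y), d = min(
            (((x, y), avg_dist(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Toy coordinates (hypothetical): two compact groups about 20 units apart.
toy = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (20, 0, 0), (21, 0, 0), (20, 1, 0)]
print(average_linkage_domains(toy, cutoff=5.0))
```

This naive version recomputes every pairwise average at each merge, which is adequate for a toy input; a practical implementation would cache or incrementally update the distance matrix.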
|
39
|
Yang Y, Zhan J, Zhao H, Zhou Y. A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding prediction. Proteins 2012; 80:2080-8. [PMID: 22522696 DOI: 10.1002/prot.24100] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Received: 02/06/2012] [Revised: 04/13/2012] [Accepted: 04/17/2012] [Indexed: 11/12/2022]
Abstract
A structure alignment program aligns two structures by optimizing a scoring function that measures structural similarity. It is highly desirable that such a scoring function be independent of the sizes of the proteins being compared, so that the significance of alignments across different sizes of aligned protein regions is comparable. Here, we developed a new score called SP-score that fixes the cutoff distance at 4 Å and removes the size dependence using a normalization prefactor. We further built a program called SPalign that optimizes SP-score for structure alignment. SPalign was applied to recognize proteins within the same structure fold and having the same function of DNA or RNA binding. For fold discrimination, SPalign improves sensitivity over TMalign for the chain-level comparison by 12% and over DALI for the domain-level comparison by 13% at the same specificity of 99.6%. The difference between TMalign and SPalign at the chain level is due to the inability of TMalign to detect single domain similarity between multidomain proteins. For recognizing nucleic acid binding proteins, SPalign consistently improves over TMalign by 12% and DALI by 31% in the average value of Matthews correlation coefficients for four datasets. SPalign with default setting is 14% faster than TMalign. SPalign is expected to be useful for function prediction and comparing structures with or without domains defined. The source code for SPalign and the server are available at http://sparks.informatics.iupui.edu.
Affiliation(s)
- Yuedong Yang
- Indiana University School of Informatics, Indiana University-Purdue University, Indianapolis, Indiana 46202, USA
|
40
|
Genoni A, Morra G, Colombo G. Identification of domains in protein structures from the analysis of intramolecular interactions. J Phys Chem B 2012; 116:3331-43. [PMID: 22384792 DOI: 10.1021/jp210568a] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Indexed: 12/31/2022]
Abstract
The subdivision of protein structures into smaller and independent structural domains has a fundamental importance in understanding protein evolution and function and in the development of protein classification methods as well as in the interpretation of experimental data. Due to the rapid growth in the number of solved protein structures, the need for devising new accurate algorithmic methods has become more and more urgent. In this paper, we propose a new computational approach that is based on the concept of domain as a compact and independent folding unit and on the analysis of the residue-residue energy interactions obtainable through classical all-atom force field calculations. In particular, starting from the analysis of the nonbonded interaction energy matrix associated with a protein, our method filters out and selects only those specific subsets of interactions that define possible independent folding nuclei within a complex protein structure. This allows grouping different protein fragments into energy clusters that are found to correspond to structural domains. The strategy has been tested using proper benchmark data sets, and the results have shown that the new approach is fast and reliable in determining the number of domains in a totally ab initio manner and without making use of any training set or prior knowledge of the systems under examination. Moreover, our method, by identifying the most relevant residues for the stabilization of each domain, may complement the results given by other classification techniques and may provide useful information to design and guide new experiments.
Affiliation(s)
- Alessandro Genoni
- Istituto di Chimica del Riconoscimento Molecolare, CNR, Via Mario Bianco 9, 20131 Milano, Italy.
|
41
|
Abstract
The wealth of available protein structural data provides unprecedented opportunity to study and better understand the underlying principles of protein folding and protein structure evolution. A key to achieving this lies in the ability to analyse these data and to organize them in a coherent classification scheme. Over the past years several protein classifications have been developed that aim to group proteins based on their structural relationships. Some of these classification schemes explore the concept of structural neighbourhood (structural continuum), whereas other utilize the notion of protein evolution and thus provide a discrete rather than continuum view of protein structure space. This chapter presents a strategy for classification of proteins with known three-dimensional structure. Steps in the classification process along with basic definitions are introduced. Examples illustrating some fundamental concepts of protein folding and evolution with a special focus on the exceptions to them are presented.
|
42
|
Zhao H, Yang Y, Zhou Y. Highly accurate and high-resolution function prediction of RNA binding proteins by fold recognition and binding affinity prediction. RNA Biol 2011; 8:988-96. [PMID: 21955494 DOI: 10.4161/rna.8.6.17813] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Indexed: 01/10/2023] Open
Abstract
A full understanding of the mechanism of post-transcriptional regulation requires more than simple two-state prediction (binding or not binding) for RNA binding proteins. Here we report a sequence-based technique dedicated to predicting complex structures of protein and RNA by combining fold recognition with binding affinity prediction. The method not only provides a highly accurate complex structure prediction (on average, 77% of residues are within 4 Å RMSD from native for the independent test set) but also achieves the best-performing two-state binding or non-binding prediction with an accuracy of 98%, precision of 84%, and Matthews correlation coefficient (MCC) of 0.62. Moreover, it predicts binding residues with an accuracy of 84%, precision of 66% and MCC value of 0.51. In addition, it has a success rate of 77% in predicting RNA binding types (mRNA, tRNA or rRNA). We further demonstrate that it achieves more than 10% improvement in either precision or sensitivity over PSI-BLAST, HHPRED and our previously developed structure-based technique. This method is expected to be useful for highly accurate genome-scale, high-resolution prediction of RNA-binding proteins and their complex structures. A web server (SPOT) is freely available for academic users at http://sparks.informatics.iupui.edu.
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University, Indianapolis, IN, USA
|
43
|
Stivala A, Wybrow M, Wirth A, Whisstock JC, Stuckey PJ. Automatic generation of protein structure cartoons with Pro-origami. Bioinformatics 2011; 27:3315-6. [PMID: 21994221 DOI: 10.1093/bioinformatics/btr575] [Citation(s) in RCA: 134] [Impact Index Per Article: 10.3] [Indexed: 11/14/2022] Open
Abstract
SUMMARY: Protein topology diagrams are 2D representations of protein structure that are particularly useful in understanding and analysing complex protein folds. Generating such diagrams presents a major problem in graph drawing, with automatic approaches often resulting in errors or uninterpretable results. Here we apply a breakthrough in diagram layout to protein topology cartoons, providing clear, accurate, interactive and editable diagrams, which are also an interface to a structural search method. AVAILABILITY: Pro-origami is available via a web server at http://munk.csse.unimelb.edu.au/pro-origami CONTACT: a.stivala@pgrad.unimelb.edu.au; pjs@csse.unimelb.edu.au.
Affiliation(s)
- Alex Stivala
- Department of Computer Science and Software Engineering, The University of Melbourne Parkville Campus, Victoria 3010, Australia.
44
Yang Y, Faraggi E, Zhao H, Zhou Y. Improving protein fold recognition and template-based modeling by employing probabilistic-based matching between predicted one-dimensional structural properties of query and corresponding native properties of templates. Bioinformatics 2011; 27:2076-82. [PMID: 21666270 DOI: 10.1093/bioinformatics/btr350] [Citation(s) in RCA: 245] [Impact Index Per Article: 18.8] [Indexed: 11/13/2022]
Abstract
MOTIVATION: In recent years, the development of single-method fold-recognition servers has lagged behind consensus and multiple-template techniques. However, a good consensus prediction relies on the accuracy of individual methods. This article reports our efforts to further improve a single-method fold-recognition technique called SPARKS by changing the alignment scoring function and incorporating the SPINE-X techniques, which improve the prediction of secondary structure, backbone torsion angles and solvent-accessible surface area. RESULTS: The new method, called SPARKS-X, was tested with the SALIGN benchmark for alignment accuracy, the Lindahl and SCOP benchmarks for fold recognition, and the CASP 9 blind test for structure prediction. The method is compared to several state-of-the-art techniques such as HHPRED and BoostThreader. Results show that SPARKS-X is one of the best single-method fold-recognition techniques. We further note that incorporating multiple templates and refinement in model building will likely further improve SPARKS-X. AVAILABILITY: The method is available as a SPARKS-X server at http://sparks.informatics.iupui.edu/
Affiliation(s)
- Yuedong Yang
- School of Informatics, Indiana University Purdue University, Indianapolis, IN 46202, USA
45
Esque J, Oguey C, de Brevern AG. Comparative Analysis of Threshold and Tessellation Methods for Determining Protein Contacts. J Chem Inf Model 2011; 51:493-507. [DOI: 10.1021/ci100195t] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/30/2022]
Affiliation(s)
- Jeremy Esque
- LPTM, CNRS UMR 8089, Université de Cergy Pontoise, 2 av. Adolphe Chauvin, 95302 Cergy-Pontoise, France
- INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris Cedex 15, France
- Christophe Oguey
- LPTM, CNRS UMR 8089, Université de Cergy Pontoise, 2 av. Adolphe Chauvin, 95302 Cergy-Pontoise, France
- Alexandre G. de Brevern
- INSERM UMR-S 665, Dynamique des Structures et Interactions des Macromolécules Biologiques (DSIMB), Université Paris Diderot, Paris 7, INTS, 6, rue Alexandre Cabanel, 75739 Paris Cedex 15, France
46
Zhao H, Yang Y, Zhou Y. Structure-based prediction of RNA-binding domains and RNA-binding sites and application to structural genomics targets. Nucleic Acids Res 2010; 39:3017-25. [PMID: 21183467 PMCID: PMC3082898 DOI: 10.1093/nar/gkq1266] [Citation(s) in RCA: 94] [Impact Index Per Article: 6.7] [Indexed: 12/31/2022]
Abstract
Mechanistic understanding of many key cellular processes often involves identification of RNA binding proteins (RBPs) and RNA binding sites in two separate steps. Here, they are predicted simultaneously by structural alignment to known protein-RNA complex structures followed by binding assessment with a DFIRE-based statistical energy function. This method achieves 98% accuracy and 91% precision for predicting RBPs and 93% accuracy and 78% precision for predicting RNA-binding amino-acid residues for a large benchmark of 212 RNA binding and 6761 non-RNA binding domains (leave-one-out cross-validation). Additional tests revealed that the method makes no false positive prediction from 311 DNA binding domains but correctly detects six domains binding with both DNA and RNA. In addition, it correctly identified 31 of 75 unbound RNA-binding domains with 92% accuracy and 65% precision for predicted binding residues and achieved 86% success rate in its application to SCOP RNA binding domain superfamily (Structural Classification Of Proteins). It further predicts 25 targets as RBPs in 2076 structural genomics targets: 20 of 25 predicted ones (80%) are putatively RNA binding. The superior performance over existing methods indicates the importance of dividing structures into domains, using a Z-score to measure relative structural similarity, and a statistical energy function to measure protein-RNA binding affinity.
Affiliation(s)
- Huiying Zhao
- School of Informatics, Indiana University Purdue University and Center for Computational Biology and Bioinformatics, Indiana University School of Medicine, Indianapolis, IN 46202, USA
47
Alden K, Veretnik S, Bourne PE. dConsensus: a tool for displaying domain assignments by multiple structure-based algorithms and for construction of a consensus assignment. BMC Bioinformatics 2010; 11:310. [PMID: 20529369 PMCID: PMC2897830 DOI: 10.1186/1471-2105-11-310] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 01/14/2010] [Accepted: 06/09/2010] [Indexed: 11/29/2022]
Abstract
Background: Partitioning of a protein into structural components, known as domains, is an important initial step in protein classification and for functional and evolutionary studies. While the systematic assignments of domains by human experts exist (CATH and SCOP), the introduction of high throughput technologies for structure determination threatens to overwhelm expert approaches. A variety of algorithmic methods have been developed to expedite this process, allowing almost instant structural decomposition into domains. The performance of algorithmic methods can approach 85% agreement on the number of domains with the consensus reached by experts. However, each algorithm takes a somewhat different conceptual approach, each with unique strengths and weaknesses. Currently there is no simple way to automatically compare assignments from different structure-based domain assignment methods, thereby providing a comprehensive understanding of possible structure partitioning as well as providing some insight into the tendencies of particular algorithms. Most importantly, a consensus assignment drawn from multiple assignment methods can provide a singular and presumably more accurate view. Results: We introduce dConsensus (http://pdomains.sdsc.edu/dConsensus), a web resource that displays the results of calculations from multiple algorithmic methods and generates a domain assignment consensus with an associated reliability score. Domain assignments from seven structure-based algorithms (PDP, PUU, DomainParser2, NCBI method, DHcL, DDomains and Dodis) are available for analysis and comparison alongside assignments made by expert methods. The assignments are available for all protein chains in the Protein Data Bank (PDB). A consensus domain assignment is built by either allowing each algorithm to contribute equally (simple approach) or by weighting the contribution of each method by its prior performance and observed tendencies. An analysis of secondary structure around domain and fragment boundaries is also available for display and further analysis. Conclusion: dConsensus provides a comprehensive assignment of protein domains. For the first time, seven algorithmic methods are brought together with no need to access each method separately via a webserver or local copy of the software. This aggregation permits a consensus domain assignment to be computed. Comparison viewing of the consensus and choice methods provides the user with insights into the fundamental units of protein structure so important to the study of evolutionary and functional relationships.
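The difference between the equal-contribution and weighted consensus described in this abstract can be illustrated with a simple vote on the number of domains. This is only a sketch of the general idea, not the dConsensus algorithm: the function `consensus_domain_count`, the example assignments, the weights and the tie-breaking rule (prefer fewer domains) are all assumptions made for illustration.

```python
from collections import defaultdict


def consensus_domain_count(assignments, weights=None):
    """Vote on the number of domains predicted for one chain.

    assignments: dict mapping method name -> predicted number of domains.
    weights: optional dict mapping method name -> vote weight
             (hypothetical values; methods not listed get weight 1.0).
    Returns (consensus_count, support), where support is the fraction of
    total vote weight behind the consensus.
    """
    tally = defaultdict(float)
    for method, n_domains in assignments.items():
        w = 1.0 if weights is None else weights.get(method, 1.0)
        tally[n_domains] += w
    total = sum(tally.values())
    # Highest weight wins; ties go to the smaller domain count (an assumption).
    best = max(tally, key=lambda n: (tally[n], -n))
    return best, tally[best] / total


if __name__ == "__main__":
    # Made-up predictions for a single chain from the seven named methods.
    votes = {"PDP": 2, "PUU": 2, "DomainParser2": 3, "NCBI": 2,
             "DHcL": 3, "DDomains": 2, "Dodis": 1}
    print(consensus_domain_count(votes))
    print(consensus_domain_count(votes, {"DomainParser2": 3.0, "DHcL": 3.0}))
```

With equal votes the made-up example settles on two domains; giving two methods a higher weight shifts the consensus to three, which shows how prior performance can change the outcome. The support fraction here is only a stand-in for a reliability score.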
Affiliation(s)
- Kieran Alden
- York Centre for Complex Systems Analysis (YCCSA), University of York, Heslington, York, YO10 5DD, UK
48
An automated procedure for detecting protein folds from sub-nanometer resolution electron density. J Struct Biol 2009; 170:513-21. [PMID: 20026407 DOI: 10.1016/j.jsb.2009.12.014] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 11/17/2009] [Revised: 12/09/2009] [Accepted: 12/15/2009] [Indexed: 11/24/2022]
Abstract
The use of sub-nanometer resolution electron density as spatial constraints for de novo and ab initio structure prediction requires knowledge of protein boundaries to accurately segment the electron density for the prediction algorithms. Here we present a procedure where even poorly segmented density can be used to determine the fold of the protein. The method is automated, fast, capable of searching for multiple copies of a protein fold, and accessible to densities encompassing more than a thousand residues. The automation is particularly powerful as it allows the procedure to take full advantage of the expanding repository in the Protein Data Bank. We have tested the method on nine segmented sub-nanometer image reconstruction electron densities. The method successfully identifies the correct fold for the six densities for which an atomic structure is known, identifies a fold that agrees with prior structural data, a fold that agrees with predictions from the Fold & Function Assignment server, and a fold that correlates with secondary structure prediction. The identified folds in the last three examples can be used as templates for comparative modeling of the bacteriophage P22 tail-machine (a 3MDa complex composed of 39 protein subunits).
49
Gao M, Skolnick J. A threading-based method for the prediction of DNA-binding proteins with application to the human genome. PLoS Comput Biol 2009; 5:e1000567. [PMID: 19911048 PMCID: PMC2770119 DOI: 10.1371/journal.pcbi.1000567] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Received: 05/26/2009] [Accepted: 10/16/2009] [Indexed: 11/18/2022]
Abstract
Diverse mechanisms for DNA-protein recognition have been elucidated in numerous atomic complex structures from various protein families. These structural data provide an invaluable knowledge base not only for understanding DNA-protein interactions, but also for developing specialized methods that predict the DNA-binding function from protein structure. While such methods are useful, a major limitation is that they require an experimental structure of the target as input. To overcome this obstacle, we develop a threading-based method, DNA-Binding-Domain-Threader (DBD-Threader), for the prediction of DNA-binding domains and associated DNA-binding protein residues. Our method, which uses a template library composed of DNA-protein complex structures, requires only the target protein's sequence. In our approach, fold similarity and DNA-binding propensity are employed as two functional discriminating properties. In benchmark tests on 179 DNA-binding and 3,797 non-DNA-binding proteins, using templates whose sequence identity is less than 30% to the target, DBD-Threader achieves a sensitivity/precision of 56%/86%. This performance is considerably better than the standard sequence comparison method PSI-BLAST and is comparable to DBD-Hunter, which requires an experimental structure as input. Moreover, for over 70% of predicted DNA-binding domains, the backbone Root Mean Square Deviations (RMSDs) of the top-ranked structural models are within 6.5 Å of their experimental structures, with their associated DNA-binding sites identified at satisfactory accuracy. Additionally, DBD-Threader correctly assigned the SCOP superfamily for most predicted domains. To demonstrate that DBD-Threader is useful for automatic function annotation on a large scale, DBD-Threader was applied to 18,631 protein sequences from the human genome; 1,654 proteins are predicted to have DNA-binding function. Comparison with existing Gene Ontology (GO) annotations suggests that approximately 30% of our predictions are new. Finally, we present some interesting predictions in detail. In particular, it is estimated that approximately 20% of classic zinc finger domains play a functional role not related to direct DNA-binding.
Affiliation(s)
- Mu Gao
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia, United States of America
- Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, Atlanta, Georgia, United States of America
50
Faure G, Bornot A, de Brevern AG. Analysis of protein contacts into Protein Units. Biochimie 2009; 91:876-87. [PMID: 19383526 DOI: 10.1016/j.biochi.2009.04.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Received: 07/24/2008] [Accepted: 04/13/2009] [Indexed: 11/18/2022]
Abstract
The three-dimensional structures of proteins underpin their biological functions. Their folds are maintained by inter-residue interactions, which are a main focus of efforts to understand the mechanisms of protein folding and stability. Furthermore, protein structures can be composed of single or multiple functional domains that can fold and function independently. Hence, dividing a protein into domains is useful for accurate structure and function determination. In previous studies, we examined protein contact properties according to different definitions and developed a novel methodology named Protein Peeling. Within protein structures, Protein Peeling identifies small, successive compact units along the sequence, called protein units (PUs). The cutting done by Protein Peeling maximizes the number of contacts within the PUs and minimizes the number of contacts between them. This method is thus a relevant tool for protein folding research, particularly with regard to the hierarchical model proposed by George Rose. Here, we analyze the PUs in detail at different levels of cutting, using a non-redundant protein databank. The distribution of PU sizes, the number of PUs and their accessibility are examined to determine their common and distinguishing features. Moreover, we highlight the preferential amino acid interactions inside and between PUs. Our results show that PUs are clearly an intermediate level between secondary structures and protein structural domains.
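The cutting criterion stated in this abstract (many contacts within units, few between them) can be illustrated for the simplest case of one cut into two sequence-contiguous units. This is a simplified sketch and not the published Protein Peeling criterion: the contact list, the score `intra - inter` and the names `split_contacts` and `best_single_cut` are assumptions for illustration.

```python
def split_contacts(contacts, cut):
    """Count contacts inside and across two units separated at `cut`.

    contacts: iterable of (i, j) residue-index pairs in contact.
    cut: index of the first residue of the second unit.
    Returns (intra, inter).
    """
    intra = inter = 0
    for i, j in contacts:
        # A contact is internal when both residues lie on the same side.
        if (i < cut) == (j < cut):
            intra += 1
        else:
            inter += 1
    return intra, inter


def best_single_cut(contacts, n_residues, min_size=1):
    """Scan all cut positions and keep the one maximizing intra - inter.

    min_size is the smallest allowed unit length (a hypothetical parameter).
    Returns (cut, score), or None if no cut position is allowed.
    """
    contacts = list(contacts)
    best = None
    for cut in range(min_size, n_residues - min_size + 1):
        intra, inter = split_contacts(contacts, cut)
        score = intra - inter
        if best is None or score > best[1]:
            best = (cut, score)
    return best


if __name__ == "__main__":
    # Toy 6-residue chain: two tight triplets joined by one contact (2, 3).
    toy = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(best_single_cut(toy, n_residues=6, min_size=2))
```

Applying the same scan recursively to each resulting unit would give successive levels of cutting, in the spirit of the hierarchical analysis described above.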
Affiliation(s)
- Guilhem Faure
- INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot - Paris 7, case 7113, 2 place Jussieu, 75251 Paris, France