1
Bordin N, Dallago C, Heinzinger M, Kim S, Littmann M, Rauer C, Steinegger M, Rost B, Orengo C. Novel machine learning approaches revolutionize protein knowledge. Trends Biochem Sci 2023; 48:345-359. [PMID: 36504138 PMCID: PMC10570143 DOI: 10.1016/j.tibs.2022.11.001] [Citation(s) in RCA: 32] [Impact Index Per Article: 16.0] [Received: 07/14/2022] [Revised: 10/24/2022] [Accepted: 11/17/2022] [Indexed: 12/10/2022]
Abstract
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
Affiliation(s)
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
- Christian Dallago
  - Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; VantAI, 151 W 42nd Street, New York, NY 10036, USA
- Michael Heinzinger
  - Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Stephanie Kim
  - School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Maria Littmann
  - Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
- Clemens Rauer
  - Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea; Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Burkhard Rost
  - Technical University of Munich (TUM) Department of Informatics, Bioinformatics and Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching/Munich, Germany; TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, Gower St, WC1E 6BT London, UK
2
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:7112-7127. [PMID: 34232869 DOI: 10.1109/tpami.2021.3095381] [Citation(s) in RCA: 549] [Impact Index Per Article: 183.0] [Indexed: 05/10/2023]
Abstract
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and TPU Pod up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without multiple sequence alignments (MSAs) or evolutionary information thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
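The distinction the abstract draws between per-residue (per-token) and per-protein (pooling) predictions can be illustrated with a minimal sketch. It assumes mean pooling over residues, which the abstract does not specify; the function name `mean_pool` and the toy numbers are hypothetical and are not taken from the ProtTrans code base.

```python
from typing import List


def mean_pool(per_residue: List[List[float]]) -> List[float]:
    """Average per-residue vectors into one fixed-length per-protein vector.

    Assumption: simple mean pooling. The abstract only says "pooling".
    """
    if not per_residue:
        raise ValueError("need at least one residue embedding")
    dim = len(per_residue[0])
    if any(len(vec) != dim for vec in per_residue):
        raise ValueError("all residue embeddings must have the same length")
    n = len(per_residue)
    # Column-wise mean: one value per embedding dimension.
    return [sum(vec[i] for vec in per_residue) / n for i in range(dim)]


# Toy protein with 3 residues and 2-dimensional embeddings (made-up values).
toy_embeddings = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
pooled = mean_pool(toy_embeddings)
print(pooled)  # [2.0, 2.0]
```

A per-residue task such as secondary structure prediction would consume `toy_embeddings` row by row, while a per-protein task such as sub-cellular location would consume the single `pooled` vector regardless of sequence length.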
3
Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, Bhowmik D, Rost B. ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022. [PMID: 34232869 DOI: 10.1101/2020.07.12.199554] [Citation(s) in RCA: 71] [Impact Index Per Article: 23.7] [Indexed: 05/15/2023]
4
Genomics-based strategies toward the identification of a Z-ISO carotenoid biosynthetic enzyme suitable for structural studies. Methods Enzymol 2022; 671:171-205. [PMID: 35878977 DOI: 10.1016/bs.mie.2021.12.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Abstract
Over the past 20 years, structural genomics efforts have proven enormously successful for the determination of integral membrane protein structures, particularly for those of prokaryotic origin. However, traditional genomic expansion screens have included up to hundreds of targets, necessitating the use of robotics and other automation not available to most laboratories. Moreover, such large-scale screens of eukaryotic targets are not easily performed at such a scale. To have broader appeal, traditional structural genomic approaches need to be modified and improved such that they are feasible for most laboratories and especially so for proteins from eukaryotic organisms. One such refinement, termed "microgenomic expansion," has been recently described. This approach improves the process of target selection by making target screening a two-step process, with a minimal number of targets tested at each step. Microgenomic expansion methods are applied here theoretically to a project that has the objective of acquiring a structure for the plant 15-cis-ζ-carotene isomerase, Z-ISO.
5
Sen N, Anishchenko I, Bordin N, Sillitoe I, Velankar S, Baker D, Orengo C. Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs. Brief Bioinform 2022; 23:bbac187. [PMID: 35641150 PMCID: PMC9294430 DOI: 10.1093/bib/bbac187] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 12/05/2021] [Revised: 04/23/2022] [Accepted: 04/27/2022] [Indexed: 12/12/2022]
Abstract
Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Data Bank. We noticed that the model quality was higher and the root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Using models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
Affiliation(s)
- Neeladri Sen
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Ivan Anishchenko
  - Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
  - Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Sameer Velankar
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
  - Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
  - Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
6
Heinzinger M, Littmann M, Sillitoe I, Bordin N, Orengo C, Rost B. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genom Bioinform 2022; 4:lqac043. [PMID: 35702380 PMCID: PMC9188115 DOI: 10.1093/nargab/lqac043] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 11/22/2021] [Revised: 03/25/2022] [Accepted: 05/17/2022] [Indexed: 12/23/2022]
Abstract
Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better able to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that ascertained performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
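The idea of embedding-based annotation transfer (EAT) can be sketched as a nearest-neighbour lookup: the query inherits the annotation of the closest annotated protein in embedding space. The sketch below assumes Euclidean distance and uses made-up labels and 2-dimensional vectors; the function `eat_transfer` is hypothetical and is not the API of the Rostlab/EAT repository.

```python
import math
from typing import List, Sequence, Tuple


def eat_transfer(
    query: Sequence[float],
    lookup: List[Tuple[str, Sequence[float]]],
) -> Tuple[str, float]:
    """Return the annotation of the nearest lookup embedding and its distance.

    Assumption: Euclidean distance. The abstract does not name the metric.
    """
    if not lookup:
        raise ValueError("lookup set must not be empty")
    best_label, best_dist = "", math.inf
    for label, embedding in lookup:
        dist = math.dist(query, embedding)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy lookup set: hypothetical annotation labels with made-up embeddings.
lookup_set = [
    ("superfamily_A", [0.0, 0.0]),
    ("superfamily_B", [3.0, 4.0]),
]
print(eat_transfer([0.5, 0.0], lookup_set))  # ('superfamily_A', 0.5)
```

Returning the distance alongside the label lets a caller reject transfers whose nearest hit is too far away. No alignment is computed at any point, which is consistent with the speed advantage the abstract reports.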
Affiliation(s)
- Michael Heinzinger
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, 85748 Garching, Germany
- Maria Littmann
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Burkhard Rost
  - TUM (Technical University of Munich) Dept Informatics, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching, Germany & TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
7
Rahman ASMZ, Timmerman L, Gallardo F, Cardona ST. Identification of putative essential protein domains from high-density transposon insertion sequencing. Sci Rep 2022; 12:962. [PMID: 35046497 PMCID: PMC8770471 DOI: 10.1038/s41598-022-05028-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/03/2021] [Accepted: 12/29/2021] [Indexed: 12/24/2022]
Abstract
A first clue to gene function can be obtained by examining whether a gene is required for life in certain standard conditions, that is, whether a gene is essential. In bacteria, essential genes are usually identified by high-density transposon mutagenesis followed by sequencing of insertion sites (Tn-seq). These studies assign the term "essential" to whole genes rather than the protein domain sequences that encode the essential functions. However, genes can code for multiple protein domains that evolve their functions independently. Therefore, when essential genes code for more than one protein domain, only one of them could be essential. In this study, we defined this subset of genes as "essential domain-containing" (EDC) genes. Using a Tn-seq data set built in Burkholderia cenocepacia K56-2, we developed an in silico pipeline to identify EDC genes and the essential protein domains they encode. We found forty candidate EDC genes and demonstrated growth defect phenotypes using CRISPR interference (CRISPRi). This analysis included two knockdowns of genes encoding the protein domains of unknown function DUF2213 and DUF4148. These putative essential domains are conserved in more than two hundred bacterial species, including human and plant pathogens. Together, our study suggests that essentiality should be assigned to individual protein domains rather than genes, contributing to a first functional characterization of protein domains of unknown function.
Affiliation(s)
- Lukas Timmerman
  - Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada
- Flyn Gallardo
  - Department of Microbiology, University of Manitoba, Winnipeg, MB, Canada
- Silvia T Cardona
  - Department of Microbiology, University of Manitoba, Winnipeg, MB, Canada
  - Department of Medical Microbiology and Infectious Diseases, University of Manitoba, Winnipeg, Canada
8
Fine Sampling of Sequence Space for Membrane Protein Structural Biology. J Mol Biol 2021; 433:167055. [PMID: 34022208 DOI: 10.1016/j.jmb.2021.167055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/26/2021] [Revised: 05/12/2021] [Accepted: 05/12/2021] [Indexed: 11/22/2022]
Abstract
We describe an enhancement of traditional genomics-based approaches to improve the success of structure determination of membrane proteins. Following a broad screen of sequence space to identify initial expression-positive targets, we employ a second step to select orthologs with closely related sequences to these hits. We demonstrate that a greater percentage of these latter targets express well and are stable in detergent, increasing the likelihood of identifying candidates that will ultimately yield structural information.
9
Schafferhans A, O'Donoghue SI, Heinzinger M, Rost B. Dark Proteins Important for Cellular Function. Proteomics 2019; 18:e1800227. [PMID: 30318701 DOI: 10.1002/pmic.201800227] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/19/2018] [Revised: 09/14/2018] [Indexed: 01/08/2023]
Abstract
Despite substantial and successful projects for structural genomics, many proteins remain for which neither experimental structures nor homology-based models are known for any part of the amino acid sequence. These have been called "dark proteins," in contrast to non-dark proteins, in which at least part of the sequence has a known or inferred structure. It has been hypothesized that non-dark proteins may be more abundantly expressed than dark proteins, which are known to have far fewer sequence relatives. Surprisingly, the opposite has been observed: human dark and non-dark proteins had quite similar levels of expression, in terms of both mRNA and protein abundance. Such high levels of expression strongly indicate that dark proteins, as a group, are important for cellular function. This is remarkable, given how carefully structural biologists have focused on proteins crucial for function, and highlights the important challenge posed by dark proteins in future research.
Affiliation(s)
- Andrea Schafferhans
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748 Garching, Germany; Department of Bioengineering Sciences, University of Applied Sciences, Freising, Germany
- Seán I O'Donoghue
  - CSIRO Data61, Sydney, Australia; Division of Genomics & Epigenetics, Garvan Institute of Medical Research, Sydney, Australia; School of Biotechnology & Biomolecular Sciences, University of New South Wales (UNSW), Sydney, NSW, Australia
- Michael Heinzinger
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748 Garching, Germany
- Burkhard Rost
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748 Garching, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
10
Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat Commun 2019; 10:3977. [PMID: 31484923 PMCID: PMC6726615 DOI: 10.1038/s41467-019-11994-0] [Citation(s) in RCA: 117] [Impact Index Per Article: 19.5] [Received: 04/01/2019] [Accepted: 08/14/2019] [Indexed: 01/30/2023]
Abstract
The inapplicability of amino acid covariation methods to small protein families has limited their use for structural annotation of whole genomes. Recently, deep learning has shown promise in allowing accurate residue-residue contact prediction even for shallow sequence alignments. Here we introduce DMPfold, which uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. Applied to all Pfam domains without known structures, confident models for 25% of these so-called dark families were produced in under a week on a small 200 core cluster. DMPfold provides models for 16% of human proteome UniProt entries without structures, generates accurate models with fewer than 100 sequences in some cases, and is freely available.
Affiliation(s)
- Joe G Greener
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
- Shaun M Kandathil
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
- David T Jones
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
  - The Francis Crick Institute, 1 Midland Road, London, NW1 1AT, UK
11
Scheibenreif L, Littmann M, Orengo C, Rost B. FunFam protein families improve residue level molecular function prediction. BMC Bioinformatics 2019; 20:400. [PMID: 31319797 PMCID: PMC6639920 DOI: 10.1186/s12859-019-2988-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 04/18/2019] [Accepted: 07/09/2019] [Indexed: 01/16/2023]
Abstract
BACKGROUND: The CATH database provides a hierarchical classification of protein domain structures including a sub-classification of superfamilies into functional families (FunFams). We analyzed the similarity of binding site annotations in these FunFams and incorporated FunFams into the prediction of protein binding residues. RESULTS: FunFam members agreed, on average, in 36.9 ± 0.6% of their binding residue annotations. This constituted a 6.7-fold increase over randomly grouped proteins and a 1.2-fold increase (1.1-fold on the same dataset) over proteins with the same enzymatic function (identical Enzyme Commission, EC, number). Mapping de novo binding residue prediction methods (BindPredict-CCS, BindPredict-CC) onto FunFam resulted in consensus predictions for those residues that were aligned and predicted alike (binding/non-binding) within a FunFam. This simple consensus increased the F1-score (for binding) 1.5-fold over the original prediction method. Variation of the threshold for how many proteins in the consensus prediction had to agree provided a convenient control of accuracy/precision and coverage/recall, e.g. reaching a precision as high as 60.8 ± 0.4% for a stringent threshold. CONCLUSIONS: The FunFams outperformed even the carefully curated EC numbers in terms of agreement of binding site residues. Additionally, we assume that our proof-of-principle through the prediction of protein binding residues will be relevant for many other solutions profiting from FunFams to infer functional information at the residue level.
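The thresholded consensus described in the abstract can be sketched as a column-wise vote over aligned family members. This is an illustration under stated assumptions, not the authors' implementation: the alignment is taken to be gapless, the threshold is expressed as the fraction of members that must predict "binding", and the function name `consensus_binding` and the toy predictions are hypothetical.

```python
from typing import List


def consensus_binding(predictions: List[List[bool]], threshold: float) -> List[bool]:
    """Column-wise consensus over aligned per-residue binding predictions.

    predictions[m][c] is True if family member m is predicted to bind at
    aligned column c. A column is called binding when the fraction of members
    predicting binding is at least `threshold` (assumed definition).
    """
    if not predictions:
        raise ValueError("need predictions for at least one family member")
    width = len(predictions[0])
    if any(len(row) != width for row in predictions):
        raise ValueError("all members must cover the same aligned columns")
    n_members = len(predictions)
    consensus = []
    for col in range(width):
        votes = sum(1 for row in predictions if row[col])
        consensus.append(votes / n_members >= threshold)
    return consensus


# Toy family: 3 members, 4 aligned columns (made-up predictions).
family = [
    [True, True, False, False],
    [True, False, False, True],
    [True, True, False, False],
]
print(consensus_binding(family, 0.5))  # [True, True, False, False]
print(consensus_binding(family, 1.0))  # [True, False, False, False]
```

Raising the threshold keeps only columns on which more members agree, which is the precision versus recall trade-off the abstract mentions.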
Affiliation(s)
- Linus Scheibenreif
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Maria Littmann
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Christine Orengo
  - Department of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Burkhard Rost
  - Department of Informatics, Bioinformatics & Computational Biology - i12, TUM (Technical University of Munich), Boltzmannstr. 3, 85748, Garching/Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
  - TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
  - Department of Biochemistry and Molecular Biophysics & New York Consortium on Membrane Protein Structure (NYCOMPS), Columbia University, 701 West, 168th Street, New York, NY 10032, USA
12
Hu G, Wang K, Song J, Uversky VN, Kurgan L. Taxonomic Landscape of the Dark Proteomes: Whole-Proteome Scale Interplay Between Structural Darkness, Intrinsic Disorder, and Crystallization Propensity. Proteomics 2018; 18:e1800243. [PMID: 30198635 DOI: 10.1002/pmic.201800243] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 07/04/2018] [Revised: 08/30/2018] [Indexed: 12/14/2022]
Abstract
The growth rate of the protein sequence universe dramatically exceeds the speed of expansion of the protein structure universe, generating an immense dark proteome that includes proteins with unknown structure. A whole-proteome scale analysis of 5.4 million proteins from 987 proteomes in the three domains of life and viruses is performed to systematically dissect the interplay between structural coverage, degree of putative intrinsic disorder, and predicted propensity for structure determination. It has been found that Archaean and Bacterial proteomes have relatively high structural coverage and low amounts of disorder, whereas Eukaryotic and Viral proteomes are characterized by a broad spread of structural coverage and higher disorder levels. The analysis reveals that dark proteomes (i.e., proteomes containing high fractions of proteins with unknown structure) have significantly elevated amounts of intrinsic disorder and are predicted to be difficult to solve structurally. Although the majority of dark proteomes are of viral origin, many dark viral proteomes have at least modest crystallization propensity and only a handful of them are enriched in intrinsic disorder. The disorder, structural coverage, and propensity for structural determination are mapped onto a novel proteome-level sequence similarity network to analyze the interplay of these characteristics in the taxonomic landscape.
Affiliation(s)
- Gang Hu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, P. R. China
- Kui Wang
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, P. R. China
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Vladimir N Uversky
  - Department of Molecular Medicine and USF Health Byrd Alzheimer's Research Institute, Morsani College of Medicine, University of South Florida, Tampa, 33612, USA; Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, 142290, Russia
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
13
Pellizza L, Smal C, Rodrigo G, Arán M. Codon usage clusters correlation: towards protein solubility prediction in heterologous expression systems in E. coli. Sci Rep 2018; 8:10618. [PMID: 30006617 PMCID: PMC6045634 DOI: 10.1038/s41598-018-29035-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/20/2018] [Accepted: 06/21/2018] [Indexed: 12/15/2022]
Abstract
Production of soluble recombinant proteins is crucial to the development of industry and basic research. However, aggregation due to the incorrect folding of nascent polypeptides is still a major bottleneck. Understanding the factors governing protein solubility is important to grasp the underlying mechanisms and improve the design of recombinant proteins. Here we show a quantitative study of the expression and solubility of a set of proteins from Bizionia argentinensis. Through the analysis of different features known to modulate protein production, we defined two parameters based on the %MinMax algorithm to compare codon usage clusters between the host and the target genes. We demonstrate that the absolute difference between all %MinMax frequencies of the host and the target gene is significantly negatively correlated with protein expression levels. Most importantly, a strong positive correlation between solubility and the degree of conservation of codon usage clusters is observed for two independent datasets. Moreover, we show that this correlation is higher in codon usage clusters involved in less compact protein secondary structure regions. Our results provide important tools for protein design and support the notion that codon usage may dictate translation rate and modulate co-translational folding.
Collapse
Affiliation(s)
- Leonardo Pellizza
- Laboratory of Nuclear Magnetic Resonance, Fundación Instituto Leloir, IIBBA-CONICET, Av. Patricias Argentinas 435, C1405BWE, CABA, Argentina
| | - Clara Smal
- Laboratory of Nuclear Magnetic Resonance, Fundación Instituto Leloir, IIBBA-CONICET, Av. Patricias Argentinas 435, C1405BWE, CABA, Argentina
| | - Guido Rodrigo
- Laboratory of Nuclear Magnetic Resonance, Fundación Instituto Leloir, IIBBA-CONICET, Av. Patricias Argentinas 435, C1405BWE, CABA, Argentina
| | - Martín Arán
- Laboratory of Nuclear Magnetic Resonance, Fundación Instituto Leloir, IIBBA-CONICET, Av. Patricias Argentinas 435, C1405BWE, CABA, Argentina.
| |
|
14
|
Meng F, Wang C, Kurgan L. fDETECT webserver: fast predictor of propensity for protein production, purification, and crystallization. BMC Bioinformatics 2018; 18:580. [PMID: 29295714 PMCID: PMC6389161 DOI: 10.1186/s12859-017-1995-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 09/19/2017] [Accepted: 12/06/2017] [Indexed: 02/26/2023] Open
Abstract
Background: Development of predictors of propensity of protein sequences for successful crystallization has been actively pursued for over a decade. A few novel methods that expanded the scope of these predictions to address additional steps of protein production and structure determination pipelines were released in recent years. The predictive performance of the current methods is modest. This is because the only input that they use is the protein sequence and because the experimental annotations of these data might be inconsistent, given that they were collected across many laboratories and centers. However, even these modest levels of predictive quality are still practical compared to the reported low success rates of crystallization, which are below 10%. We focus on another important aspect related to a high computational cost of running the predictors that offer the expanded scope. Results: We introduce a novel fDETECT webserver that provides very fast and modestly accurate predictions of the success of protein production, purification, crystallization, and structure determination. Empirical tests on two datasets demonstrate that fDETECT is more accurate than the only other similarly fast method, and similarly accurate and three orders of magnitude faster than the currently most accurate predictors. Our method predicts a single protein in about 120 milliseconds and needs less than an hour to generate the four predictions for an entire human proteome. Moreover, we empirically show that fDETECT secures similar levels of predictive performance when compared with four representative methods that only predict success of crystallization, while it also provides the other three predictions. A webserver that implements fDETECT is available at http://biomine.cs.vcu.edu/servers/fDETECT/. Conclusions: fDETECT is a computational tool that supports target selection for protein production and X-ray crystallography-based structure determination. It offers predictive quality that matches or exceeds other state-of-the-art tools and is especially suitable for the analysis of large protein sets.
Affiliation(s)
- Fanchi Meng
- Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
| | - Chen Wang
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA.
| |
|
15
|
Abstract
In this review, we describe how the interplay among science, technology and community interests contributed to the evolution of four structural biology data resources. We present the method by which data deposited by scientists are prepared for worldwide distribution, and argue that data archiving in a trusted repository must be an integral part of any scientific investigation.
Affiliation(s)
- Helen M. Berman
- Center for Integrative Proteomics Research, Institute for Quantitative Biomedicine, Department of Chemistry and Chemical Biology, 174 Frelinghuysen Road, Piscataway New Jersey 08854
| | - Catherine L. Lawson
- Center for Integrative Proteomics Research, Institute for Quantitative Biomedicine, Department of Chemistry and Chemical Biology, 174 Frelinghuysen Road, Piscataway New Jersey 08854
| | - Brinda Vallat
- Center for Integrative Proteomics Research, Institute for Quantitative Biomedicine, Department of Chemistry and Chemical Biology, 174 Frelinghuysen Road, Piscataway New Jersey 08854
| | - Margaret J. Gabanyi
- Center for Integrative Proteomics Research, Institute for Quantitative Biomedicine, Department of Chemistry and Chemical Biology, 174 Frelinghuysen Road, Piscataway New Jersey 08854
| |
|
16
|
Dey S, Levy ED. Inferring and Using Protein Quaternary Structure Information from Crystallographic Data. Methods Mol Biol 2018; 1764:357-375. [PMID: 29605927 DOI: 10.1007/978-1-4939-7759-8_23] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 06/08/2023]
Abstract
A precise knowledge of the quaternary structure of proteins is essential to illuminate both their function and their evolution. The major part of our knowledge on quaternary structure is inferred from X-ray crystallography data, but this inference process is hard and error-prone. The difficulty lies in discriminating fortuitous protein contacts, which make up the lattice of protein crystals, from biological protein contacts that exist in the native cellular environment. Here, we review methods devised to discriminate between both types of contacts and describe resources for downloading protein quaternary structure information and identifying high-confidence quaternary structures. The use of high-confidence datasets of quaternary structures will be critical for the analysis of structural, functional, and evolutionary properties of proteins.
Affiliation(s)
- Sucharita Dey
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
| | - Emmanuel D Levy
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel.
| |
|
17
|
Serrano P, Dutta SK, Proudfoot A, Mohanty B, Susac L, Martin B, Geralt M, Jaroszewski L, Godzik A, Elsliger M, Wilson IA, Wüthrich K. NMR in structural genomics to increase structural coverage of the protein universe: Delivered by Prof. Kurt Wüthrich on 7 July 2013 at the 38th FEBS Congress in St. Petersburg, Russia. FEBS J 2016; 283:3870-3881. [PMID: 27154589 DOI: 10.1111/febs.13751] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/23/2016] [Revised: 04/12/2016] [Accepted: 05/04/2016] [Indexed: 12/12/2022]
Abstract
For more than a decade, the Joint Center for Structural Genomics (JCSG; www.jcsg.org) worked toward increased three-dimensional structure coverage of the protein universe. This coordinated quest was one of the main goals of the four high-throughput (HT) structure determination centers of the Protein Structure Initiative (PSI; www.nigms.nih.gov/Research/specificareas/PSI). To achieve the goals of the PSI, the JCSG made use of the complementarity of structure determination by X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy to increase and diversify the range of targets entering the HT structure determination pipeline. The overall strategy, for both techniques, was to determine atomic resolution structures for representatives of large protein families, as defined by the Pfam database, which had no structural coverage and could make significant contributions to biological and biomedical research. Furthermore, the experimental structures could be leveraged by homology modeling to further expand the structural coverage of the protein universe and increase biological insights. Here, we describe what could be achieved by this structural genomics approach, using as an illustration the contributions from 20 NMR structure determinations out of a total of 98 JCSG NMR structures, which were selected because they are the first three-dimensional structure representations of the respective Pfam protein families. The information from this small sample is representative for the overall results from crystal and NMR structure determination in the JCSG. There are five new folds, which were classified as domains of unknown functions (DUF), three of the proteins could be functionally annotated based on three-dimensional structure similarity with previously characterized proteins, and 12 proteins showed only limited similarity with previous deposits in the Protein Data Bank (PDB) and were classified as DUFs.
Affiliation(s)
- Pedro Serrano
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Samit K Dutta
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Andrew Proudfoot
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Biswaranjan Mohanty
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA; Skaggs Institute for Chemical Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Lukas Susac
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Bryan Martin
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Michael Geralt
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Lukasz Jaroszewski
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Program on Bioinformatics and Systems Biology, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA
| | - Adam Godzik
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Program on Bioinformatics and Systems Biology, Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA, USA
| | - Marc Elsliger
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Ian A Wilson
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA; Skaggs Institute for Chemical Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Kurt Wüthrich
- Joint Center for Structural Genomics, The Scripps Research Institute, La Jolla, CA, USA; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA; Skaggs Institute for Chemical Biology, The Scripps Research Institute, La Jolla, CA, USA
| |
|
18
|
The impact of structural genomics: the first quindecennial. J Struct Funct Genomics 2016; 17:1-16. [PMID: 26935210 DOI: 10.1007/s10969-016-9201-5] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.4] [Received: 12/12/2015] [Accepted: 02/17/2016] [Indexed: 12/21/2022]
Abstract
The period 2000-2015 brought the advent of high-throughput approaches to protein structure determination. With the overall funding on the order of $2 billion (in 2010 dollars), the structural genomics (SG) consortia established worldwide have developed pipelines for target selection, protein production, sample preparation, crystallization, and structure determination by X-ray crystallography and NMR. These efforts resulted in the determination of over 13,500 protein structures, mostly from unique protein families, and increased the structural coverage of the expanding protein universe. SG programs contributed over 4400 publications to the scientific literature. The NIH-funded Protein Structure Initiatives alone have produced over 2000 scientific publications, which to date have attracted more than 93,000 citations. Software and database developments that were necessary to handle high-throughput structure determination workflows have led to structures of better quality and improved integrity of the associated data. Organized and accessible data have a positive impact on the reproducibility of scientific experiments. Most of the experimental data generated by the SG centers are freely available to the community and have been utilized by scientists in various fields of research. SG projects have created, improved, streamlined, and validated many protocols for protein production and crystallization, data collection, and functional analysis, significantly benefiting biological and biomedical research.
|
19
|
Punta M, Mistry J. Homology-Based Annotation of Large Protein Datasets. Methods Mol Biol 2016; 1415:153-176. [PMID: 27115632 DOI: 10.1007/978-1-4939-3572-7_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
Abstract
Advances in DNA sequencing technologies have led to an increasing amount of protein sequence data being generated. Only a small fraction of this protein sequence data will have experimental annotation associated with them. Here, we describe a protocol for in silico homology-based annotation of large protein datasets that makes extensive use of manually curated collections of protein families. We focus on annotations provided by the Pfam database and suggest ways to identify family outliers and family variations. This protocol may be useful to people who are new to protein data analysis, or who are unfamiliar with the current computational tools that are available.
Affiliation(s)
- Marco Punta
- Sorbonne Universités, UPMC-Univ P6, CNRS, Laboratoire de Biologie Computationnelle et Quantitative - UMR 7238, 15 rue de l'Ecole de Médecine, Paris, France.
| | - Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| |
|
20
|
Everett JK, Tejero R, Murthy SBK, Acton TB, Aramini JM, Baran MC, Benach J, Cort JR, Eletsky A, Forouhar F, Guan R, Kuzin AP, Lee HW, Liu G, Mani R, Mao B, Mills JL, Montelione AF, Pederson K, Powers R, Ramelot T, Rossi P, Seetharaman J, Snyder D, Swapna GVT, Vorobiev SM, Wu Y, Xiao R, Yang Y, Arrowsmith CH, Hunt JF, Kennedy MA, Prestegard JH, Szyperski T, Tong L, Montelione GT. A community resource of experimental data for NMR / X-ray crystal structure pairs. Protein Sci 2015; 25:30-45. [PMID: 26293815 DOI: 10.1002/pro.2774] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 06/22/2015] [Accepted: 08/17/2015] [Indexed: 12/11/2022]
Abstract
We have developed an online NMR / X-ray Structure Pair Data Repository. The NIGMS Protein Structure Initiative (PSI) has provided many valuable reagents, 3D structures, and technologies for structural biology. The Northeast Structural Genomics Consortium was one of several PSI centers. NESG used both X-ray crystallography and NMR spectroscopy for protein structure determination. A key goal of the PSI was to provide experimental structures for at least one representative of each of hundreds of targeted protein domain families. In some cases, structures for identical (or nearly identical) constructs were determined by both NMR and X-ray crystallography. NMR spectroscopy and X-ray diffraction data for 41 of these "NMR / X-ray" structure pairs determined using conventional triple-resonance NMR methods with extensive sidechain resonance assignments have been organized in an online NMR / X-ray Structure Pair Data Repository. In addition, several NMR data sets for perdeuterated, methyl-protonated protein samples are included in this repository. As an example of the utility of this repository, these data were used to revisit questions about the precision and accuracy of protein NMR structures first outlined by Levy and coworkers several years ago (Andrec et al., Proteins 2007;69:449-465). These results demonstrate that the agreement between NMR and X-ray crystal structures is improved using modern methods of protein NMR spectroscopy. The NMR / X-ray Structure Pair Data Repository will provide a valuable resource for new computational NMR methods development.
Affiliation(s)
- John K Everett
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Roberto Tejero
- Departamento De Química Física, Universidad De Valencia, Valencia, Spain
| | - Sarath B K Murthy
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Thomas B Acton
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - James M Aramini
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Michael C Baran
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Jordi Benach
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - John R Cort
- Fundamental and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington, 99354, USA
| | - Alexander Eletsky
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, New York, 14260, USA
| | - Farhad Forouhar
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - Rongjin Guan
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Alexandre P Kuzin
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - Hsiau-Wei Lee
- Complex Carbohydrate Research Center and Northeast Structural Genomics Consortium, University of Georgia, Athens, Georgia, 30602, USA
| | - Gaohua Liu
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Rajeswari Mani
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Binchen Mao
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Jeffrey L Mills
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, New York, 14260, USA
| | - Alexander F Montelione
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Kari Pederson
- Complex Carbohydrate Research Center and Northeast Structural Genomics Consortium, University of Georgia, Athens, Georgia, 30602, USA
| | - Robert Powers
- Department of Chemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, 68588, USA
| | - Theresa Ramelot
- Department of Chemistry and Biochemistry, Northeast Structural Genomics Consortium, Miami University, Oxford, Ohio, 45056, USA
| | - Paolo Rossi
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Jayaraman Seetharaman
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - David Snyder
- Department of Chemistry, College of Science and Health, William Paterson University of NJ, Wayne, New Jersey, 07470, USA
| | - G V T Swapna
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Sergey M Vorobiev
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - Yibing Wu
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, New York, 14260, USA
| | - Rong Xiao
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| | - Yunhuang Yang
- Department of Chemistry and Biochemistry, Northeast Structural Genomics Consortium, Miami University, Oxford, Ohio, 45056, USA
| | - Cheryl H Arrowsmith
- Cancer Genomics & Proteomics, Department of Medical Biophysics, Ontario Cancer Institute, and Northeast Structural Genomics Consortium, University of Toronto, Toronto, Ontario, M5G 1L7, Canada
| | - John F Hunt
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - Michael A Kennedy
- Department of Chemistry and Biochemistry, Northeast Structural Genomics Consortium, Miami University, Oxford, Ohio, 45056, USA
| | - James H Prestegard
- Complex Carbohydrate Research Center and Northeast Structural Genomics Consortium, University of Georgia, Athens, Georgia, 30602, USA
| | - Thomas Szyperski
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, New York, 14260, USA
| | - Liang Tong
- Department of Biological Sciences and Northeast Structural Genomics Consortium, Columbia University, New York, NY, 10027, USA
| | - Gaetano T Montelione
- Center for Advanced Biotechnology and Medicine, Department of Molecular Biology and Biochemistry, and Northeast Structural Genomics Consortium, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA; Department of Biochemistry, Robert Wood Johnson Medical School, Rutgers, the State University of New Jersey, Piscataway, New Jersey, 08854, USA
| |
|
21
|
Pujato M, Kieken F, Skiles AA, Tapinos N, Fiser A. Prediction of DNA binding motifs from 3D models of transcription factors; identifying TLX3 regulated genes. Nucleic Acids Res 2014; 42:13500-12. [PMID: 25428367 PMCID: PMC4267649 DOI: 10.1093/nar/gku1228] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Indexed: 12/14/2022] Open
Abstract
Proper cell functioning depends on the precise spatio-temporal expression of its genetic material. Gene expression is controlled to a great extent by sequence-specific transcription factors (TFs). Our current knowledge on where and how TFs bind and associate to regulate gene expression is incomplete. A structure-based computational algorithm (TF2DNA) is developed to identify binding specificities of TFs. The method constructs homology models of TFs bound to DNA and assesses the relative binding affinity for all possible DNA sequences using a knowledge-based potential, after optimization in a molecular mechanics force field. TF2DNA predictions were benchmarked against experimentally determined binding motifs. Success rates range from 45% to 81% and primarily depend on the sequence identity of aligned target sequences and template structures. TF2DNA was used to predict 1321 motifs for 1825 putative human TF proteins, facilitating the reconstruction of most of the human gene regulatory network. As an illustration, the predicted DNA binding site for the poorly characterized T-cell leukemia homeobox 3 (TLX3) TF was confirmed with gel shift assay experiments. TLX3 motif searches in human promoter regions identified a group of genes enriched in functions relating to hematopoiesis, tissue morphology, endocrine system and connective tissue development and function.
Affiliation(s)
- Mario Pujato
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA; Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
| | - Fabien Kieken
- Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA; Macromolecular Therapeutics Development, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
| | - Amanda A Skiles
- Molecular Neuroscience Laboratory, Geisinger Clinic, 100 North Academy Avenue, Danville, PA 17822, USA
| | - Nikos Tapinos
- Molecular Neuroscience Laboratory, Geisinger Clinic, 100 North Academy Avenue, Danville, PA 17822, USA
| | - Andras Fiser
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA; Department of Biochemistry, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
| |
|
22
|
Scherer M, Klingl S, Sevvana M, Otto V, Schilling EM, Stump JD, Müller R, Reuter N, Sticht H, Muller YA, Stamminger T. Crystal structure of cytomegalovirus IE1 protein reveals targeting of TRIM family member PML via coiled-coil interactions. PLoS Pathog 2014; 10:e1004512. [PMID: 25412268 PMCID: PMC4239116 DOI: 10.1371/journal.ppat.1004512] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.7] [Received: 07/04/2014] [Accepted: 10/09/2014] [Indexed: 01/08/2023] Open
Abstract
PML nuclear bodies (PML-NBs) are enigmatic structures of the cell nucleus that act as key mediators of intrinsic immunity against viral pathogens. PML itself is a member of the E3-ligase TRIM family of proteins that regulates a variety of innate immune signaling pathways. Consequently, viruses have evolved effector proteins to modify PML-NBs; however, little is known concerning structure-function relationships of viral antagonists. The herpesvirus human cytomegalovirus (HCMV) expresses the abundant immediate-early protein IE1 that colocalizes with PML-NBs and induces their dispersal, which correlates with the antagonization of NB-mediated intrinsic immunity. Here, we delineate the molecular basis for this antagonization by presenting the first crystal structure for the evolutionarily conserved primate cytomegalovirus IE1 proteins. We show that IE1 consists of a globular core (IE1CORE) flanked by intrinsically disordered regions. The 2.3 Å crystal structure of IE1CORE displays an all α-helical, femur-shaped fold, which lacks overall fold similarity with known protein structures, but shares secondary structure features recently observed in the coiled-coil domain of TRIM proteins. Yeast two-hybrid and coimmunoprecipitation experiments demonstrate that IE1CORE binds efficiently to the TRIM family member PML, and is able to induce PML deSUMOylation. Intriguingly, this results in the release of NB-associated proteins into the nucleoplasm, but not of PML itself. Importantly, we show that PML deSUMOylation by IE1CORE is sufficient to antagonize PML-NB-instituted intrinsic immunity. Moreover, co-immunoprecipitation experiments demonstrate that IE1CORE binds via the coiled-coil domain to PML and also interacts with TRIM5α. We propose that IE1CORE sequesters PML and possibly other TRIM family members via structural mimicry using an extended binding surface formed by the coiled-coil region. This mode of interaction might render the antagonizing activity less susceptible to mutational escape.
Research of the last few years has revealed that microbial infections are not only controlled by innate and adaptive immune mechanisms, but also by cellular restriction factors, which give cells the capacity to resist pathogens. PML nuclear bodies (PML-NBs) are dot-like nuclear structures representing multiprotein complexes that consist of the PML protein, a member of the TRIM family of proteins, as well as a multitude of additional regulatory factors. PML-NB components act as a barrier against many viral infections; however, viral antagonistic proteins have evolved to modify PML-NBs, thus abrogating this cellular defense. Here, we delineate the molecular basis for antagonization by the immediate-early protein IE1 of the herpesvirus human cytomegalovirus. We present the first crystal structure for the evolutionarily conserved core domain (IE1CORE) of primate cytomegalovirus IE1, which exhibits a novel, unusual fold. IE1CORE modifies PML-NBs by releasing other PML-NB proteins into the nucleoplasm, which is sufficient to antagonize intrinsic immunity. Importantly, IE1CORE shares secondary structure features with the coiled-coil domain (CC) of TRIM factors, and we demonstrate strong binding of IE1 to the PML-CC. We propose that IE1CORE sequesters PML and possibly other TRIM family members via an extended binding surface formed by the coiled-coil domain.
Affiliation(s)
- Myriam Scherer
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Stefan Klingl
- Division of Biotechnology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Madhumati Sevvana
- Division of Biotechnology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Victoria Otto
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Eva-Maria Schilling
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Joachim D. Stump
- Division of Bioinformatics, Institute of Biochemistry, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Regina Müller
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Nina Reuter
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Heinrich Sticht
- Division of Bioinformatics, Institute of Biochemistry, University of Erlangen-Nuremberg, Erlangen, Germany
| | - Yves A. Muller
- Division of Biotechnology, University of Erlangen-Nuremberg, Erlangen, Germany
- * E-mail: (YAM); (TS)
| | - Thomas Stamminger
- Institute for Clinical and Molecular Virology, University of Erlangen-Nuremberg, Erlangen, Germany
- * E-mail: (YAM); (TS)
|
23
|
Structural and functional characterization of DUF1471 domains of Salmonella proteins SrfN, YdgH/SssB, and YahO. PLoS One 2014; 9:e101787. [PMID: 25010333 PMCID: PMC4092069 DOI: 10.1371/journal.pone.0101787] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 03/12/2014] [Accepted: 04/07/2014] [Indexed: 11/20/2022] Open
Abstract
Bacterial species in the Enterobacteriaceae typically contain multiple paralogues of a small domain of unknown function (DUF1471) from a family of conserved proteins also known as YhcN or BhsA/McbA. Proteins containing DUF1471 may have a single or three copies of this domain. Representatives of this family have been demonstrated to play roles in several cellular processes including stress response, biofilm formation, and pathogenesis. We have conducted NMR and X-ray crystallographic studies of four DUF1471 domains from Salmonella representing three different paralogous DUF1471 subfamilies: SrfN, YahO, and SssB/YdgH (two of its three DUF1471 domains: the N-terminal domain I (residues 21–91), and the C-terminal domain III (residues 244–314)). Notably, SrfN has been shown to have a role in intracellular infection by Salmonella Typhimurium. These domains share less than 35% pairwise sequence identity. Structures of all four domains show a mixed α+β fold that is most similar to that of bacterial lipoprotein RcsF. However, all four DUF1471 sequences lack the redox sensitive cysteine residues essential for RcsF activity in a phospho-relay pathway, suggesting that DUF1471 domains perform a different function(s). SrfN forms a dimer in contrast to YahO and SssB domains I and III, which are monomers in solution. A putative binding site for oxyanions such as phosphate and sulfate was identified in SrfN, and an interaction between the SrfN dimer and sulfated polysaccharides was demonstrated, suggesting a direct role for this DUF1471 domain at the host-pathogen interface.
|
24
|
van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones D, Kim PM, Kriwacki R, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright P, Babu MM. Classification of intrinsically disordered regions and proteins. Chem Rev 2014; 114:6589-631. [PMID: 24773235 PMCID: PMC4095912 DOI: 10.1021/cr400525m] [Citation(s) in RCA: 1494] [Impact Index Per Article: 135.8] [Received: 09/23/2013] [Indexed: 12/11/2022]
Affiliation(s)
- Robin van der Lee
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, 6500 HB Nijmegen, The Netherlands
| | - Marija Buljan
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Benjamin Lang
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Robert J. Weatheritt
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Gary W. Daughdrill
- Department of Cell Biology, Microbiology, and Molecular Biology, University of South Florida, 3720 Spectrum Boulevard, Suite 321, Tampa, Florida 33612, United States
| | - A. Keith Dunker
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Monika Fuxreiter
- MTA-DE Momentum Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, H-4032 Debrecen, Nagyerdei krt 98, Hungary
| | - Julian Gough
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, United Kingdom
| | - Joerg Gsponer
- Department of Biochemistry and Molecular Biology, Centre for High-Throughput Biology, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
| | - David T. Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| | - Philip M. Kim
- Terrence Donnelly Centre for Cellular and Biomolecular Research, Department of Molecular Genetics, and Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3E1, Canada
| | - Richard W. Kriwacki
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
| | - Christopher J. Oldfield
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Rohit V. Pappu
- Department of Biomedical Engineering and Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, United States
| | - Peter Tompa
- VIB Department of Structural Biology, Vrije Universiteit Brussel, Brussels, Belgium
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
| | - Vladimir N. Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida 33612, United States
- Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow Region, Russia
| | - Peter E. Wright
- Department of Integrative Structural and Computational Biology and Skaggs Institute of Chemical Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, United States
| | - M. Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
|
25
|
Saul J, Petritis B, Sau S, Rauf F, Gaskin M, Ober-Reynolds B, Mineyev I, Magee M, Chaput J, Qiu J, LaBaer J. Development of a full-length human protein production pipeline. Protein Sci 2014; 23:1123-35. [PMID: 24806540 DOI: 10.1002/pro.2484] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 10/14/2013] [Revised: 04/17/2014] [Accepted: 05/06/2014] [Indexed: 12/17/2022]
Abstract
There are many proteomic applications that require large collections of purified protein, but parallel production of large numbers of different proteins remains a very challenging task. To help meet the needs of the scientific community, we have developed a human protein production pipeline. Using high-throughput (HT) methods, we transferred the genes of 31 full-length proteins into three expression vectors, and expressed the collection as N-terminal HaloTag fusion proteins in Escherichia coli and two commercial cell-free (CF) systems, wheat germ extract (WGE) and HeLa cell extract (HCE). Expression was assessed by labeling the fusion proteins specifically and covalently with a fluorescent HaloTag ligand and detecting its fluorescence on a LabChip® GX microfluidic capillary gel electrophoresis instrument. This automated, HT assay provided both qualitative and quantitative assessment of recombinant protein. E. coli was only capable of expressing 20% of the test collection in the supernatant fraction with ≥20 μg yields, whereas CF systems had ≥83% success rates. We purified expressed proteins using an automated HaloTag purification method. We purified 20, 33, and 42% of the test collection from E. coli, WGE, and HCE, respectively, with yields ≥1 μg and ≥90% purity. Based on these observations, we have developed a triage strategy for producing full-length human proteins in these three expression systems.
Affiliation(s)
- Justin Saul
- Virginia G. Piper Center for Personalized Diagnostics, Biodesign Institute, Arizona State University, Tempe, Arizona, 85287-6401
|
26
|
Xie L, Ge X, Tan H, Xie L, Zhang Y, Hart T, Yang X, Bourne PE. Towards structural systems pharmacology to study complex diseases and personalized medicine. PLoS Comput Biol 2014; 10:e1003554. [PMID: 24830652 PMCID: PMC4022462 DOI: 10.1371/journal.pcbi.1003554] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Indexed: 12/11/2022] Open
Abstract
Genome-Wide Association Studies (GWAS), whole genome sequencing, and high-throughput omics techniques have generated vast amounts of genotypic and molecular phenotypic data. However, these data have not yet been fully explored to improve the effectiveness and efficiency of drug discovery, which continues along a one-drug-one-target-one-disease paradigm. As a partial consequence, both the cost to launch a new drug and the attrition rate are increasing. Systems pharmacology and pharmacogenomics are emerging to exploit the available data and potentially reverse this trend, but, as we argue here, more is needed. To understand the impact of genetic, epigenetic, and environmental factors on drug action, we must study the structural energetics and dynamics of molecular interactions in the context of the whole human genome and interactome. Such an approach requires an integrative modeling framework for drug action that leverages advances in data-driven statistical modeling and mechanism-based multiscale modeling and transforms heterogeneous data from GWAS, high-throughput sequencing, structural genomics, functional genomics, and chemical genomics into unified knowledge. This is not a small task, but, as reviewed here, progress is being made towards the final goal of personalized medicines for the treatment of complex diseases.
Affiliation(s)
- Lei Xie
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
- Ph.D. Program in Computer Science, Biology, and Biochemistry, The Graduate Center, The City University of New York, New York, New York, United States of America
- * E-mail:
| | - Xiaoxia Ge
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
| | - Hepan Tan
- Department of Computer Science, Hunter College, The City University of New York, New York, New York, United States of America
| | - Li Xie
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
| | - Yinliang Zhang
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
| | - Thomas Hart
- Department of Biological Sciences, Hunter College, The City University of New York, New York, New York, United States of America
| | - Xiaowei Yang
- School of Public Health, Hunter College, The City University of New York, New York, New York, United States of America
| | - Philip E. Bourne
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States of America
|
27
|
Abstract
Proteins are macromolecules that serve a cell’s myriad processes and functions in all living organisms via dynamic interactions with other proteins, small molecules and cellular components. Genetic variations in the protein-encoding regions of the human genome account for >85% of all known Mendelian diseases, and play an influential role in shaping complex polygenic diseases. Proteins also serve as the predominant target class for the design of small molecule drugs to modulate their activity. Knowledge of the shape and form of proteins, by means of their three-dimensional structures, is therefore instrumental to understanding their roles in disease and their potentials for drug development. In this chapter we outline, with the wide readership of non-structural biologists in mind, the various experimental and computational methods available for protein structure determination. We summarize how the wealth of structure information, contributed to a large extent by the technological advances in structure determination to date, serves as a useful tool to decipher the molecular basis of genetic variations for disease characterization and diagnosis, particularly in the emerging era of genomic medicine, and becomes an integral component in the modern day approach towards rational drug development.
Affiliation(s)
- Nelson L.S. Tang
- Dept. of Chemical Pathology and Lab. of Genetics of Disease Suscept., The Chinese University of Hong Kong, Hong Kong, People's Republic of China
| | - Terence Poon
- Department of Paediatrics and Proteomics Laboratory, The Chinese University of Hong Kong, Hong Kong, People's Republic of China
|
28
|
Das D, Murzin AG, Rawlings ND, Finn RD, Coggill P, Bateman A, Godzik A, Aravind L. Structure and computational analysis of a novel protein with metallopeptidase-like and circularly permuted winged-helix-turn-helix domains reveals a possible role in modified polysaccharide biosynthesis. BMC Bioinformatics 2014; 15:75. [PMID: 24646163 PMCID: PMC4000134 DOI: 10.1186/1471-2105-15-75] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/20/2013] [Accepted: 03/04/2014] [Indexed: 11/10/2022] Open
Abstract
Background: CA_C2195 from Clostridium acetobutylicum is a protein of unknown function. Sequence analysis predicted that part of the protein contained a metallopeptidase-related domain. There are over 200 homologs of similar size in large sequence databases such as UniProt, with pairwise sequence identities in the range of ~40-60%. CA_C2195 was chosen for crystal structure determination for structure-based function annotation of novel protein sequence space.
Results: The structure confirmed that CA_C2195 contained an N-terminal metallopeptidase-like domain. The structure revealed two extra domains: an α+β domain inserted in the metallopeptidase-like domain and a C-terminal circularly permuted winged-helix-turn-helix domain.
Conclusions: Based on our sequence and structural analyses using the crystal structure of CA_C2195 we provide a view into the possible functions of the protein. From contextual information from gene-neighborhood analysis, we propose that rather than being a peptidase, CA_C2195 and its homologs might play a role in biosynthesis of a modified cell-surface carbohydrate in conjunction with several sugar-modification enzymes. These results provide the groundwork for the experimental verification of the function.
Affiliation(s)
- Debanu Das
- Joint Center for Structural Genomics, La Jolla, CA, USA.
|
29
|
Jahandideh S, Jaroszewski L, Godzik A. Improving the chances of successful protein structure determination with a random forest classifier. ACTA CRYSTALLOGRAPHICA. SECTION D, BIOLOGICAL CRYSTALLOGRAPHY 2014; 70:627-35. [PMID: 24598732 PMCID: PMC3949519 DOI: 10.1107/s1399004713032070] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Received: 06/05/2013] [Accepted: 11/25/2013] [Indexed: 01/29/2023]
Abstract
Obtaining diffraction-quality crystals remains one of the major bottlenecks in structural biology. The ability to predict the chances of crystallization from the amino-acid sequence of the protein can, at least partly, address this problem by allowing a crystallographer to select homologs that are more likely to succeed and/or to modify the sequence of the target to avoid features that are detrimental to successful crystallization. In 2007, the now widely used XtalPred algorithm [Slabinski et al. (2007), Protein Sci. 16, 2472-2482] was developed. XtalPred classifies proteins into five 'crystallization classes' based on a simple statistical analysis of the physicochemical features of a protein. Here, towards the same goal, advanced machine-learning methods are applied and, in addition, the predictive potential of additional protein features such as predicted surface ruggedness, hydrophobicity, side-chain entropy of surface residues and amino-acid composition of the predicted protein surface are tested. The new XtalPred-RF (random forest) achieves significant improvement of the prediction of crystallization success over the original XtalPred. To illustrate this, XtalPred-RF was tested by revisiting target selection from 271 Pfam families targeted by the Joint Center for Structural Genomics (JCSG) in PSI-2, and it was estimated that the number of targets entered into the protein-production and crystallization pipeline could have been reduced by 30% without lowering the number of families for which the first structures were solved. The prediction improvement depends on the subset of targets used as a testing set and reaches 100% (i.e. twofold) for the top class of predicted targets.
Affiliation(s)
- Samad Jahandideh
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
- Joint Center for Structural Genomics, http://www.jcsg.org/, USA
| | - Lukasz Jaroszewski
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
- Joint Center for Structural Genomics, http://www.jcsg.org/, USA
- Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, California USA
| | - Adam Godzik
- Bioinformatics and Systems Biology Program, Sanford-Burnham Medical Research Institute, 10901 North Torrey Pines Road, La Jolla, CA 92307, USA
- Joint Center for Structural Genomics, http://www.jcsg.org/, USA
- Center for Research in Biological Systems (CRBS), University of California, San Diego, La Jolla, California USA
|
30
|
Huang YJ, Acton TB, Montelione GT. DisMeta: a meta server for construct design and optimization. Methods Mol Biol 2014; 1091:3-16. [PMID: 24203321 DOI: 10.1007/978-1-62703-691-7_1] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Indexed: 12/12/2022]
Abstract
Intrinsically disordered or unstructured regions in proteins are both common and biologically important, particularly in regulation, signaling, and modulating intermolecular recognition processes. From a practical point of view, however, such disordered regions often can pose significant challenges for crystallization. Disordered regions are also detrimental to NMR spectral quality, complicating the analysis of resonance assignments and three-dimensional protein structures by NMR methods. The DisMeta Server has been used by Northeastern Structural Genomics (NESG) consortium as a primary tool for construct design and optimization in preparing samples for both NMR and crystallization studies. It is a meta-server that generates a consensus analysis of eight different sequence-based disorder predictors to identify regions that are likely to be disordered. DisMeta also identifies predicted secretion signal peptides, transmembrane segments, and low-complexity regions. Identification of disordered regions, by either experimental or computational methods, is an important step in the NESG structure production pipeline, allowing the rational design of protein constructs that have improved expression and solubility, improved crystallization, and better quality NMR spectra.
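To make the idea of a consensus over several disorder predictors concrete, the following is a minimal sketch in Python. It is not the DisMeta implementation: the majority-vote rule, the `min_votes` threshold, and the function names `consensus_disorder` and `disordered_regions` are assumptions introduced here for illustration only. It takes per-residue disordered/ordered calls from several hypothetical predictors and reports the contiguous regions where enough of them agree.

```python
from typing import List, Tuple


def consensus_disorder(predictions: List[List[bool]], min_votes: int) -> List[bool]:
    """Flag a residue as disordered when at least min_votes predictors call it disordered.

    predictions holds one list of per-residue booleans per (hypothetical) predictor.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictors must cover the same sequence length")
    return [sum(p[i] for p in predictions) >= min_votes for i in range(length)]


def disordered_regions(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) ranges of consecutive disordered residues."""
    regions: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = i                       # a new disordered stretch begins
        elif not flag and start is not None:
            regions.append((start, i - 1))  # the stretch ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, len(flags)))  # stretch runs to the C-terminus
    return regions


# Toy input: three hypothetical predictors over a six-residue sequence.
toy_predictions = [
    [True, True, False, False, True, True],
    [True, False, False, False, True, True],
    [False, True, False, False, False, True],
]
toy_flags = consensus_disorder(toy_predictions, min_votes=2)
toy_regions = disordered_regions(toy_flags)
```

In construct design, regions returned this way would be candidates for truncation; the vote threshold controls how conservative that trimming is.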
Affiliation(s)
- Yuanpeng Janet Huang
- Center for Advanced Biotechnology and Medicine, Northeast Structural Genomics Consortium, Rutgers University, Piscataway, NJ, USA
|
31
|
Abstract
More than 20% of all protein domains are currently annotated as “domains of unknown function” (DUFs). About 2,700 DUFs are found in bacteria compared with just over 1,500 in eukaryotes. Over 800 DUFs are shared between bacteria and eukaryotes, and about 300 of these are also present in archaea. A total of 2,786 bacterial Pfam domains even occur in animals, including 320 DUFs. Evolutionary conservation suggests that many of these DUFs are important. Here we show that 355 essential proteins in 16 model bacterial species contain 238 DUFs, most of which represent single-domain proteins, clearly establishing the biological essentiality of DUFs. We suggest that experimental research should focus on conserved and essential DUFs (eDUFs) for functional analysis given their important function and wide taxonomic distribution, including bacterial pathogens. The functional units of proteins are domains. Typically, each domain has a distinct structure and function. Genomes encode thousands of domains, and many of the domains have no known function (domains of unknown function [DUFs]). They are often ignored as of little relevance, given that many of them are found in only a few genomes. Here we show that many DUFs are essential DUFs (eDUFs) based on their presence in essential proteins. We also show that eDUFs are often essential even if they are found in relatively few genomes. However, in general, more common DUFs are more often essential than rare DUFs.
|
32
|
Di Domenico T, Potenza E, Walsh I, Parra RG, Giollo M, Minervini G, Piovesan D, Ihsan A, Ferrari C, Kajava AV, Tosatto SCE. RepeatsDB: a database of tandem repeat protein structures. Nucleic Acids Res 2013; 42:D352-7. [PMID: 24311564 PMCID: PMC3964956 DOI: 10.1093/nar/gkt1175] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Indexed: 01/24/2023] Open
Abstract
RepeatsDB (http://repeatsdb.bio.unipd.it/) is a database of annotated tandem repeat protein structures. Tandem repeats pose a difficult problem for the analysis of protein structures, as the underlying sequence can be highly degenerate. Several repeat types have been studied over the years, but their annotation was done on a case-by-case basis, thus making large-scale analysis difficult. We developed RepeatsDB to fill this gap. Using state-of-the-art repeat detection methods and manual curation, we systematically annotated the Protein Data Bank, predicting 10 745 repeat structures. In all, 2797 structures were classified according to a recently proposed classification schema, which was expanded to accommodate new findings. In addition, detailed annotations were performed in a subset of 321 proteins. These annotations feature information on start and end positions for the repeat regions and units. RepeatsDB is an ongoing effort to systematically classify and annotate structural protein repeats in a consistent way. It provides users with the possibility to access and download high-quality datasets either interactively or programmatically through web services.
Affiliation(s)
- Tomás Di Domenico
- Department of Biomedical Sciences, University of Padua, 35131 Padova, Italy, Department of Biological Chemistry, Universidad de Buenos Aires, Buenos Aires C1428EGA, Argentina, Department of Information Engineering, University of Padua, 35121 Padova, Italy, Department of Biosciences, COMSATS Institute of Information Technology, Sahiwal, Pakistan, Centre de Recherches de Biochimie Macromoléculaire, CNRS, 34293 Montpellier Cedex 5, France and Institut de Biologie Computationnelle, 34293 Montpellier Cedex 5, France
|
33
|
Pulavarti SVSRK, Eletsky A, Lee HW, Acton TB, Xiao R, Everett JK, Prestegard JH, Montelione GT, Szyperski T. Solution NMR structure of CD1104B from pathogenic Clostridium difficile reveals a distinct α-helical architecture and provides first structural representative of protein domain family PF14203. JOURNAL OF STRUCTURAL AND FUNCTIONAL GENOMICS 2013; 14:155-160. [PMID: 24048810 PMCID: PMC3844015 DOI: 10.1007/s10969-013-9164-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/03/2013] [Accepted: 09/10/2013] [Indexed: 05/30/2023]
Abstract
A high-quality structure of the 68-residue protein CD1104B from Clostridium difficile strain 630 exhibits a distinct all α-helical fold. The structure presented here is the first representative of bacterial protein domain family PF14203 (currently 180 members) of unknown function (DUF4319) and reveals that the side-chains of the only two strictly conserved residues (Glu 8 and Lys 48) form a salt bridge. Moreover, these two residues are located in the vicinity of the largest surface cleft which is predicted to contribute to a surface area involved in protein-protein interactions. This, along with its coding in transposon CTn4, suggests that CD1104B (and very likely all members of Pfam 14203) functions by interacting with other proteins required for the transfer of transposons between different bacterial species.
Affiliation(s)
- Surya VSRK Pulavarti
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Alexander Eletsky
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Hsiau-Wei Lee
- Complex Carbohydrate Research Center, University of Georgia, and Northeast Structural Genomics Consortium, Athens, GA 30602, USA
| | - Thomas B. Acton
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - Rong Xiao
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - John K. Everett
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - James H. Prestegard
- Complex Carbohydrate Research Center, University of Georgia, and Northeast Structural Genomics Consortium, Athens, GA 30602, USA
| | - Gaetano T. Montelione
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA, Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, UMDNJ, Piscataway NJ 08854, USA
| | - Thomas Szyperski
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
|
34
|
Bruni R, Kloss B. High-throughput cloning and expression of integral membrane proteins in Escherichia coli. CURRENT PROTOCOLS IN PROTEIN SCIENCE 2013; 74:29.6.1-29.6.34. [PMID: 24510647 PMCID: PMC3920300 DOI: 10.1002/0471140864.ps2906s74] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.5] [Indexed: 12/28/2022]
Abstract
Recently, several structural genomics centers have been established and a remarkable number of three-dimensional structures of soluble proteins have been solved. For membrane proteins, the number of structures solved has been significantly trailing those for their soluble counterparts, not least because over-expression and purification of membrane proteins is a much more arduous process. By using high-throughput technologies, a large number of membrane protein targets can be screened simultaneously and a greater number of expression and purification conditions can be employed, leading to a higher probability of successfully determining the structure of membrane proteins. This unit describes the cloning, expression, and screening of membrane proteins using high-throughput methodologies developed in the laboratory. Basic Protocol 1 describes cloning of inserts into expression vectors by ligation-independent cloning. Basic Protocol 2 describes the expression and purification of the target proteins on a miniscale. Lastly, for the targets that do express on the miniscale, Basic Protocols 3 and 4 outline the methods employed for the expression and purification of targets on a midi-scale, as well as a procedure for detergent screening and identification of detergent(s) in which the target protein is stable.
Affiliation(s)
- Renato Bruni
- New York Consortium on Membrane Protein Structure (NYCOMPS), New York Structural Biology Center (NYSBC), New York
| | - Brian Kloss
- New York Consortium on Membrane Protein Structure (NYCOMPS), New York Structural Biology Center (NYSBC), New York
|
35
|
DePietro PJ, Julfayev ES, McLaughlin WA. Quantification of the impact of PSI:Biology according to the annotations of the determined structures. BMC STRUCTURAL BIOLOGY 2013; 13:24. [PMID: 24139526 PMCID: PMC4016320 DOI: 10.1186/1472-6807-13-24] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/14/2013] [Accepted: 10/14/2013] [Indexed: 11/23/2022]
Abstract
Background: Protein Structure Initiative:Biology (PSI:Biology) is the third phase of PSI, in which protein structures are determined in high throughput to characterize their biological functions. The transition to the third phase entailed the formation of PSI:Biology Partnerships, which are composed of structural genomics centers and biomedical science laboratories. We present a method to examine the impact of protein structures determined under the auspices of PSI:Biology by measuring their rates of annotation. The mean numbers of annotations per structure and per residue are examined. These are designed to measure the number of structure-to-function connections that can be leveraged from each structure.
Results: PSI:Biology structures are found to have a higher rate of annotations than structures determined during the first two phases of PSI. In addition, the subset of PSI:Biology structures determined through PSI:Biology Partnerships has a higher rate of annotations than those determined exclusive of those partnerships. Both results hold when the annotation rates are examined either at the level of the entire protein or for annotations that are known to fall at specific residues within the portion of the protein that has a determined structure.
Conclusions: We conclude that PSI:Biology determines structures that are estimated to have a higher degree of biomedical interest than those determined during the first two phases of PSI, based on a broad array of biomedical annotations. For the PSI:Biology Partnerships, we see that there is an associated added value that represents part of the progress toward the goals of PSI:Biology. We interpret the added value to mean that team-based structural biology projects that utilize the expertise and technologies of structural genomics centers together with biological laboratories in the community are conducted in a synergistic manner.
We show that the annotation rates can be used in conjunction with established metrics, i.e. the numbers of structures and the impact of publication records, to monitor the progress of PSI:Biology towards its goals of examining structure-to-function connections of high biomedical relevance. The metric provides an objective means to quantify the overall impact of PSI:Biology because it uses biomedical annotations from external sources.
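As a rough illustration of the two annotation rates described in this abstract (mean annotations per structure and per residue), the following Python sketch computes both for two groups of structures. The `Structure` record, its field names, and the example counts are hypothetical and are not taken from the paper; the per-residue rate is assumed here to be total annotations divided by total residues.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    """Hypothetical record for one determined structure."""
    structure_id: str
    n_residues: int      # residues covered by the determined structure
    n_annotations: int   # annotations mapped to the structure


def annotation_rates(structures):
    """Return (annotations per structure, annotations per residue)."""
    if not structures:
        raise ValueError("need at least one structure")
    total_annotations = sum(s.n_annotations for s in structures)
    total_residues = sum(s.n_residues for s in structures)
    per_structure = total_annotations / len(structures)
    per_residue = total_annotations / total_residues
    return per_structure, per_residue


# Made-up example groups, e.g. partnership vs. non-partnership structures.
group_a = [Structure("a1", 100, 4), Structure("a2", 300, 8)]
group_b = [Structure("b1", 200, 2), Structure("b2", 200, 2)]

for name, group in (("group_a", group_a), ("group_b", group_b)):
    per_structure, per_residue = annotation_rates(group)
    print(name, per_structure, per_residue)
```

Comparing the two tuples gives the kind of group-level contrast the abstract reports; any significance testing would need to be added separately.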
Affiliation(s)
| | | | - William A McLaughlin
- Department of Basic Science, The Commonwealth Medical College, 525 Pine Street, Scranton, PA 18509, USA.
|
36
|
Jalencas X, Mestres J. Identification of Similar Binding Sites to Detect Distant Polypharmacology. Mol Inform 2013; 32:976-90. [PMID: 27481143 DOI: 10.1002/minf.201300082] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Received: 05/01/2013] [Accepted: 07/29/2013] [Indexed: 01/19/2023]
Abstract
The ability of small molecules to interact with multiple proteins is referred to as polypharmacology. This property is often linked to the therapeutic action of drugs but it is known also to be responsible for many of their side effects. Because of its importance, the development of computational methods that can predict drug polypharmacology has become an important line of research that led recently to the identification of many novel targets for known drugs. Nowadays, the majority of these methods are based on measuring the similarity of a query molecule against the hundreds of thousands of molecules for which pharmacological data on thousands of proteins are available in public sources. However, similarity-based methods are inherently biased by the chemical coverage offered by the active molecules present in those public repositories, which limits significantly their capacity to predict interactions with proteins structurally and functionally unrelated to any of the already known targets for drugs. It is in this respect that structure-based methods aiming at identifying similar binding sites may offer an alternative complementary means to ligand-based methods for detecting distant polypharmacology. The different existing approaches to binding site detection, representation, comparison, and fragmentation are reviewed and recent successful applications presented.
Affiliation(s)
- Xavier Jalencas
- Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Research Institute & University Pompeu Fabra, Parc de Recerca Biomèdica, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain fax: +34 93 3160550
| | - Jordi Mestres
- Systems Pharmacology, Research Program on Biomedical Informatics (GRIB), IMIM Hospital del Mar Research Institute & University Pompeu Fabra, Parc de Recerca Biomèdica, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain fax: +34 93 3160550.
|
37
|
Mistry J, Kloppmann E, Rost B, Punta M. An estimated 5% of new protein structures solved today represent a new Pfam family. Acta Crystallographica Section D: Biological Crystallography 2013; 69:2186-93. [PMID: 24189229 PMCID: PMC3817691 DOI: 10.1107/s0907444913027157] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 04/17/2013] [Accepted: 10/02/2013] [Indexed: 01/09/2023]
Abstract
High-resolution structural knowledge is key to understanding how proteins function at the molecular level. The number of entries in the Protein Data Bank (PDB), the repository of all publicly available protein structures, continues to increase, with more than 8000 structures released in 2012 alone. The authors of this article have studied how structural coverage of the protein-sequence space has changed over time by monitoring the number of Pfam families that acquired their first representative structure each year from 1976 to 2012. Twenty years ago, for every 100 new PDB entries released, an estimated 20 Pfam families acquired their first structure. By 2012, this decreased to only about five families per 100 structures. The reasons behind the slower pace at which previously uncharacterized families are being structurally covered were investigated. It was found that although more than 50% of current Pfam families are still without a structural representative, this set is enriched in families that are small, functionally uncharacterized or rich in problem features such as intrinsically disordered and transmembrane regions. While these are important constraints, the reasons why it may not yet be time to give up the pursuit of a targeted but more comprehensive structural coverage of the protein-sequence space are discussed.
Affiliation(s)
- Jaina Mistry
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, England
|
38
|
Pulavarti SVSRK, He Y, Feldmann EA, Eletsky A, Acton TB, Xiao R, Everett JK, Montelione GT, Kennedy MA, Szyperski T. Solution NMR structures provide first structural coverage of the large protein domain family PF08369 and complementary structural coverage of dark operative protochlorophyllide oxidoreductase complexes. Journal of Structural and Functional Genomics 2013; 14:119-126. [PMID: 23963952 PMCID: PMC3982801 DOI: 10.1007/s10969-013-9159-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2013] [Accepted: 07/16/2013] [Indexed: 06/02/2023]
Abstract
High-quality NMR structures of the C-terminal domain comprising residues 484-537 of the 537-residue protein Bacterial chlorophyll subunit B (BchB) from Chlorobium tepidum and of residues 9-61 of the 61-residue protein Asr4154 from Nostoc sp. (strain PCC 7120) exhibit a mixed α/β fold comprising three α-helices and a small β-sheet packed against the second α-helix. These two proteins share 29% sequence similarity and their structures are globally quite similar. The structures of BchB(484-537) and Asr4154(9-61) are the first representative structures for the large protein family (Pfam) PF08369, a family of unknown function currently containing 610 members in bacteria and eukaryotes. Furthermore, BchB(484-537) complements the structural coverage of the dark-operating protochlorophyllide oxidoreductase.
Affiliation(s)
- Surya VSRK Pulavarti
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Yunfen He
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Erik A. Feldmann
- Department of Chemistry and Biochemistry, Miami University, and Northeast Structural Genomics Consortium, Oxford, OH 45056, USA
| | - Alexander Eletsky
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Thomas B. Acton
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - Rong Xiao
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - John K. Everett
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - Gaetano T. Montelione
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA; Department of Biochemistry and Molecular Biology, Robert Wood Johnson Medical School, UMDNJ, Piscataway NJ 08854, USA
| | - Michael A. Kennedy
- Department of Chemistry and Biochemistry, Miami University, and Northeast Structural Genomics Consortium, Oxford, OH 45056, USA
| | - Thomas Szyperski
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
|
39
|
Serrano P, Geralt M, Mohanty B, Wüthrich K. Structural representative of the protein family PF14466 has a new fold and establishes links with the C2 and PLAT domains from the widely distant Pfams PF00168 and PF01477. Protein Sci 2013; 22:1000-7. [PMID: 23681886 DOI: 10.1002/pro.2284] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 03/12/2013] [Revised: 04/29/2013] [Accepted: 05/01/2013] [Indexed: 11/12/2022]
Abstract
The structure of the domain of unknown function (DUF) YP_001302112.1, a protein secreted by the human intestinal microbiota, has been determined by NMR and represents the first structure for the Pfam PF14466. Its NMR structure is classified as a new fold, which nonetheless shows limited similarities with representatives of the PLAT/LH2 domains from PF01477 and the C2 domains from PF00168, both of which bind Ca(2+) for their physiological functions. Further experiments revealed affinity of YP_001302112.1 for Ca(2+), and the NMR structure in the presence of CaCl2 was better defined than that of the apo-protein. Overall, these NMR structures establish a new connection between structural representatives from two widely different Pfams that include the calcium-binding domain of a sialidase from Vibrio cholerae and the α-toxin from Clostridium perfringens, whereby these two proteins have only 7% sequence identity. Furthermore, the work provides information toward the functional annotation of YP_001302112.1, based on its capacity to bind Ca(2+), and thus adds to the structural and functional coverage of the protein sequence universe.
Affiliation(s)
- Pedro Serrano
- Joint Center for Structural Genomics, La Jolla, California, USA
|
40
|
Shirota M, Kinoshita K. Analyses of the general rule on residue pair frequencies in local amino acid sequences of soluble, ordered proteins. Protein Sci 2013; 22:725-33. [PMID: 23526551 DOI: 10.1002/pro.2255] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/12/2012] [Revised: 01/26/2013] [Accepted: 03/14/2013] [Indexed: 11/10/2022]
Abstract
The amino acid sequences of soluble, ordered proteins with stable structures have evolved due to biological and physical requirements, thus distinguishing them from random sequences. Previous analyses have focused on extracting the features that frequently appear in protein substructures, such as α-helix and β-sheet, but the universal features of protein sequences have not been addressed. To clarify the differences between native protein sequences and random sequences, we analyzed 7368 soluble, ordered protein sequences by comparing the observed occurrences of the 400 amino acid pairs in local proximity, up to 10 residues apart along the sequence, with their expected occurrences in random sequences. We found that hydrophobic residue pairs and polar residue pairs are significantly decreased, whereas pairs between a hydrophobic residue and a polar residue are increased. This trend was universally observed regardless of the secondary structure content but was not observed in protein sequences that include intrinsically disordered regions, indicating that it can be a general rule of protein foldability. The possible benefits of this rule are discussed from the viewpoints of protein aggregation and disorder, which are both caused by low-complexity regions of hydrophobic or polar residues.
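The observed-versus-expected comparison of residue pairs described in this abstract can be sketched in Python as follows. This is only an illustration under stated assumptions: the expected count for a pair is taken to be the product of the marginal residue frequencies at the two positions (an independence model), which may differ from the random-sequence model used by the authors, and the function name `pair_ratios` and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues, hence 400 ordered pairs


def pair_ratios(sequences, separation):
    """Observed/expected ratio for each ordered pair (a, b), where b lies
    `separation` positions after a along the sequence."""
    observed = Counter()
    first = Counter()    # residue counts at the first position of a pair
    second = Counter()   # residue counts at the second position of a pair
    n_pairs = 0
    for seq in sequences:
        for i in range(len(seq) - separation):
            a, b = seq[i], seq[i + separation]
            observed[(a, b)] += 1
            first[a] += 1
            second[b] += 1
            n_pairs += 1
    ratios = {}
    for a, b in product(AMINO_ACIDS, repeat=2):
        # Assumed null model: the two positions are independent.
        expected = first[a] * second[b] / n_pairs if n_pairs else 0.0
        if expected > 0:
            ratios[(a, b)] = observed[(a, b)] / expected
    return ratios


# Toy input: a strictly alternating hydrophobic dipeptide repeat.
toy_ratios = pair_ratios(["ALALALAL"], separation=1)
print(toy_ratios[("A", "L")], toy_ratios[("A", "A")])
```

A ratio above 1 marks a pair that is enriched relative to the null model and a ratio below 1 marks a depleted pair; repeating the call for separations 1 through 10 covers the local range mentioned in the abstract.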
Affiliation(s)
- Matsuyuki Shirota
- Department of Applied Information Sciences, Graduate School of Information Sciences, Tohoku University, Sendai, Miyagi, Japan.
|
41
|
Functional site plasticity in domain superfamilies. Biochimica et Biophysica Acta - Proteins and Proteomics 2013; 1834:874-89. [PMID: 23499848 PMCID: PMC3787744 DOI: 10.1016/j.bbapap.2013.02.042] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.3] [Received: 12/04/2012] [Revised: 02/20/2013] [Accepted: 02/28/2013] [Indexed: 11/21/2022]
Abstract
We present, to our knowledge, the first quantitative analysis of functional site diversity in homologous domain superfamilies. Different types of functional sites are considered separately. Our results show that most diverse superfamilies are very plastic in terms of the spatial location of their functional sites. This is especially true for protein–protein interfaces. In contrast, we confirm that catalytic sites typically occupy only a very small number of topological locations. Small-ligand binding sites are more diverse than expected, although in a more limited manner than protein–protein interfaces. In spite of the observed diversity, our results also confirm the previously reported preferential location of functional sites. We identify a subset of homologous domain superfamilies where diversity is particularly extreme, and discuss possible reasons for such plasticity, i.e. structural diversity. Our results do not contradict previous reports of preferential co-location of sites among homologues, but rather point at the importance of not ignoring other sites, especially in large and diverse superfamilies. Data on sites exploited by different relatives, within each well-annotated domain superfamily, have been made accessible from the CATH website in order to highlight versatile superfamilies or superfamilies with highly preferential sites. This information is valuable for systems biology, and knowledge of any constraints on protein interactions could help in understanding the dynamic control of networks in which these proteins participate.
The novelty of our work lies in the comprehensive nature of the analysis (we have used a significantly larger dataset than previous studies) and in the fact that in many superfamilies we show that different parts of the domain surface are exploited by different relatives for ligand/protein interactions, particularly in superfamilies which are diverse in sequence and structure, an observation not previously reported on such a large scale. This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.
Most diverse domain superfamilies have very diverse functional site locations. Catalytic sites are found in a small, restricted number of topological positions. Location of small-ligand binding sites is more diverse than expected. Protein–protein interfaces display the most flexibility in functional site locations.
|
42
|
Mills JL, Acton TB, Xiao R, Everett JK, Montelione GT, Szyperski T. Solution NMR structure of the helicase associated domain BVU_0683(627-691) from Bacteroides vulgatus provides first structural coverage for protein domain family PF03457 and indicates domain binding to DNA. Journal of Structural and Functional Genomics 2013; 14:19-24. [PMID: 23160728 PMCID: PMC3637686 DOI: 10.1007/s10969-012-9148-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2012] [Accepted: 10/29/2012] [Indexed: 06/01/2023]
Abstract
A high-quality NMR structure of the helicase associated (HA) domain comprising residues 627-691 of the 753-residue protein BVU_0683 from Bacteroides vulgatus exhibits an all α-helical fold. The structure presented here is the first representative for the large protein domain family PF03457 (currently 742 members) of HA domains. Comparison with structurally similar proteins supports the hypothesis that HA domains bind to DNA and that binding specificity varies greatly within the family of HA domains constituting PF03457.
Affiliation(s)
- Jeffrey L. Mills
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
| | - Thomas B. Acton
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - Rong Xiao
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - John K. Everett
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA
| | - Gaetano T. Montelione
- Center of Advanced Biotechnology and Medicine and Department of Molecular Biology and Biochemistry, Rutgers, The State University of New Jersey and Northeast Structural Genomics Consortium, Piscataway, NJ 08854, USA, Department of Biochemistry, Robert Wood Johnson Medical School, UMDNJ, Piscataway, NJ 08854, USA
| | - Thomas Szyperski
- Department of Chemistry, The State University of New York at Buffalo, and Northeast Structural Genomics Consortium, Buffalo, NY 14260, USA
|
43
|
Kasahara K, Shirota M, Kinoshita K. Comprehensive classification and diversity assessment of atomic contacts in protein-small ligand interactions. J Chem Inf Model 2012. [PMID: 23186137 DOI: 10.1021/ci300377f] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/28/2022]
Abstract
Elucidating the molecular mechanisms of selective ligand recognition by proteins is a long-standing problem in drug discovery. Rapid increase in the availability of three-dimensional protein structural data indicates that a data-driven approach for finding the rules that govern protein-ligand interactions is increasingly attractive. However, this approach is not straightforward because of the complexity of molecular interactions and our inadequate understanding of the diversity of molecular interactions that occur during ligand recognition. Thus, we aimed to provide a comprehensive classification of the spatial arrangements of ligand atoms based on the local coordinates of each interacting "protein fragment" consisting of three atoms with covalent bonds in each amino acid. We used a pattern recognition technique based on the Gaussian mixture model and found 13,519 patterns in the spatial arrangements of interacting ligand atoms, each of which was described as a Gaussian function of the local coordinates. Some typical well-known interaction patterns such as hydrogen bonds were ubiquitous in several hundred protein families, whereas others were only observed in a few specific protein families. After removing protein sequence redundancy from the data set, we found that 63.4% of ligand atoms interacted via one or more interaction patterns and that 25.7% of ligand atoms interacted without patterns, whereas the remainder had no direct interactions. The top 3115 major patterns included 90% of the interacting pairs of residues and ligand atoms with patterns, while the top 6229 included all of them.
Affiliation(s)
- Kota Kasahara
- Department of Applied Information Sciences, Graduate School of Information Sciences, Tohoku University, Miyagi 980-8597, Japan
|
44
|
Oldfield CJ, Xue B, Van YY, Ulrich EL, Markley JL, Dunker AK, Uversky VN. Utilization of protein intrinsic disorder knowledge in structural proteomics. Biochimica et Biophysica Acta - Proteins and Proteomics 2012; 1834:487-98. [PMID: 23232152 DOI: 10.1016/j.bbapap.2012.12.003] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.2] [Received: 04/03/2012] [Revised: 12/02/2012] [Accepted: 12/03/2012] [Indexed: 12/01/2022]
Abstract
Intrinsically disordered proteins (IDPs) and proteins with long disordered regions are highly abundant in various proteomes. Despite their lack of well-defined ordered structure, these proteins and regions are frequently involved in crucial biological processes. Although in recent years these proteins have attracted the attention of many researchers, IDPs represent a significant challenge for structural characterization since these proteins can impact many of the processes in the structure determination pipeline. Here we investigate the effects of IDPs on the structure determination process and the utility of disorder prediction in selecting and improving proteins for structural characterization. Examination of the extent of intrinsic disorder in existing crystal structures found that relatively few protein crystal structures contain extensive regions of intrinsic disorder. Although intrinsic disorder is not the only cause of crystallization failures and many structured proteins cannot be crystallized, filtering out highly disordered proteins from structure-determination target lists is still likely to be cost effective. It is therefore desirable to exclude highly disordered proteins from structure-determination target lists, and we show that disorder prediction can be applied effectively to enrich structure determination pipelines with proteins more likely to yield crystal structures. For structural investigation of specific proteins, disorder prediction can be used to improve targets for structure determination. Finally, a framework for considering intrinsic disorder in the structure determination pipeline is proposed.
Affiliation(s)
- Christopher J Oldfield
- Center for Computational Biology and Bioinformatics, Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, USA.
|
45
|
Mashiyama ST, Koupparis K, Caffrey CR, McKerrow JH, Babbitt PC. A global comparison of the human and T. brucei degradomes gives insights about possible parasite drug targets. PLoS Negl Trop Dis 2012; 6:e1942. [PMID: 23236535 PMCID: PMC3516576 DOI: 10.1371/journal.pntd.0001942] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/25/2012] [Accepted: 10/23/2012] [Indexed: 01/26/2023] Open
Abstract
We performed a genome-level computational study of sequence and structure similarity, the latter using crystal structures and models, of the proteases of Homo sapiens and the human parasite Trypanosoma brucei. Using sequence and structure similarity networks to summarize the results, we constructed global views that show visually the relative abundance and variety of proteases in the degradome landscapes of these two species, and provide insights into evolutionary relationships between proteases. The results also indicate how broadly these sequence sets are covered by three-dimensional structures. These views facilitate cross-species comparisons and offer clues for drug design from knowledge about the sequences and structures of potential drug targets and their homologs. Two protease groups (“M32” and “C51”) that are very different in sequence from human proteases are examined in structural detail, illustrating the application of this global approach in mining new pathogen genomes for potential drug targets. Based on our analyses, a human ACE2 inhibitor was selected for experimental testing on one of these parasite proteases, TbM32, and was shown to inhibit it. These sequence and structure data, along with interactive versions of the protein similarity networks generated in this study, are available at http://babbittlab.ucsf.edu/resources.html.
Human African trypanosomiasis (HAT) is caused by the protozoan parasite Trypanosoma brucei. HAT is fatal unless treated, yet the current treatment itself can cause death. New treatments are urgently needed. Our study focuses on proteases, which are enzymes that break down proteins. Because of their roles in many centrally important biological processes, proteases are targets for drugs to treat a variety of diseases including parasite infection. The recent explosion of protein sequence and structure information in public databases has made surveys of proteins on a genomic scale possible. However, collecting specific data of interest from diverse databases and synthesizing them in a way that is easy to interpret can be difficult. We used T. brucei and human protease sequences, crystal structures, and models to create network views that show how proteases cluster by similarity. Such views are valuable not only for understanding the evolution of the protein repertoire in each species, but also can give important clues for drug design. Two T. brucei protease groups (“M32” and “C51”) that are very different in sequence from human proteases were examined in structural detail. Based on our analyses, a human ACE2 inhibitor was selected for experimental testing on one of these parasite proteases, TbM32, and was shown to inhibit it.
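A similarity network of the kind mentioned in this abstract can be sketched in Python as a thresholded graph whose connected components act as clusters. The protein identifiers, scores, and threshold below are invented for illustration, and the abstract does not state how the authors scored or thresholded similarity, so this is only a generic sketch.

```python
def similarity_network(scores, threshold):
    """Build an undirected graph as an adjacency dict. Every protein that
    appears in `scores` becomes a node; an edge is added when the pairwise
    similarity score is at least `threshold`."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clusters(graph):
    """Return connected components, found by depth-first search."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical pairwise scores between human (hs_) and parasite (tb_) proteases.
example_scores = {
    ("hs_A", "hs_B"): 0.9,
    ("hs_B", "tb_C"): 0.7,
    ("tb_C", "tb_D"): 0.2,
    ("hs_A", "tb_D"): 0.1,
}
example_clusters = clusters(similarity_network(example_scores, threshold=0.5))
print(example_clusters)
```

In this toy network `tb_D` ends up in a cluster of its own, which is the kind of pattern one would look for when searching for parasite proteins that are distant from all human ones.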
Affiliation(s)
- Susan T. Mashiyama
- Department of Bioengineering and Therapeutic Sciences, California Institute for Quantitative Biomedical Research (QB3), University of California San Francisco, San Francisco, California, United States of America
- Center for Discovery and Innovation in Parasitic Diseases, and Department of Pathology, QB3, University of California San Francisco, San Francisco, California, United States of America
| | - Kyriacos Koupparis
- Center for Discovery and Innovation in Parasitic Diseases, and Department of Pathology, QB3, University of California San Francisco, San Francisco, California, United States of America
| | - Conor R. Caffrey
- Center for Discovery and Innovation in Parasitic Diseases, and Department of Pathology, QB3, University of California San Francisco, San Francisco, California, United States of America
| | - James H. McKerrow
- Center for Discovery and Innovation in Parasitic Diseases, and Department of Pathology, QB3, University of California San Francisco, San Francisco, California, United States of America
| | - Patricia C. Babbitt
- Department of Bioengineering and Therapeutic Sciences, California Institute for Quantitative Biomedical Research (QB3), University of California San Francisco, San Francisco, California, United States of America
|
46
|
Tiwari MK, Singh R, Singh RK, Kim IW, Lee JK. Computational approaches for rational design of proteins with novel functionalities. Comput Struct Biotechnol J 2012; 2:e201209002. [PMID: 24688643 PMCID: PMC3962203 DOI: 10.5936/csbj.201209002] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.8] [Received: 05/31/2012] [Revised: 08/17/2012] [Accepted: 08/23/2012] [Indexed: 11/22/2022] Open
Abstract
Proteins are the most multifaceted macromolecules in living systems and have various important functions, including structural, catalytic, sensory, and regulatory functions. Rational design of enzymes is a great challenge to our understanding of protein structure and physical chemistry and has numerous potential applications. Protein design algorithms have been applied to design or engineer proteins that fold, fold faster, catalyze, catalyze faster, signal, and adopt preferred conformational states. The field of de novo protein design, although only a few decades old, is beginning to produce exciting results. Developments in this field are already having a significant impact on biotechnology and chemical biology. Powerful computational methods for functional protein design have recently succeeded at engineering target activities. Here, we review recently reported de novo functional proteins that were developed using various protein design approaches, including rational design, computational optimization, and selection from combinatorial libraries, highlighting recent advances and successes.
Affiliation(s)
- Manish Kumar Tiwari
- Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea ; These authors contributed equally
| | - Ranjitha Singh
- Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea ; These authors contributed equally
| | - Raushan Kumar Singh
- Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea
| | - In-Won Kim
- Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea
| | - Jung-Kul Lee
- Department of Chemical Engineering, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea ; Institute of SK-KU Biomaterials, Konkuk University, 1 Hwayang-Dong, Gwangjin-Gu, Seoul 143-701, Korea
|
47
|
Xu Q, Dunbrack RL. Assignment of protein sequences to existing domain and family classification systems: Pfam and the PDB. Bioinformatics 2012; 28:2763-72. [PMID: 22942020 DOI: 10.1093/bioinformatics/bts533] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022] Open
Abstract
Motivation: Automating the assignment of existing domain and protein family classifications to new sets of sequences is an important task. Current methods often miss assignments because remote relationships fail to achieve statistical significance. Some assignments are not as long as the actual domain definitions because local alignment methods often cut alignments short. Long insertions in query sequences often erroneously result in two copies of the domain assigned to the query. Divergent repeat sequences in proteins are often missed.
Results: We have developed a multilevel procedure to produce nearly complete assignments of protein families of an existing classification system to a large set of sequences. We apply this to the task of assigning Pfam domains to sequences and structures in the Protein Data Bank (PDB). We found that HHsearch alignments frequently scored more remotely related Pfams in Pfam clans higher than closely related Pfams, thus leading to erroneous assignment at the Pfam family level. A greedy algorithm allowing for partial overlaps was therefore applied first to sequence/HMM alignments, then to HMM-HMM alignments and then to structure alignments, taking care to join partial alignments split by large insertions into single-domain assignments. Additional assignment of repeat Pfams with weaker E-values was allowed after stronger assignments of the repeat HMM. Our database of assignments, called PDBfam, contains Pfams for 99.4% of chains >50 residues.
Availability: The Pfam assignment data in PDBfam are available at http://dunbrack2.fccc.edu/ProtCid/PDBfam, which can be searched by PDB codes and Pfam identifiers. They will be updated regularly.
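The idea of a greedy assignment that tolerates partial overlaps can be illustrated with a short Python sketch. It covers only a single pass over one list of hits; the multilevel procedure in the abstract (sequence/HMM, then HMM-HMM, then structure alignments, plus joining of split alignments) is not reproduced. The `Hit` record, the family names, and the `max_overlap` value of 10 residues are assumptions made for this example, not values from the paper.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical domain hit on a query chain (1-based, inclusive range)."""
    family: str
    start: int
    end: int
    evalue: float


def overlap(a, b):
    """Number of residues shared by two hits."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits, max_overlap=10):
    """Accept hits from strongest to weakest E-value, keeping a hit only if
    it overlaps every already accepted hit by at most `max_overlap` residues."""
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)


example_hits = [
    Hit("FamA", 1, 100, 1e-50),
    Hit("FamB", 95, 200, 1e-30),   # overlaps FamA by 6 residues: tolerated
    Hit("FamC", 50, 150, 1e-10),   # overlaps FamA by 51 residues: rejected
]
assigned = greedy_assign(example_hits)
print([h.family for h in assigned])
```

Allowing a small overlap keeps adjacent domains whose alignment boundaries bleed into each other, while a weaker hit that mostly duplicates a stronger one is dropped.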
Affiliation(s)
- Qifang Xu
- Institute for Cancer Research, Fox Chase Cancer Center, 333 Cottman Avenue, Philadelphia, PA 19111, USA
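The greedy step described in the Xu and Dunbrack abstract can be illustrated with a minimal Python sketch. This is not the PDBfam implementation: the `Hit` record, the `greedy_assign` function, and the `max_overlap` threshold of 10 residues are hypothetical names and values chosen for illustration. The sketch ranks candidate domain hits by E-value and accepts each one only if it overlaps every already accepted hit by no more than the allowed number of residues. Joining fragments split by long insertions and the special handling of repeat families are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate domain assignment on a query sequence (hypothetical record)."""
    family: str    # family identifier; the values used in examples are made up
    start: int     # 1-based, inclusive start on the query
    end: int       # 1-based, inclusive end on the query
    evalue: float  # smaller means more significant


def overlap(a: Hit, b: Hit) -> int:
    """Number of query residues shared by two hits (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def greedy_assign(hits, max_overlap=10):
    """Accept hits in order of increasing E-value, tolerating partial overlaps.

    A hit is kept only if it overlaps each previously accepted hit by at most
    `max_overlap` residues (an assumed threshold, not taken from the paper).
    The accepted hits are returned in sequence order.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if all(overlap(hit, kept) <= max_overlap for kept in accepted):
            accepted.append(hit)
    return sorted(accepted, key=lambda h: h.start)
```

Calling `greedy_assign` on three hits where the weakest one lies largely inside the strongest keeps the two strong hits and drops the third, even though the two retained hits share a few residues at their boundary.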
|
48
|
Desaphy J, Azdimousa K, Kellenberger E, Rognan D. Comparison and druggability prediction of protein-ligand binding sites from pharmacophore-annotated cavity shapes. J Chem Inf Model 2012; 52:2287-99. [PMID: 22834646 DOI: 10.1021/ci300184x] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.6] [Indexed: 01/22/2023]
Abstract
Estimating the pairwise similarity of protein-ligand binding sites is a fast and efficient way of predicting cross-reactivity and putative side effects of drug candidates. Among the many tools available, three-dimensional (3D) alignment-dependent methods are usually slow and based on simplified representations of binding site atoms or surfaces. On the other hand, fast and efficient alignment-free methods have recently been described but suffer from a lack of interpretability. We herewith present a novel binding site description (VolSite), coupled to an alignment and comparison tool (Shaper) combining the speed of alignment-free methods with the interpretability of alignment-dependent approaches. It is based on the comparison of negative images of binding cavities encoding both shape and pharmacophoric properties at regularly spaced grid points. Shaper approximates the resulting molecular shape with a smooth Gaussian function and aligns protein binding sites by optimizing their volume overlap. VolSite and Shaper were successfully applied to compare protein-ligand binding sites and to predict their structural druggability.
Affiliation(s)
- Jérémy Desaphy
- Laboratory of Therapeutic Innovation, UMR 7200 Université de Strasbourg/CNRS, Medalis Drug Discovery Center, F-67400 Illkirch, France
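The Gaussian volume overlap mentioned in the Desaphy et al. abstract can be illustrated with a short Python sketch. It is not the Shaper code: the function names, the single shared width parameter `alpha`, and the Tanimoto-style normalization are assumptions made for illustration. Each cavity grid point is modeled as an identical spherical Gaussian, and the overlap of two Gaussians is evaluated with the standard closed-form product integral. Pharmacophore typing of the points and the optimization over rotations and translations are omitted; only the score that such an optimization would maximize is shown.

```python
import math


def gaussian_overlap(a, b, alpha=1.0, p=1.0):
    """Overlap integral of two spherical Gaussians p*exp(-alpha*|r - c|^2).

    `a` and `b` are the 3D centers. With equal widths, the closed form is
    p^2 * exp(-alpha*d^2 / 2) * (pi / (2*alpha))^(3/2), where d is the
    distance between the centers.
    """
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    width_sum = alpha + alpha
    return p * p * math.exp(-alpha * alpha * d2 / width_sum) * (math.pi / width_sum) ** 1.5


def total_overlap(points_a, points_b, alpha=1.0):
    """First-order volume overlap: sum of all pairwise Gaussian overlaps."""
    return sum(gaussian_overlap(a, b, alpha) for a in points_a for b in points_b)


def shape_tanimoto(points_a, points_b, alpha=1.0):
    """Tanimoto-style similarity built from overlaps (assumed normalization)."""
    o_ab = total_overlap(points_a, points_b, alpha)
    o_aa = total_overlap(points_a, points_a, alpha)
    o_bb = total_overlap(points_b, points_b, alpha)
    return o_ab / (o_aa + o_bb - o_ab)
```

Under these assumptions, two identical point sets score 1, and the score falls toward 0 as one set is translated away from the other, which is the behavior an overlap-maximizing alignment relies on.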
|
49
|
Feldmann EA, Seetharaman J, Ramelot TA, Lew S, Zhao L, Hamilton K, Ciccosanti C, Xiao R, Acton TB, Everett JK, Tong L, Montelione GT, Kennedy MA. Solution NMR and X-ray crystal structures of Pseudomonas syringae Pspto_3016 from protein domain family PF04237 (DUF419) adopt a "double wing" DNA binding motif. J Struct Funct Genomics 2012; 13:155-62. [PMID: 22865330 DOI: 10.1007/s10969-012-9140-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 04/05/2012] [Accepted: 07/03/2012] [Indexed: 01/13/2023]
Abstract
The protein Pspto_3016 is a 117-residue member of the protein domain family PF04237 (DUF419), which is to date a functionally uncharacterized family of proteins. In this report, we describe the structure of Pspto_3016 from Pseudomonas syringae solved by both solution NMR and X-ray crystallography at 2.5 Å resolution. In both cases, the structure of Pspto_3016 adopts a "double wing" α/β sandwich fold similar to that of protein YjbR from Escherichia coli and to the C-terminal DNA binding domain of the MotA transcription factor (MotCF) from T4 bacteriophage, along with other uncharacterized proteins. Pspto_3016 was selected by the Protein Structure Initiative of the National Institutes of Health and the Northeast Structural Genomics Consortium (NESG ID PsR293).
Affiliation(s)
- Erik A Feldmann
- Department of Chemistry and Biochemistry, Miami University, Oxford, OH 45056, USA
|
50
|
Kloppmann E, Punta M, Rost B. Structural genomics plucks high-hanging membrane proteins. Curr Opin Struct Biol 2012; 22:326-32. [PMID: 22622032 DOI: 10.1016/j.sbi.2012.05.002] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 02/11/2012] [Revised: 03/28/2012] [Accepted: 05/01/2012] [Indexed: 01/21/2023]
Abstract
Recent years have seen the establishment of structural genomics centers that explicitly target integral membrane proteins. Here, we review the advances in targeting these extremely high-hanging fruits of structural biology in high-throughput mode. We observe that the experimental determination of high-resolution structures of integral membrane proteins is increasingly successful both in terms of getting structures and of covering important protein families, for example, from Pfam. Structural genomics has begun to contribute significantly toward this progress. An important component of this contribution is the setup of robotic pipelines that generate a wealth of experimental data for membrane proteins. We argue that prediction methods for the identification of membrane regions and for the comparison of membrane proteins largely suffice to meet the challenges of target selection for structural genomics of membrane proteins. In contrast, we need better methods to prioritize the most promising members in a family of closely related proteins and to annotate protein function from sequence and structure in the absence of homology.
Affiliation(s)
- Edda Kloppmann
- Department of Bioinformatics and Computational Biology, Technical University Munich, Germany.
|