1
|
Schaeffer RD, Zhang J, Medvedev KE, Kinch LN, Cong Q, Grishin NV. ECOD domain classification of 48 whole proteomes from AlphaFold Structure Database using DPAM2. PLoS Comput Biol 2024; 20:e1011586. [PMID: 38416793 PMCID: PMC10927120 DOI: 10.1371/journal.pcbi.1011586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Revised: 03/11/2024] [Accepted: 02/20/2024] [Indexed: 03/01/2024] Open
Abstract
Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprised of over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared in both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implication of these results for protein target selection for future classification strategies for very large protein sets.
Collapse
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Kirill E. Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| |
Collapse
|
2
|
Kinch LN, Schaeffer RD, Zhang J, Cong Q, Orth K, Grishin N. Insights into virulence: structure classification of the Vibrio parahaemolyticus RIMD mobilome. mSystems 2023; 8:e0079623. [PMID: 38014954 PMCID: PMC10734457 DOI: 10.1128/msystems.00796-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2023] [Accepted: 10/17/2023] [Indexed: 11/29/2023] Open
Abstract
IMPORTANCE The pandemic Vpar strain RIMD causes seafood-borne illness worldwide. Previous comparative genomic studies have revealed pathogenicity islands in RIMD that contribute to the success of the strain in infection. However, not all virulence determinants have been identified, and many of the proteins encoded in known pathogenicity islands are of unknown function. Based on the EOCD database, we used evolution-based classification of structure models for the RIMD proteome to improve our functional understanding of virulence determinants acquired by the pandemic strain. We further identify and classify previously unknown mobile protein domains as well as fast evolving residue positions in structure models that contribute to virulence and adaptation with respect to a pre-pandemic strain. Our work highlights key contributions of phage in mediating seafood born illness, suggesting this strain balances its avoidance of phage predators with its successful colonization of human hosts.
Collapse
Affiliation(s)
- Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Kim Orth
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Nick Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| |
Collapse
|
3
|
Medvedev KE, Schaeffer RD, Pei J, Grishin NV. Pathogenic mutation hotspots in protein kinase domain structure. Protein Sci 2023; 32:e4750. [PMID: 37572333 PMCID: PMC10464295 DOI: 10.1002/pro.4750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2023] [Revised: 08/04/2023] [Accepted: 08/08/2023] [Indexed: 08/14/2023]
Abstract
Control of eukaryotic cellular function is heavily reliant on the phosphorylation of proteins at specific amino acid residues, such as serine, threonine, tyrosine, and histidine. Protein kinases that are responsible for this process comprise one of the largest families of evolutionarily related proteins. Dysregulation of protein kinase signaling pathways is a frequent cause of a large variety of human diseases including cancer, autoimmune, neurodegenerative, and cardiovascular disorders. In this study, we mapped all pathogenic mutations in 497 human protein kinase domains from the ClinVar database to the reference structure of Aurora kinase A (AURKA) and grouped them by the relevance to the disease type. Our study revealed that the majority of mutation hotspots associated with cancer are situated within the catalytic and activation loops of the kinase domain, whereas non-cancer-related hotspots tend to be located outside of these regions. Additionally, we identified a hotspot at residue R371 of the AURKA structure that has the highest number of exclusively non-cancer-related pathogenic mutations (21) and has not been previously discussed.
Collapse
Affiliation(s)
- Kirill E. Medvedev
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - R. Dustin Schaeffer
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - Jimin Pei
- Eugene McDermott Center for Human Growth and DevelopmentUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - Nick V. Grishin
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Department of BiochemistryUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| |
Collapse
|
4
|
Medvedev KE, Schaeffer RD, Chen KS, Grishin NV. Pan-cancer structurome reveals overrepresentation of beta sandwiches and underrepresentation of alpha helical domains. Sci Rep 2023; 13:11988. [PMID: 37491511 PMCID: PMC10368619 DOI: 10.1038/s41598-023-39273-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Accepted: 07/22/2023] [Indexed: 07/27/2023] Open
Abstract
The recent progress in the prediction of protein structures marked a historical milestone. AlphaFold predicted 200 million protein models with an accuracy comparable to experimental methods. Protein structures are widely used to understand evolution and to identify potential drug targets for the treatment of various diseases, including cancer. Thus, these recently predicted structures might convey previously unavailable information about cancer biology. Evolutionary classification of protein domains is challenging and different approaches exist. Recently our team presented a classification of domains from human protein models released by AlphaFold. Here we evaluated the pan-cancer structurome, domains from over and under expressed proteins in 21 cancer types, using the broadest levels of the ECOD classification: the architecture (A-groups) and possible homology (X-groups) levels. Our analysis reveals that AlphaFold has greatly increased the three-dimensional structural landscape for proteins that are differentially expressed in these 21 cancer types. We show that beta sandwich domains are significantly overrepresented and alpha helical domains are significantly underrepresented in the majority of cancer types. Our data suggest that the prevalence of the beta sandwiches is due to the high levels of immunoglobulins and immunoglobulin-like domains that arise during tumor development-related inflammation. On the other hand, proteins with exclusively alpha domains are important elements of homeostasis, apoptosis and transmembrane transport. Therefore cancer cells tend to reduce representation of these proteins to promote successful oncogeneses.
Collapse
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Kenneth S Chen
- Department of Pediatrics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Children's Medical Center Research Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| |
Collapse
|
5
|
Schaeffer RD, Zhang J, Kinch LN, Pei J, Cong Q, Grishin NV. Classification of domains in predicted structures of the human proteome. Proc Natl Acad Sci U S A 2023; 120:e2214069120. [PMID: 36917664 PMCID: PMC10041065 DOI: 10.1073/pnas.2214069120] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2022] [Accepted: 02/06/2023] [Indexed: 03/16/2023] Open
Abstract
Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website (http://prodata.swmed.edu/ecod/index_human.php).
Collapse
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX75390
| | - Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX75390
| | - Lisa N. Kinch
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX75390
- HHMI, University of Texas Southwestern Medical Center, Dallas, TX75390
| | - Jimin Pei
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX75390
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX75390
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX75390
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX75390
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX75390
| |
Collapse
|
6
|
Zhang J, Schaeffer RD, Durham J, Cong Q, Grishin NV. DPAM: A domain parser for AlphaFold models. Protein Sci 2023; 32:e4548. [PMID: 36539305 PMCID: PMC9850437 DOI: 10.1002/pro.4548] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2022] [Revised: 12/06/2022] [Accepted: 12/13/2022] [Indexed: 01/20/2023]
Abstract
The recent breakthroughs in structure prediction, where methods such as AlphaFold demonstrated near-atomic accuracy, herald a paradigm shift in structural biology. The 200 million high-accuracy models released in the AlphaFold Database are expected to guide protein science in the coming decades. Partitioning these AlphaFold models into domains and assigning them to an evolutionary hierarchy provide an efficient way to gain functional insights into proteins. However, classifying such a large number of predicted structures challenges the infrastructure of current structure classifications, including our Evolutionary Classification of protein Domains (ECOD). Better computational tools are urgently needed to parse and classify domains from AlphaFold models automatically. Here we present a Domain Parser for AlphaFold Models (DPAM) that can automatically recognize globular domains from these models based on inter-residue distances in 3D structures, predicted aligned errors, and ECOD domains found by sequence (HHsuite) and structural (Dali) similarity searches. Based on a benchmark of 18,759 AlphaFold models, we demonstrate that DPAM can recognize 98.8% of domains and assign correct boundaries for 87.5%, significantly outperforming structure-based domain parsers and homology-based domain assignment using ECOD domains found by HHsuite or Dali. Application of DPAM to the massive AlphaFold models will enable efficient classification of domains, providing evolutionary contexts and facilitating functional studies.
Collapse
Affiliation(s)
- Jing Zhang
- Eugene McDermott Center for Human Growth and DevelopmentUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Harold C. Simmons Comprehensive Cancer CenterUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - R. Dustin Schaeffer
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - Jesse Durham
- Eugene McDermott Center for Human Growth and DevelopmentUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Harold C. Simmons Comprehensive Cancer CenterUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and DevelopmentUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Harold C. Simmons Comprehensive Cancer CenterUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| | - Nick V. Grishin
- Department of BiophysicsUniversity of Texas Southwestern Medical CenterDallasTexasUSA
- Department of BiochemistryUniversity of Texas Southwestern Medical CenterDallasTexasUSA
| |
Collapse
|
7
|
Schaeffer RD, Kinch L, Kryshtafovych A, Grishin NV. Assessment of domain interactions in the fourteenth round of the Critical Assessment of Structure Prediction (CASP14). Proteins 2021; 89:1700-1710. [PMID: 34455641 PMCID: PMC8616818 DOI: 10.1002/prot.26225] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Revised: 08/07/2021] [Accepted: 08/24/2021] [Indexed: 12/29/2022]
Abstract
The high accuracy of some CASP14 models at the domain level prompted a more detailed evaluation of structure predictions on whole targets. For the first time in critical assessment of structure prediction (CASP), we evaluated accuracy of difficult domain assembly in models submitted for multidomain targets where the community predicted individual evaluation units (EUs) with greater accuracy than full-length targets. Ten proteins with domain interactions that did not show evidence of conformational change and were not involved in significant oligomeric contacts were chosen as targets for the domain interaction assessment. Groups were ranked using complementary interaction scores (F1, QS score, and Jaccard coefficient), and their predictions were evaluated for their ability to correctly model inter-domain interfaces and overall protein folds. Target performance was broadly grouped into two clusters. The first consisted primarily of targets containing two EUs wherein predictors more broadly predicted domain positioning and interfacial contacts correctly. The other consisted of complex two-EU and three-EU targets where few predictors performed well. The highest ranked predictor, AlphaFold2, produced high-accuracy models on eight out of 10 targets. Their interdomain scores on three of these targets were significantly higher than all other groups and were responsible for their overall outperformance in the category. We further highlight the performance of AlphaFold2 and the next best group, BAKER-experimental on several interesting targets.
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Department of Biophysics, UT Southwestern Medical Center, Dallas, Texas, USA
| | - Lisa Kinch
- Howard Hughes Medical Institute, UT Southwestern Medical Center, Dallas, Texas, USA
| | - Andriy Kryshtafovych
- Protein Structure Prediction Center, Genome and Biomedical Sciences Facilities, University of California, Davis, California, USA
| | - Nick V Grishin
- Department of Biophysics, UT Southwestern Medical Center, Dallas, Texas, USA.,Howard Hughes Medical Institute, UT Southwestern Medical Center, Dallas, Texas, USA
| |
Collapse
|
8
|
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021; 373:871-876. [PMID: 34282049 PMCID: PMC7612213 DOI: 10.1126/science.abj8754] [Citation(s) in RCA: 2053] [Impact Index Per Article: 684.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2021] [Accepted: 07/07/2021] [Indexed: 01/17/2023]
Abstract
DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.
Collapse
Affiliation(s)
- Minkyung Baek
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Frank DiMaio
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Justas Dauparas
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Sergey Ovchinnikov
- Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138, USA
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA 02138, USA
| | - Gyu Rie Lee
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Jue Wang
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Claudia Millán
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - Hahnbeom Park
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Carson Adams
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Caleb R Glassman
- Program in Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Andy DeGiovanni
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Jose H Pereira
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Andria V Rodrigues
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Alberdina A van Dijk
- Department of Biochemistry, Focus Area Human Metabolomics, North-West University, 2531 Potchefstroom, South Africa
| | - Ana C Ebrecht
- Department of Biochemistry, Focus Area Human Metabolomics, North-West University, 2531 Potchefstroom, South Africa
| | - Diederik J Opperman
- Department of Biotechnology, University of the Free State, 205 Nelson Mandela Drive, Bloemfontein 9300, South Africa
| | - Theo Sagmeister
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
| | - Christoph Buhlheller
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
- Medical University of Graz, Graz, Austria
| | - Tea Pavkov-Keller
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
- BioTechMed-Graz, Graz, Austria
| | - Manoj K Rathinaswamy
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada
| | - Udit Dalwadi
- Life Sciences Institute, Department of Biochemistry and Molecular Biology, The University of British Columbia, Vancouver, BC, Canada
| | - Calvin K Yip
- Life Sciences Institute, Department of Biochemistry and Molecular Biology, The University of British Columbia, Vancouver, BC, Canada
| | - John E Burke
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada
| | - K Christopher Garcia
- Program in Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Paul D Adams
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Randy J Read
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA.
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| |
Collapse
|
9
|
Kinch LN, Schaeffer RD, Kryshtafovych A, Grishin NV. Target classification in the 14th round of the critical assessment of protein structure prediction (CASP14). Proteins 2021; 89:1618-1632. [PMID: 34350630 DOI: 10.1002/prot.26202] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2021] [Revised: 06/21/2021] [Accepted: 07/11/2021] [Indexed: 12/14/2022]
Abstract
An evolutionary-based definition and classification of target evaluation units (EUs) is presented for the 14th round of the critical assessment of structure prediction (CASP14). CASP14 targets included 84 experimental models submitted by various structural groups (designated T1024-T1101). Targets were split into EUs based on the domain organization of available templates and performance of server groups. Several targets required splitting (19 out of 25 multidomain targets) due in part to observed conformation changes. All in all, 96 CASP14 EUs were defined and assigned to tertiary structure assessment categories (Topology-based FM or High Accuracy-based TBM-easy and TBM-hard) considering their evolutionary relationship to existing ECOD fold space: 24 family level, 50 distant homologs (H-group), 12 analogs (X-group), and 10 new folds. Principal component analysis and heatmap visualization of sequence and structure similarity to known templates as well as performance of servers highlighted trends in CASP14 target difficulty. The assigned evolutionary levels (i.e., H-groups) and assessment classes (i.e., FM) displayed overlapping clusters of EUs. Many viral targets diverged considerably from their template homologs and thus were more difficult for prediction than other homology-related targets. On the other hand, some targets did not have sequence-identifiable templates, but were predicted better than expected due to relatively simple arrangements of secondary structural elements. An apparent improvement in overall server performance in CASP14 further complicated traditional classification, which ultimately assigned EUs into high-accuracy modeling (27 TBM-easy and 31 TBM-hard), topology (23 FM), or both (15 FM/TBM).
Collapse
Affiliation(s)
- Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | | | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA.,Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA.,Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| |
Collapse
|
10
|
Kinch LN, Pei J, Kryshtafovych A, Schaeffer RD, Grishin NV. Topology evaluation of models for difficult targets in the 14th round of the critical assessment of protein structure prediction. Proteins 2021; 89:1673-1686. [PMID: 34240477 DOI: 10.1002/prot.26172] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2021] [Revised: 06/28/2021] [Accepted: 07/01/2021] [Indexed: 12/25/2022]
Abstract
This report describes the tertiary structure prediction assessment of difficult modeling targets in the 14th round of the Critical Assessment of Structure Prediction (CASP14). We implemented an official ranking scheme that used the same scores as the previous CASP topology-based assessment, but combined these scores with one that emphasized physically realistic models. The top performing AlphaFold2 group outperformed the rest of the prediction community on all but two of the difficult targets considered in this assessment. They provided high quality models for most of the targets (86% over GDT_TS 70), including larger targets above 150 residues, and they correctly predicted the topology of almost all the rest. AlphaFold2 performance was followed by two manual Baker methods, a Feig method that refined Zhang-server models, two notable automated Zhang server methods (QUARK and Zhang-server), and a Zhang manual group. Despite the remarkable progress in protein structure prediction of difficult targets, both the prediction community and AlphaFold2, to a lesser extent, faced challenges with flexible regions and obligate oligomeric assemblies. The official ranking of top-performing methods was supported by performance generated PCA and heatmap clusters that gave insight into target difficulties and the most successful state-of-the-art structure prediction methodologies.
Collapse
Affiliation(s)
- Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | | | - R Dustin Schaeffer
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, USA.,Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, USA.,Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, USA
| |
Collapse
|
11
|
Schaeffer RD, Kinch LN, Pei J, Medvedev KE, Grishin NV. Completeness and Consistency in Structural Domain Classifications. ACS Omega 2021; 6:15698-15707. [PMID: 34179613 PMCID: PMC8223206 DOI: 10.1021/acsomega.1c00950] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/22/2021] [Accepted: 05/25/2021] [Indexed: 06/13/2023]
Abstract
Domain classifications are a useful resource for computational analysis of the protein structure, but elements of their composition are often opaque to potential users. We perform a comparative analysis of our classification ECOD against the SCOPe, SCOP2, and CATH domain classifications with respect to their constituent domain boundaries and hierarchal organization. The coverage of these domain classifications with respect to ECOD and to the PDB was assessed by structure and by sequence. We also conducted domain pair analysis to determine broad differences in hierarchy between domains shared by ECOD and other classifications. Finally, we present domains from the major facilitator superfamily (MFS) of transporter proteins and provide evidence that supports their split into domains and for multiple conformations within these families. We find that the ECOD and CATH provide the most extensive structural coverage of the PDB. ECOD and SCOPe have the most consistent domain boundary conditions, whereas CATH and SCOP2 both differ significantly.
Collapse
Affiliation(s)
- R. Dustin Schaeffer
- Departments
of Biophysics and Biochemistry, University
of Texas Southwestern Medical Center, Dallas, Texas 75390, United States
| | - Lisa N. Kinch
- Howard
Hughes Medical Institute, University of
Texas Southwestern Medical Center, Dallas, Texas 75390, United States
| | - Jimin Pei
- Howard
Hughes Medical Institute, University of
Texas Southwestern Medical Center, Dallas, Texas 75390, United States
| | - Kirill E. Medvedev
- Departments
of Biophysics and Biochemistry, University
of Texas Southwestern Medical Center, Dallas, Texas 75390, United States
| | - Nick V. Grishin
- Departments
of Biophysics and Biochemistry, University
of Texas Southwestern Medical Center, Dallas, Texas 75390, United States
- Howard
Hughes Medical Institute, University of
Texas Southwestern Medical Center, Dallas, Texas 75390, United States
| |
Collapse
|
12
|
Medvedev KE, Kinch LN, Dustin Schaeffer R, Pei J, Grishin NV. A Fifth of the Protein World: Rossmann-like Proteins as an Evolutionarily Successful Structural unit. J Mol Biol 2020; 433:166788. [PMID: 33387532 DOI: 10.1016/j.jmb.2020.166788] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2020] [Revised: 11/26/2020] [Accepted: 12/18/2020] [Indexed: 10/22/2022]
Abstract
The Rossmann-like fold is the most prevalent and diversified doubly-wound superfold of ancient evolutionary origin. Rossmann-like domains are present in a variety of metabolic enzymes and are capable of binding diverse ligands. Discerning evolutionary relationships among these domains is challenging because of their diverse functions and ancient origin. We defined a minimal Rossmann-like structural motif (RLM), identified RLM-containing domains among known 3D structures (20%) and classified them according to their homologous relationships. New classifications were incorporated into our Evolutionary Classification of protein Domains (ECOD) database. We defined 156 homology groups (H-groups), which were further clustered into 123 possible homology groups (X-groups). Our analysis revealed that RLM-containing proteins constitute approximately 15% of the human proteome. We found that disease-causing mutations are more frequent within RLM domains than within non-RLM domains of these proteins, highlighting the importance of RLM-containing proteins for human health.
Collapse
Affiliation(s)
- Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, United States.
| | - Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, United States; Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, United States; Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, United States.
| |
Collapse
|
13
|
Medvedev KE, Kinch LN, Schaeffer RD, Grishin NV. Functional analysis of Rossmann-like domains reveals convergent evolution of topology and reaction pathways. PLoS Comput Biol 2019; 15:e1007569. [PMID: 31869345 PMCID: PMC6957218 DOI: 10.1371/journal.pcbi.1007569] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2019] [Revised: 01/13/2020] [Accepted: 11/26/2019] [Indexed: 12/18/2022] Open
Abstract
Rossmann folds are ancient, frequently diverged domains found in many biological reaction pathways where they have adapted for different functions. Consequently, discernment and classification of their homologous relations and function can be complicated. We define a minimal Rossmann-like structure motif (RLM) that corresponds for the common core of known Rossmann domains and use this motif to identify all RLM domains in the Protein Data Bank (PDB), thus finding they constitute about 20% of all known 3D structures. The Evolutionary Classification of protein structure Domains (ECOD) classifies RLM domains in a number of groups that lack evidence for homology (X-groups), which suggests that they could have evolved independently multiple times. Closely related, homologous RLM enzyme families can diverge to bind different ligands using similar binding sites and to catalyze different reactions. Conversely, non-homologous RLM domains can converge to catalyze the same reactions or to bind the same ligand with alternate binding modes. We discuss a special case of such convergent evolution that is relevant to the polypharmacology paradigm, wherein the same drug (methotrexate) binds to multiple non-homologous RLM drug targets with different topologies. Finally, assigning proteins with RLM domain to the Enzyme Commission classification suggest that RLM enzymes function mainly in metabolism (and comprise 38% of reference metabolic pathways) and are overrepresented in extant pathways that represent ancient biosynthetic routes such as nucleotide metabolism, energy metabolism, and metabolism of amino acids. In fact, RLM enzymes take part in five out of eight enzymatic reactions of the Wood-Ljungdahl metabolic pathway thought to be used by the last universal common ancestor (LUCA). The prevalence of RLM domains in this ancient metabolism might explain their wide distribution among enzymes.
Collapse
Affiliation(s)
- Kirill E. Medvedev
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Lisa N. Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - R. Dustin Schaeffer
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Nick V. Grishin
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| |
Collapse
|
14
|
Liao Y, Schaeffer RD, Pei J, Grishin NV. A sequence family database built on ECOD structural domains. Bioinformatics 2019; 34:2997-3003. [PMID: 29659718 DOI: 10.1093/bioinformatics/bty214] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2018] [Accepted: 04/03/2018] [Indexed: 11/12/2022] Open
Abstract
Motivation The ECOD database classifies protein domains based on their evolutionary relationships, considering both remote and close homology. The family group in ECOD provides classification of domains that are closely related to each other based on sequence similarity. Due to different perspectives on domain definition, direct application of existing sequence domain databases, such as Pfam, to ECOD struggles with several shortcomings. Results We created multiple sequence alignments and profiles from ECOD domains with the help of structural information in alignment building and boundary delineation. We validated the alignment quality by scoring structure superposition to demonstrate that they are comparable to curated seed alignments in Pfam. Comparison to Pfam and CDD reveals that 27 and 16% of ECOD families are new, but they are also dominated by small families, likely because of the sampling bias from the PDB database. There are 35 and 48% of families whose boundaries are modified comparing to counterparts in Pfam and CDD, respectively. Availability and implementation The new families are now integrated in the ECOD website. The aggregate HMMER profile library and alignment are available for download on ECOD website (http://prodata.swmed.edu/ecod). Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yuxing Liao
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - R Dustin Schaeffer
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Jimin Pei
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Nick V Grishin
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| |
Collapse
|
15
|
Schaeffer RD, Kinch L, Medvedev KE, Pei J, Cheng H, Grishin N. ECOD: identification of distant homology among multidomain and transmembrane domain proteins. BMC Mol Cell Biol 2019; 20:18. [PMID: 31226926 PMCID: PMC6588880 DOI: 10.1186/s12860-019-0204-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2019] [Accepted: 06/02/2019] [Indexed: 12/03/2022] Open
Abstract
The manual classification of protein domains is approaching its 20th anniversary. ECOD is our mixed manual-automatic domain classification. Over time, the types of proteins which require manual curation has changed. Depositions with complex multidomain and multichain arrangements are commonplace. Transmembrane domains are regularly classified. Repeatedly, domains which are initially believed to be novel are found to have homologous links to existing classified domains. Here we present a brief summary of recent manual curation efforts in ECOD generally combined with specific case studies of transmembrane and multidomain proteins wherein manual curation was useful for discovering new homologous relationships. We present a new taxonomy for the classification of ABC transporter transmembrane domains. We examine alternate topologies of the leucine-specific (LS) domain of Leucine tRNA-synthetase. Finally, we elaborate on a distant homologous links between two helical dimerization domains.
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA.
| | - Lisa Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
| | - Kirill E Medvedev
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
| | - Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
| | - Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
| | - Nick Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390-9050, USA
| |
Collapse
|
16
|
Affiliation(s)
- R. Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center; Dallas Texas
| | - Yuxing Liao
- Department of Biophysics, University of Texas Southwestern Medical Center; Dallas Texas
| | - Nick V. Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center; Dallas Texas
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center; Dallas Texas
| |
Collapse
|
17
|
Schaeffer RD, Liao Y, Cheng H, Grishin NV. ECOD: new developments in the evolutionary classification of domains. Nucleic Acids Res 2016; 45:D296-D302. [PMID: 27899594 PMCID: PMC5210594 DOI: 10.1093/nar/gkw1137] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2016] [Revised: 10/25/2016] [Accepted: 11/16/2016] [Indexed: 12/21/2022] Open
Abstract
Evolutionary Classification Of protein Domains (ECOD) (http://prodata.swmed.edu/ecod) comprehensively classifies protein with known spatial structures maintained by the Protein Data Bank (PDB) into evolutionary groups of protein domains. ECOD relies on a combination of automatic and manual weekly updates to achieve its high accuracy and coverage with a short update cycle. ECOD classifies the approximately 120 000 depositions of the PDB into more than 500 000 domains in ∼3400 homologous groups. We show the performance of the weekly update pipeline since the release of ECOD, describe improvements to the ECOD website and available search options, and discuss novel structures and homologous groups that have been classified in the recent updates. Finally, we discuss the future directions of ECOD and further improvements planned for the hierarchy and update process.
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
| | - Yuxing Liao
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
| | - Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA.,Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX 75390-9050, USA
| |
Collapse
|
18
|
Li W, Schaeffer RD, Otwinowski Z, Grishin NV. Estimation of Uncertainties in the Global Distance Test (GDT_TS) for CASP Models. PLoS One 2016; 11:e0154786. [PMID: 27149620 PMCID: PMC4858170 DOI: 10.1371/journal.pone.0154786] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2016] [Accepted: 04/19/2016] [Indexed: 11/19/2022] Open
Abstract
The Critical Assessment of techniques for protein Structure Prediction (or CASP) is a community-wide blind test experiment to reveal the best accomplishments of structure modeling. Assessors have been using the Global Distance Test (GDT_TS) measure to quantify prediction performance since CASP3 in 1998. However, identifying significant score differences between close models is difficult because of the lack of uncertainty estimations for this measure. Here, we utilized the atomic fluctuations caused by structure flexibility to estimate the uncertainty of GDT_TS scores. Structures determined by nuclear magnetic resonance are deposited as ensembles of alternative conformers that reflect the structural flexibility, whereas standard X-ray refinement produces the static structure averaged over time and space for the dynamic ensembles. To recapitulate the structural heterogeneous ensemble in the crystal lattice, we performed time-averaged refinement for X-ray datasets to generate structural ensembles for our GDT_TS uncertainty analysis. Using those generated ensembles, our study demonstrates that the time-averaged refinements produced structure ensembles with better agreement with the experimental datasets than the averaged X-ray structures with B-factors. The uncertainty of the GDT_TS scores, quantified by their standard deviations (SDs), increases for scores lower than 50 and 70, with maximum SDs of 0.3 and 1.23 for X-ray and NMR structures, respectively. We also applied our procedure to the high accuracy version of GDT-based score and produced similar results with slightly higher SDs. To facilitate score comparisons by the community, we developed a user-friendly web server that produces structure ensembles for NMR and X-ray structures and is accessible at http://prodata.swmed.edu/SEnCS. Our work helps to identify the significance of GDT_TS score differences, as well as to provide structure ensembles for estimating SDs of any scores.
Collapse
Affiliation(s)
- Wenlin Li
- Department of Biochemistry and Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390–9050, United States of America
| | - R. Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390–9050, United States of America
| | - Zbyszek Otwinowski
- Department of Biochemistry and Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390–9050, United States of America
| | - Nick V. Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390–9050, United States of America
- Department of Biochemistry and Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390–9050, United States of America
- * E-mail:
| |
Collapse
|
19
|
Schaeffer RD, Kinch LN, Liao Y, Grishin NV. Classification of proteins with shared motifs and internal repeats in the ECOD database. Protein Sci 2016; 25:1188-203. [PMID: 26833690 DOI: 10.1002/pro.2893] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2015] [Revised: 01/23/2016] [Accepted: 01/27/2016] [Indexed: 12/19/2022]
Abstract
Proteins and their domains evolve by a set of events commonly including the duplication and divergence of small motifs. The presence of short repetitive regions in domains has generally constituted a difficult case for structural domain classifications and their hierarchies. We developed the Evolutionary Classification Of protein Domains (ECOD) in part to implement a new schema for the classification of these types of proteins. Here we document the ways in which ECOD classifies proteins with small internal repeats, widespread functional motifs, and assemblies of small domain-like fragments in its evolutionary schema. We illustrate the ways in which the structural genomics project impacted the classification and characterization of new structural domains and sequence families over the decade.
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
| | - Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
| | - Yuxing Liao
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050.,Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, Texas, 75390-9050
| |
Collapse
|
20
|
Kinch LN, Li W, Schaeffer RD, Dunbrack RL, Monastyrskyy B, Kryshtafovych A, Grishin NV. CASP 11 target classification. Proteins 2016; 84 Suppl 1:20-33. [PMID: 26756794 DOI: 10.1002/prot.24982] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2015] [Revised: 12/22/2015] [Accepted: 01/05/2016] [Indexed: 11/09/2022]
Abstract
Protein target structures for the Critical Assessment of Structure Prediction round 11 (CASP11) and CASP ROLL were split into domains and classified into categories suitable for assessment of template-based modeling (TBM) and free modeling (FM) based on their evolutionary relatedness to existing structures classified by the Evolutionary Classification of Protein Domains (ECOD) database. First, target structures were divided into domain-based evaluation units. Target splits were based on the domain organization of available templates as well as the performance of servers on whole targets compared to split target domains. Second, evaluation units were classified into TBM and FM categories using a combination of measures that evaluate prediction quality and template detectability. Generally, target domains with sequence-related templates and good server prediction performance were classified as TBM, whereas targets without sequence-identifiable templates and low server performance were classified as FM. As in previous CASP experiments, the boundaries for classification were blurred due to the presence of significant insertions and deteriorations in the targets with respect to homologous templates, as well as the presence of templates with partial coverage of new folds. The FM category included 45 target domains, which represents an unprecedented number of difficult CASP targets provided for modeling. Proteins 2016; 84(Suppl 1):20-33. © 2016 Wiley Periodicals, Inc.
Collapse
Affiliation(s)
- Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050.
| | - Wenlin Li
- Department of Biophysics, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050.,Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050
| | - R Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050
| | - Roland L Dunbrack
- Institute for Cancer Research, 333 Cottman Avenue, Philadelphia, 19111, Pennsylvania Fox Chase Cancer Center
| | - Bohdan Monastyrskyy
- Genome Center, University of California, 451 Health Sciences Drive, Davis, 95616, California
| | - Andriy Kryshtafovych
- Genome Center, University of California, 451 Health Sciences Drive, Davis, 95616, California
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050.,Department of Biophysics, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050.,Department of Biochemistry, University of Texas Southwestern Medical Center at Dallas, Dallas, Texas 75390-9050
| |
Collapse
|
21
|
Cheng H, Liao Y, Schaeffer RD, Grishin NV. Manual classification strategies in the ECOD database. Proteins 2015; 83:1238-51. [PMID: 25917548 DOI: 10.1002/prot.24818] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2015] [Revised: 03/30/2015] [Accepted: 04/19/2015] [Indexed: 12/28/2022]
Abstract
ECOD (Evolutionary Classification Of protein Domains) is a comprehensive and up-to-date protein structure classification database. The majority of new structures released from the PDB (Protein Data Bank) each week already have close homologs in the ECOD hierarchy and thus can be reliably partitioned into domains and classified by software without manual intervention. However, those proteins that lack confidently detectable homologs require careful analysis by experts. Although many bioinformatics resources rely on expert curation to some degree, specific examples of how this curation occurs and in what cases it is necessary are not always described. Here, we illustrate the manual classification strategy in ECOD by example, focusing on two major issues in protein classification: domain partitioning and the relationship between homology and similarity scores. Most examples show recently released and manually classified PDB structures. We discuss multi-domain proteins, discordance between sequence and structural similarities, difficulties with assessing homology with scores, and integral membrane proteins homologous to soluble proteins. By timely assimilation of newly available structures into its hierarchy, ECOD strives to provide a most accurate and updated view of the protein structure world as a result of combined computational and expert-driven analysis.
Collapse
Affiliation(s)
- Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - Yuxing Liao
- Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - R Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| | - Nick V Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, 75390.,Department of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, 75390
| |
Collapse
|
22
|
Cheng H, Schaeffer RD, Liao Y, Kinch LN, Pei J, Shi S, Kim BH, Grishin NV. ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 2014; 10:e1003926. [PMID: 25474468 PMCID: PMC4256011 DOI: 10.1371/journal.pcbi.1003926] [Citation(s) in RCA: 225] [Impact Index Per Article: 22.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2014] [Accepted: 09/22/2014] [Indexed: 01/02/2023] Open
Abstract
Understanding the evolution of a protein, including both close and distant relationships, often reveals insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or "fold"). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating the potential of how ECOD can be used to further biological and evolutionary studies.
Collapse
Affiliation(s)
- Hua Cheng
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - R. Dustin Schaeffer
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Yuxing Liao
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Lisa N. Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Jimin Pei
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Shuoyong Shi
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Bong-Hyun Kim
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
| | - Nick V. Grishin
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- Departments of Biophysics and Biochemistry, University of Texas Southwestern Medical Center, Dallas, Texas, United States of America
- * E-mail:
| |
Collapse
|
23
|
Dar TA, Schaeffer RD, Daggett V, Bowler BE. Manifestations of native topology in the denatured state ensemble of Rhodopseudomonas palustris cytochrome c'. Biochemistry 2011; 50:1029-41. [PMID: 21190388 PMCID: PMC3329124 DOI: 10.1021/bi101551h] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
To provide insight into the role of local sequence in the nonrandom coil behavior of the denatured state, we have extended our measurements of histidine-heme loop formation equilibria for cytochrome c' to 6 M guanidine hydrochloride. We observe that there is some reduction in the scatter about the best fit line of loop stability versus loop size data in 6 M versus 3 M guanidine hydrochloride, but the scatter is not eliminated. The scaling exponent, ν(3), of 2.5 ± 0.2 is also similar to that found previously in 3 M guanidine hydrochloride (2.6 ± 0.3). Rates of histidine-heme loop breakage in the denatured state of cytochrome c' show that some histidine-heme loops are significantly more persistent than others at both 3 and 6 M guanidine hydrochloride. Rates of histidine-heme loop formation more closely approximate random coil behavior. This observation indicates that heterogeneity in the denatured state ensemble results mainly from contact persistence. When mapped onto the structure of cytochrome c', the histidine-heme loops with slow breakage rates coincide with chain reversals between helices 1 and 2 and between helices 2 and 3. Molecular dynamics simulations of the unfolding of cytochrome c' at 498 K show that these reverse turns persist in the unfolded state. Thus, these portions of the primary structure of cytochrome c' set up the topology of cytochrome c' in the denatured state, predisposing the protein to fold efficiently to its native structure.
Collapse
Affiliation(s)
- Tanveer A. Dar
- Department of Chemistry & Biochemistry, Center for Biomolecular Structure & Dynamics, University of Montana, Missoula, Montana, 59812, USA
| | - R. Dustin Schaeffer
- Biomolecular Structure & Design Program, University of Washington, Seattle, WA 98195 USA
| | - Valerie Daggett
- Biomolecular Structure & Design Program, University of Washington, Seattle, WA 98195 USA
- Department of Bioengineering, University of Washington, Seattle, WA 98195-5013 USA
| | - Bruce E. Bowler
- Department of Chemistry & Biochemistry, Center for Biomolecular Structure & Dynamics, University of Montana, Missoula, Montana, 59812, USA
| |
Collapse
|
24
|
Abstract
The classification of protein folds is necessarily based on the structural elements that distinguish domains. Classification of protein domains consists of two problems: the partition of structures into domains and the classification of domains into sets of similar structures (or folds). Although similar topologies may arise by convergent evolution, the similarity of their respective folding pathways is unknown. The discovery and the characterization of the majority of protein folds will be followed by a similar enumeration of available protein folding pathways. Consequently, understanding the intricacies of structural domains is necessary to understanding their collective folding pathways. We review the current state of the art in the field of protein domain classification and discuss methods for the systematic and comprehensive study of protein folding across protein fold space via atomistic molecular dynamics simulation. Finally, we discuss our large-scale Dynameomics project, which includes simulations of representatives of all autonomous protein folds.
Collapse
Affiliation(s)
| | - Valerie Daggett
- Department of Bioengineering, University of Washington, Seattle, WA 98195-5013, USA
| |
Collapse
|
25
|
Abstract
AbstractAll currently known structures of proteins together define ‘protein fold space’. To increase the general understanding of protein dynamics and protein folding, we selected a set of 807 proteins and protein domains that represent 95% of the currently known autonomous folded domains present in globular proteins. Native state and unfolding simulations of these representatives are now complete and accessible via a novel database containing over 11 000 simulations. Because protein folding is a microscopically reversible process, these simulations effectively sample protein folding across all of protein fold space. Here, we give an overview of how the representative proteins were selected and how the simulations were performed and validated. We then provide examples of different types of analyses that can be performed across our large set of simulations, made possible by the database approach. We further show how the unfolding simulations can be used to compare unfolding of structural elements in isolation and in different structural contexts, using as an example a short, triple stranded β-sheet that forms the WW domain and is present in several larger unrelated proteins.
Collapse
Affiliation(s)
- Amanda L. Jonsson
- 1Department of Bioengineering, University of Washington, Box 355013, Seattle, WA 98195-5013, USA
| | - R. Dustin Schaeffer
- 2Biomolecular Structure and Design Program, University of Washington, Box 355013, Seattle, WA 98195-5013, USA
| | - Marc W. van der Kamp
- 1Department of Bioengineering, University of Washington, Box 355013, Seattle, WA 98195-5013, USA
| | | |
Collapse
|
26
|
Abstract
MOTIVATION The discovery of new protein folds is a relatively rare occurrence even as the rate of protein structure determination increases. This rarity reinforces the concept of folds as reusable units of structure and function shared by diverse proteins. If the folding mechanism of proteins is largely determined by their topology, then the folding pathways of members of existing folds could encompass the full set used by globular protein domains. RESULTS We have used recent versions of three common protein domain dictionaries (SCOP, CATH and Dali) to generate a consensus domain dictionary (CDD). Surprisingly, 40% of the metafolds in the CDD are not composed of autonomous structural domains, i.e. they are not plausible independent folding units. This finding has serious ramifications for bioinformatics studies mining these domain dictionaries for globular protein properties. However, our main purpose in deriving this CDD was to generate an updated CDD to choose targets for MD simulation as part of our dynameomics effort, which aims to simulate the native and unfolding pathways of representatives of all globular protein consensus folds (metafolds). Consequently, we also compiled a list of representative protein targets of each metafold in the CDD. AVAILABILITY AND IMPLEMENTATION This domain dictionary is available at www.dynameomics.org.
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Biomolecular Structure and Design Program, University of Washington, Seattle, WA 98195-5013, USA
| | | | | | | |
Collapse
|
27
|
Beck DAC, Jonsson AL, Schaeffer RD, Scott KA, Day R, Toofanny RD, Alonso DOV, Daggett V. Dynameomics: mass annotation of protein dynamics and unfolding in water by high-throughput atomistic molecular dynamics simulations. Protein Eng Des Sel 2008; 21:353-68. [PMID: 18411224 DOI: 10.1093/protein/gzn011] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The goal of Dynameomics is to perform atomistic molecular dynamics (MD) simulations of representative proteins from all known folds in explicit water in their native state and along their thermal unfolding pathways. Here we present 188-fold representatives and their native state simulations and analyses. These 188 targets represent 67% of all the structures in the Protein Data Bank. The behavior of several specific targets is highlighted to illustrate general properties in the full dataset and to demonstrate the role of MD in understanding protein function and stability. As an example of what can be learned from mining the Dynameomics database, we identified a protein fold with heightened localized dynamics. In one member of this fold family, the motion affects the exposure of its phosphorylation site and acts as an entropy sink to offset another portion of the protein that is relatively immobile in order to present a consistent interface for protein docking. In another member of this family, a polymorphism in the highly mobile region leads to a host of disease phenotypes. We have constructed a web site to provide access to a novel hybrid relational/multidimensional database (described in the succeeding two papers) to view and interrogate simulations of the top 30 targets: http://www.dynameomics.org. The Dynameomics database, currently the largest collection of protein simulations and protein structures in the world, should also be useful for determining the rules governing protein folding and kinetic stability, which should aid in deciphering genomic information and for protein engineering and design.
Collapse
Affiliation(s)
- David A C Beck
- Department of Bioengineering, University of Washington, Seattle, WA 98195-5013, USA
| | | | | | | | | | | | | | | |
Collapse
|
28
|
Schaeffer RD, Fersht A, Daggett V. Combining experiment and simulation in protein folding: closing the gap for small model systems. Curr Opin Struct Biol 2008; 18:4-9. [PMID: 18242977 DOI: 10.1016/j.sbi.2007.11.007] [Citation(s) in RCA: 87] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2007] [Accepted: 11/29/2007] [Indexed: 11/30/2022]
Abstract
All-atom molecular dynamics (MD) simulations on increasingly powerful computers have been combined with experiments to characterize protein folding in detail over wider time ranges. The folding of small ultrafast folding proteins is being simulated on micros timescales, leading to improved structural predictions and folding rates. To what extent is 'closing the gap' between simulation and experiment for such systems providing insights into general mechanisms of protein folding?
Collapse
Affiliation(s)
- R Dustin Schaeffer
- Biomolecular Structure & Design Program, University of Washington, Seattle, WA 98195, USA
| | | | | |
Collapse
|
29
|
Schaeffer RD, Sproul JC, O'Connell J, van Vloten C, Mantz AW. Multipass absorption cell designed for high temperature UHV operation. Appl Opt 1989; 28:1710-1713. [PMID: 20548730 DOI: 10.1364/ao.28.001710] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
This paper describes the design and performance of a tunable diode laser system, incorporating a multipass absorption cell to allow determination of water concentration below 10 ppb and carbon dioxide concentration below 1 ppb in nitrogen semiconductor gas. The cell is used with a tunable Pb-salt diode laser spectrometer frequency locked to a first derivative error signal from an intense vibration-rotation molecular absorption line. Sample concentrations are monitored in the second derivative mode, and the system automatically compensates for laser intensity fluctuations. A flowing gas method is used to minimize adsorption/desorption effects from the sample cell walls.
Collapse
|