1
de la Flor G, Aroyo MI, Gimondi I, Ward SC, Momma K, Hanson RM, Suescun L. Free tools for crystallographic symmetry handling and visualization. J Appl Crystallogr 2024; 57:1618-1639. [PMID: 39387077 PMCID: PMC11460394 DOI: 10.1107/s1600576724007659] [Received: 05/08/2024] [Accepted: 08/02/2024] [Indexed: 10/12/2024] [Open access]
Abstract
Online courses and innovative teaching methods have triggered a trend in education, where the integration of multimedia, online resources and interactive tools is reshaping the view of both virtual and traditional classrooms. The use of interactive tools extends beyond the boundaries of the physical classroom, offering students the flexibility to access materials at their own speed and convenience and enhancing their learning experience. In the field of crystallography, there are a wide variety of free online resources such as web pages, interactive applets, databases and programs that can be implemented in fundamental crystallography courses for different academic levels and curricula. This paper discusses a variety of resources that can be helpful for crystallographic symmetry handling and visualization, discussing four specific resources in detail: the Bilbao Crystallographic Server, the Cambridge Structural Database, VESTA and Jmol. The utility of these resources is explained and shown by several illustrative examples.
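As a small illustration of what "symmetry handling" means in practice, the sketch below applies the four general-position operations of space group P2_1/c (No. 14, unique axis b) to a point in fractional coordinates. It is not taken from any of the tools named in the abstract; the function names `apply_operation` and `orbit` are hypothetical, and the sample point is arbitrary.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Each operation is (rotation matrix W, translation vector w) acting on
# fractional coordinates as x' = W x + w.
P21_C = [
    (((1, 0, 0), (0, 1, 0), (0, 0, 1)), (0, 0, 0)),           # x, y, z
    (((-1, 0, 0), (0, 1, 0), (0, 0, -1)), (0, HALF, HALF)),   # -x, y+1/2, -z+1/2
    (((-1, 0, 0), (0, -1, 0), (0, 0, -1)), (0, 0, 0)),        # -x, -y, -z
    (((1, 0, 0), (0, -1, 0), (0, 0, 1)), (0, HALF, HALF)),    # x, -y+1/2, z+1/2
]


def apply_operation(op, point):
    """Apply one symmetry operation and wrap the result back into [0, 1)."""
    rotation, translation = op
    return tuple(
        (sum(r * p for r, p in zip(row, point)) + t) % 1
        for row, t in zip(rotation, translation)
    )


def orbit(point, operations=P21_C):
    """Return the distinct symmetry-equivalent positions of a point in the unit cell."""
    images = []
    for op in operations:
        image = apply_operation(op, point)
        if image not in images:
            images.append(image)
    return images


# A general position has four distinct images in the unit cell.
general = orbit((Fraction(1, 10), Fraction(1, 5), Fraction(3, 10)))

# The origin lies on an inversion centre, so only two distinct images remain.
special = orbit((0, 0, 0))
```

Exact fractions are used instead of floats so that coincident images compare equal, which is what lets the sketch distinguish general from special positions.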
Affiliation(s)
- Gemma de la Flor
  - Institute of Applied Geosciences, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Mois I. Aroyo
  - Departamento de Física, Universidad del País Vasco UPV/EHU, Spain
- Leopoldo Suescun
  - Cryssmat-Lab/DETEMA, Facultad de Química, Universidad de la República, Montevideo, Uruguay
2
Rose A, Sehnal D, Goodsell DS, Autin L. Mesoscale explorer: Visual exploration of large-scale molecular models. Protein Sci 2024; 33:e5177. [PMID: 39291955 PMCID: PMC11409463 DOI: 10.1002/pro.5177] [Received: 07/18/2024] [Revised: 08/29/2024] [Accepted: 08/31/2024] [Indexed: 09/19/2024]
Abstract
The advent of cryo-electron microscopy (cryo-EM) and cryo-electron tomography (cryo-ET), coupled with computational modeling, has enabled the creation of integrative 3D models of viruses, bacteria, and cellular organelles. These models, composed of thousands of macromolecules and billions of atoms, have historically posed significant challenges for manipulation and visualization without specialized molecular graphics tools and hardware. With the recent advancements in GPU rendering power and web browser capabilities, it is now feasible to render interactively large molecular scenes directly on the web. In this work, we introduce Mesoscale Explorer, a web application built using the Mol* framework, dedicated to the visualization of large-scale molecular models ranging from viruses to cell organelles. Mesoscale Explorer provides unprecedented access and insight into the molecular fabric of life, enhancing perception, streamlining exploration, and simplifying visualization of diverse data types, showcasing the intricate details of these models with unparalleled clarity.
Affiliation(s)
- David Sehnal
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
- David S. Goodsell
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
- Ludovic Autin
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, California, USA
3
Rose A, Sehnal D, Goodsell DS, Autin L. Mesoscale Explorer - Visual Exploration of Large-Scale Molecular Models. bioRxiv 2024:2024.09.02.610826. [PMID: 39282403 PMCID: PMC11398308 DOI: 10.1101/2024.09.02.610826] [Indexed: 09/22/2024]
Abstract
The advent of cryo-electron microscopy (cryo-EM) and cryo-electron tomography (cryo-ET), coupled with computational modeling, has enabled the creation of integrative 3D models of viruses, bacteria, and cellular organelles. These models, composed of thousands of macromolecules and billions of atoms, have historically posed significant challenges for manipulation and visualization without specialized molecular graphics tools and hardware. With the recent advancements in GPU rendering power and web browser capabilities, it is now feasible to render interactively large molecular scenes directly on the web. In this work, we introduce Mesoscale Explorer, a web application built using the Mol* framework, dedicated to the visualization of large-scale molecular models ranging from viruses to cell organelles. Mesoscale Explorer provides unprecedented access and insight into the molecular fabric of life, enhancing perception, streamlining exploration, and simplifying visualization of diverse data types, showcasing the intricate details of these models with unparalleled clarity.
Affiliation(s)
- David Sehnal
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, 625 00, Brno, Czech Republic
- David S Goodsell
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Ludovic Autin
  - Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
4
Vallat B, Webb BM, Westbrook JD, Goddard TD, Hanke CA, Graziadei A, Peisach E, Zalevsky A, Sagendorf J, Tangmunarunkit H, Voinea S, Sekharan M, Yu J, Bonvin AAMJJ, DiMaio F, Hummer G, Meiler J, Tajkhorshid E, Ferrin TE, Lawson CL, Leitner A, Rappsilber J, Seidel CAM, Jeffries CM, Burley SK, Hoch JC, Kurisu G, Morris K, Patwardhan A, Velankar S, Schwede T, Trewhella J, Kesselman C, Berman HM, Sali A. IHMCIF: An Extension of the PDBx/mmCIF Data Standard for Integrative Structure Determination Methods. J Mol Biol 2024; 436:168546. [PMID: 38508301 PMCID: PMC11377171 DOI: 10.1016/j.jmb.2024.168546] [Received: 01/23/2024] [Revised: 03/11/2024] [Accepted: 03/14/2024] [Indexed: 03/22/2024]
Abstract
IHMCIF (github.com/ihmwg/IHMCIF) is a data information framework that supports archiving and disseminating macromolecular structures determined by integrative or hybrid modeling (IHM), and making them Findable, Accessible, Interoperable, and Reusable (FAIR). IHMCIF is an extension of the Protein Data Bank Exchange/macromolecular Crystallographic Information Framework (PDBx/mmCIF) that serves as the framework for the Protein Data Bank (PDB) to archive experimentally determined atomic structures of biological macromolecules and their complexes with one another and small molecule ligands (e.g., enzyme cofactors and drugs). IHMCIF serves as the foundational data standard for the PDB-Dev prototype system, developed for archiving and disseminating integrative structures. It utilizes a flexible data representation to describe integrative structures that span multiple spatiotemporal scales and structural states with definitions for restraints from a variety of experimental methods contributing to integrative structural biology. The IHMCIF extension was created with the benefit of considerable community input and recommendations gathered by the Worldwide Protein Data Bank (wwPDB) Task Force for Integrative or Hybrid Methods (wwpdb.org/task/hybrid). Herein, we describe the development of IHMCIF to support evolving methodologies and ongoing advancements in integrative structural biology. Ultimately, IHMCIF will facilitate the unification of PDB-Dev data and tools with the PDB archive so that integrative structures can be archived and disseminated through PDB.
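To make the "category.item" organisation of PDBx/mmCIF-style files more concrete, here is a minimal sketch of a reader for single-line data items and simple `loop_` tables. It is an assumption-laden toy, not the wwPDB or IHMCIF tooling: the function name `parse_simple_cif` is hypothetical, the `_demo_model` category is invented for illustration and is not part of any dictionary, and real files also contain multi-line text fields and quoting rules that this sketch ignores.

```python
import shlex


def parse_simple_cif(text):
    """Parse single-line items and loop_ tables into {category: [row dicts]}."""
    categories = {}
    loop_tags = None   # tag names of the loop_ being read, or None
    in_header = False  # True while loop_ tag names are still being listed
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("data_"):
            continue
        if line.startswith("#"):
            loop_tags = None  # '#' lines conventionally separate categories
            continue
        if line == "loop_":
            loop_tags, in_header = [], True
            continue
        tokens = shlex.split(line)  # handles simple 'quoted values'
        if line.startswith("_") and loop_tags is not None and in_header:
            loop_tags.append(tokens[0])
        elif line.startswith("_"):
            loop_tags = None
            category, item = tokens[0][1:].split(".", 1)
            categories.setdefault(category, [{}])[0][item] = tokens[1]
        elif loop_tags:
            in_header = False
            category = loop_tags[0][1:].split(".", 1)[0]
            row = {tag.split(".", 1)[1]: value for tag, value in zip(loop_tags, tokens)}
            categories.setdefault(category, []).append(row)
    return categories


# Toy input: _entry.id and _struct.title are standard PDBx/mmCIF items;
# the _demo_model loop is a made-up category used only for this example.
EXAMPLE = """\
data_demo
#
_entry.id   demo
_struct.title 'Toy integrative model'
#
loop_
_demo_model.id
_demo_model.name
1 'Cluster 1'
2 'Cluster 2'
"""

parsed = parse_simple_cif(EXAMPLE)
```

An extension dictionary such as IHMCIF adds further categories in this same tag-and-loop syntax, which is why existing mmCIF readers can be adapted to it rather than rewritten.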
Affiliation(s)
- Brinda Vallat
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
- Benjamin M Webb
  - Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, the Quantitative Biosciences Institute (QBI), and the Research Collaboratory for Structural Bioinformatics Protein Data Bank, University of California, San Francisco, San Francisco, CA 94157, USA
- John D Westbrook
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
- Thomas D Goddard
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158, USA
- Christian A Hanke
  - Molecular Physical Chemistry, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Andrea Graziadei
  - Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, 10623 Berlin, Germany; Human Technopole, 20157 Milan, Italy
- Ezra Peisach
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Arthur Zalevsky
  - Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, the Quantitative Biosciences Institute (QBI), and the Research Collaboratory for Structural Bioinformatics Protein Data Bank, University of California, San Francisco, San Francisco, CA 94157, USA
- Jared Sagendorf
  - Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, the Quantitative Biosciences Institute (QBI), and the Research Collaboratory for Structural Bioinformatics Protein Data Bank, University of California, San Francisco, San Francisco, CA 94157, USA
- Hongsuda Tangmunarunkit
  - Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA
- Serban Voinea
  - Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA
- Monica Sekharan
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Jian Yu
  - Protein Data Bank Japan, Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Alexander A M J J Bonvin
  - Bijvoet Centre for Biomolecular Research, Faculty of Science - Chemistry, Utrecht University, Padualaan 8, 3584 CH Utrecht, the Netherlands
- Frank DiMaio
  - Department of Biochemistry and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
- Gerhard Hummer
  - Department of Theoretical Biophysics, Max Planck Institute of Biophysics, 60438 Frankfurt am Main, Germany; Institute for Biophysics, Goethe University Frankfurt, 60438 Frankfurt am Main, Germany
- Jens Meiler
  - Center for Structural Biology, Vanderbilt University, 465 21st Avenue South, Nashville, TN 37221, USA; Institute for Drug Discovery, Leipzig University Medical School, 04103 Leipzig, Germany
- Emad Tajkhorshid
  - NIH Resource for Macromolecular Modeling and Visualization, Beckman Institute for Advanced Science and Technology, Department of Biochemistry, and Center for Biophysics and Quantitative Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Thomas E Ferrin
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158, USA
- Catherine L Lawson
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Alexander Leitner
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, 8093 Zurich, Switzerland
- Juri Rappsilber
  - Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, 10623 Berlin, Germany; Wellcome Centre for Cell Biology, University of Edinburgh, Max Born Crescent, Edinburgh EH9 3BF, UK
- Claus A M Seidel
  - Molecular Physical Chemistry, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany
- Cy M Jeffries
  - European Molecular Biology Laboratory (EMBL), Hamburg Unit, c/o Deutsches Elektronen-Synchrotron (DESY), Notkestrasse 85, 22607 Hamburg, Germany
- Stephen K Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Jeffrey C Hoch
  - Biological Magnetic Resonance Data Bank, Department of Molecular Biology and Biophysics, University of Connecticut, Farmington, CT 06030-3305, USA
- Genji Kurisu
  - Protein Data Bank Japan, Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Kyle Morris
  - Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Ardan Patwardhan
  - Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
- Sameer Velankar
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
- Torsten Schwede
  - Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology & SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Jill Trewhella
  - School of Life and Environmental Sciences, The University of Sydney, Sydney, NSW 2006, Australia; Department of Chemistry, University of Utah, Salt Lake City, UT 84112, USA
- Carl Kesselman
  - Information Sciences Institute, Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA
- Helen M Berman
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles CA 90089, USA
- Andrej Sali
  - Department of Bioengineering and Therapeutic Sciences, Department of Pharmaceutical Chemistry, the Quantitative Biosciences Institute (QBI), and the Research Collaboratory for Structural Bioinformatics Protein Data Bank, University of California, San Francisco, San Francisco, CA 94157, USA
5
Gogal RA, Nessler AJ, Thiel AC, Bernabe HV, Corrigan Grove RA, Cousineau LM, Litman JM, Miller JM, Qi G, Speranza MJ, Tollefson MR, Fenn TD, Michaelson JJ, Okada O, Piquemal JP, Ponder JW, Shen J, Smith RJH, Yang W, Ren P, Schnieders MJ. Force Field X: A computational microscope to study genetic variation and organic crystals using theory and experiment. J Chem Phys 2024; 161:012501. [PMID: 38958156 PMCID: PMC11223778 DOI: 10.1063/5.0214652] [Received: 04/18/2024] [Accepted: 06/17/2024] [Indexed: 07/04/2024] [Open access]
Abstract
Force Field X (FFX) is an open-source software package for atomic resolution modeling of genetic variants and organic crystals that leverages advanced potential energy functions and experimental data. FFX currently consists of nine modular packages with novel algorithms that include global optimization via a many-body expansion, acid-base chemistry using polarizable constant-pH molecular dynamics, estimation of free energy differences, generalized Kirkwood implicit solvent models, and many more. Applications of FFX focus on the use and development of a crystal structure prediction pipeline, biomolecular structure refinement against experimental datasets, and estimation of the thermodynamic effects of genetic variants on both proteins and nucleic acids. The use of Parallel Java and OpenMM combines to offer shared memory, message passing, and graphics processing unit parallelization for high performance simulations. Overall, the FFX platform serves as a computational microscope to study systems ranging from organic crystals to solvated biomolecular systems.
Affiliation(s)
- Rose A. Gogal
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Aaron J. Nessler
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Andrew C. Thiel
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Hernan V. Bernabe
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Rae A. Corrigan Grove
  - Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87545, USA
- Leah M. Cousineau
  - Department of Biochemistry and Molecular Biology, University of Iowa, Iowa City, Iowa 52242, USA
- Jacob M. Litman
  - Department of Biochemistry and Molecular Biology, University of Iowa, Iowa City, Iowa 52242, USA
- Jacob M. Miller
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Guowei Qi
  - Department of Biochemistry and Molecular Biology, University of Iowa, Iowa City, Iowa 52242, USA
- Matthew J. Speranza
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Mallory R. Tollefson
  - Roy J. Carver Department of Biomedical Engineering, University of Iowa, Iowa City, Iowa 52242, USA
- Timothy D. Fenn
  - Analytical Development, LEXEO Therapeutics, New York, New York 10010, USA
- Jacob J. Michaelson
  - Department of Psychiatry, University of Iowa Hospitals and Clinics, Iowa City, Iowa 52242, USA
- Okimasa Okada
  - Sohyaku Innovative Research Division, Mitsubishi Tanabe Pharma Corporation, 1000 Kamoshida-cho, Aoba-ku, Yokohama, Kanagawa 227-0033, Japan
- Jay W. Ponder
  - Department of Chemistry, Washington University in St. Louis, St. Louis, Missouri 63130, USA
- Jana Shen
  - Department of Pharmaceutical Sciences, University of Maryland School of Pharmacy, Baltimore, Maryland 21201, USA
- Richard J. H. Smith
  - Molecular Otolaryngology and Renal Research Laboratories, Department of Otolaryngology, University of Iowa Hospitals and Clinics, Iowa City, Iowa 52242, USA
- Pengyu Ren
  - Department of Biomedical Engineering, University of Texas, Austin, Texas 78712, USA
6
Bittrich S, Midlik A, Varadi M, Velankar S, Burley SK, Young JY, Sehnal D, Vallat B. Describing and Sharing Molecular Visualizations Using the MolViewSpec Toolkit. Curr Protoc 2024; 4:e1099. [PMID: 39024028 PMCID: PMC11338654 DOI: 10.1002/cpz1.1099] [Indexed: 07/20/2024]
Abstract
With the ever-expanding toolkit of molecular viewers, the ability to visualize macromolecular structures has never been more accessible. Yet, the idiosyncratic technical intricacies across tools and the integration complexities associated with handling structure annotation data present significant barriers to seamless interoperability and steep learning curves for many users. The necessity for reproducible data visualizations is at the forefront of the current challenges. Recently, we introduced MolViewSpec (homepage: https://molstar.org/mol-view-spec/, GitHub project: https://github.com/molstar/mol-view-spec), a specification approach that defines molecular visualizations, decoupling them from the varying implementation details of different molecular viewers. Through the protocols presented herein, we demonstrate how to use MolViewSpec and its 3D view-building Python library for creating sophisticated, customized 3D views covering all standard molecular visualizations. MolViewSpec supports representations like cartoon and ball-and-stick with coloring, labeling, and applying complex transformations such as superposition to any macromolecular structure file in mmCIF, BinaryCIF, and PDB formats. These examples showcase progress towards reusability and interoperability of molecular 3D visualization in an era when handling molecular structures at scale is a timely and pressing matter in structural bioinformatics as well as research and education across the life sciences. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
- Basic Protocol 1: Creating a MolViewSpec view using the MolViewSpec Python package
- Basic Protocol 2: Creating a MolViewSpec view with reference to MolViewSpec annotation files
- Basic Protocol 3: Creating a MolViewSpec view with labels and other advanced features
- Support Protocol 1: Computing rotation and translation vectors
- Support Protocol 2: Creating a MolViewSpec annotation file
Affiliation(s)
- Sebastian Bittrich
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California
  - These authors contributed equally to this work
- Adam Midlik
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
  - These authors contributed equally to this work
- Mihaly Varadi
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
- Sameer Velankar
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
- Stephen K. Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, New Jersey
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey
- Jasmine Y. Young
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey
- David Sehnal
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
- Brinda Vallat
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank and the Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, New Jersey
7
Deorowicz S, Gudyś A. Efficient protein structure archiving using ProteStAr. Bioinformatics 2024; 40:btae428. [PMID: 38984796 PMCID: PMC11239224 DOI: 10.1093/bioinformatics/btae428] [Received: 12/11/2023] [Revised: 06/11/2024] [Accepted: 07/08/2024] [Indexed: 07/11/2024] [Open access]
Abstract
MOTIVATION: The introduction of DeepMind's AlphaFold 2 enabled the prediction of protein structures at an unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of terabytes of data, which hinders the effective use of predicted structures in large-scale analyses.
RESULTS: Here, we present ProteStAr, a compressor dedicated to CIF/PDB files, as well as supplementary PAE files. Its main contribution is a novel approach to predicting atom coordinates on the basis of the previously analyzed atoms. This allows efficient encoding of the coordinates, the largest component of the protein structure files. The compression is lossless by default, though a lossy mode with a controlled maximum coordinate reconstruction error is also available. Compared to the competing packages, i.e. BinaryCIF, Foldcomp and PDC, our approach offers a superior compression ratio at established reconstruction accuracy. By the efficient use of threads at both compression and decompression stages, the algorithm takes advantage of the multicore architecture of current central processing units and operates with speeds of about 1 GB/s. The availability of Python and C++ APIs further increases the usability of the presented method.
AVAILABILITY AND IMPLEMENTATION: The source code of ProteStAr is available at https://github.com/refresh-bio/protestar.
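The general idea of "predict the next coordinate, then store only the prediction error" can be sketched in a few lines. The example below is not ProteStAr's predictor, which the abstract describes only at a high level; it swaps in the simplest possible prediction (the previous atom's position) on coordinates quantised to three decimals, and all function names and the toy coordinates are hypothetical.

```python
def quantize(coords, decimals=3):
    """Convert float coordinates to integers at a fixed decimal precision."""
    scale = 10 ** decimals
    return [tuple(round(c * scale) for c in atom) for atom in coords]


def encode(quantized):
    """Store each atom as its difference from the prediction (here: the previous atom)."""
    residuals, previous = [], (0, 0, 0)
    for atom in quantized:
        residuals.append(tuple(a - p for a, p in zip(atom, previous)))
        previous = atom
    return residuals


def decode(residuals):
    """Invert encode() by accumulating the residuals."""
    atoms, previous = [], (0, 0, 0)
    for delta in residuals:
        previous = tuple(p + d for p, d in zip(previous, delta))
        atoms.append(previous)
    return atoms


# Made-up coordinates in angstroms, loosely spaced like bonded atoms.
toy_coords = [(11.104, 6.134, -6.504), (11.639, 6.071, -5.147), (10.550, 6.300, -4.100)]

quantized = quantize(toy_coords)
residuals = encode(quantized)
restored = decode(residuals)
```

Because neighbouring atoms sit close together, the residuals after the first atom are much smaller in magnitude than the raw coordinates, which is what makes them cheaper for a downstream entropy coder to store. The round trip is exact on the quantised integers; any loss comes only from the choice of `decimals`.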
Affiliation(s)
- Sebastian Deorowicz
  - Department of Algorithmics and Software, Silesian University of Technology, Akademicka 16, Gliwice, PL-44100, Poland
- Adam Gudyś
  - Department of Algorithmics and Software, Silesian University of Technology, Akademicka 16, Gliwice, PL-44100, Poland
8
Bittrich S, Segura J, Duarte JM, Burley SK, Rose Y. RCSB Protein Data Bank: exploring protein 3D similarities via comprehensive structural alignments. Bioinformatics 2024; 40:btae370. [PMID: 38870521 PMCID: PMC11212067 DOI: 10.1093/bioinformatics/btae370] [Received: 01/26/2024] [Revised: 05/15/2024] [Accepted: 06/11/2024] [Indexed: 06/15/2024] [Open access]
Abstract
MOTIVATION: Tools for pairwise alignments between 3D structures of proteins are of fundamental importance for structural biology and bioinformatics, enabling visual exploration of evolutionary and functional relationships. However, the absence of a user-friendly, browser-based tool for creating alignments and visualizing them at both 1D sequence and 3D structural levels makes this process unnecessarily cumbersome.
RESULTS: We introduce a novel pairwise structure alignment tool (rcsb.org/alignment) that seamlessly integrates into the RCSB Protein Data Bank (RCSB PDB) research-focused RCSB.org web portal. Our tool and its underlying application programming interface (alignment.rcsb.org) empower users to align several protein chains with a reference structure by providing access to established alignment algorithms (FATCAT, CE, TM-align, or Smith-Waterman 3D). The user-friendly interface simplifies parameter setup and input selection. Within seconds, our tool enables visualization of results in both sequence (1D) and structural (3D) perspectives through the RCSB PDB RCSB.org Sequence Annotations viewer and Mol* 3D viewer, respectively. Users can effortlessly compare structures deposited in the PDB archive alongside more than a million incorporated Computed Structure Models coming from the ModelArchive and AlphaFold DB. Moreover, this tool can be used to align custom structure data by providing a link/URL or uploading atomic coordinate files directly. Importantly, alignment results can be bookmarked and shared with collaborators. By bridging the gap between 1D sequence and 3D structures of proteins, our tool facilitates deeper understanding of complex evolutionary relationships among proteins through comprehensive sequence and structural analyses.
AVAILABILITY AND IMPLEMENTATION: The alignment tool is part of the RCSB PDB research-focused RCSB.org web portal and available at rcsb.org/alignment. Programmatic access is available via alignment.rcsb.org. Frontend code has been published at github.com/rcsb/rcsb-pecos-app. Visualization is powered by the open-source Mol* viewer (github.com/molstar/molstar and github.com/molstar/rcsb-molstar) plus the Sequence Annotations in 3D Viewer (github.com/rcsb/rcsb-saguaro-3d).
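For readers unfamiliar with the algorithm family named in the abstract, the sketch below shows the classic Smith-Waterman local alignment recurrence on plain sequences. This is the textbook 1D sequence version, not the "Smith-Waterman 3D" structural variant offered by the tool, whose scoring is not described here; the function name and the match, mismatch and gap values are arbitrary assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local: a poor prefix is dropped.
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


# The shared segment "CDEF" gives four matches at 2 points each.
score = smith_waterman_score("ACDEFG", "XXCDEFYY")
```

Structural aligners keep the same dynamic-programming skeleton but replace the residue match score with a measure of 3D similarity between positions.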
Affiliation(s)
- Sebastian Bittrich
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, United States
- Joan Segura
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, United States
- Jose M Duarte
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, United States
- Stephen K Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, United States
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, United States
  - Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, United States
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, United States
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, United States
- Yana Rose
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, United States
9
Bitencourt-Ferreira G, Villarreal MA, Quiroga R, Biziukova N, Poroikov V, Tarasova O, de Azevedo Junior WF. Exploring Scoring Function Space: Developing Computational Models for Drug Discovery. Curr Med Chem 2024; 31:2361-2377. [PMID: 36944627 DOI: 10.2174/0929867330666230321103731] [Received: 06/23/2022] [Revised: 12/15/2022] [Accepted: 12/29/2022] [Indexed: 03/23/2023]
Abstract
BACKGROUND The idea of scoring function space established a systems-level approach to address the development of models to predict the affinity of drug molecules by those interested in drug discovery. OBJECTIVE Our goal here is to review the concept of scoring function space and how to explore it to develop machine learning models to address protein-ligand binding affinity. METHODS We searched the articles available in PubMed related to the scoring function space. We also utilized crystallographic structures found in the protein data bank (PDB) to represent the protein space. RESULTS The application of systems-level approaches to address receptor-drug interactions allows us to have a holistic view of the process of drug discovery. The scoring function space adds flexibility to the process since it makes it possible to see drug discovery as a relationship involving mathematical spaces. CONCLUSION The application of the concept of scoring function space has provided us with an integrated view of drug discovery methods. This concept is useful during drug discovery, where we see the process as a computational search of the scoring function space to find an adequate model to predict receptor-drug binding affinity.
Affiliation(s)
| | - Marcos A Villarreal
- CONICET-Departamento de Matemática y Física, Instituto de Investigaciones en Fisicoquímica de Córdoba (INFIQC), Facultad de Ciencias Químicas, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina
| | - Rodrigo Quiroga
- CONICET-Departamento de Matemática y Física, Instituto de Investigaciones en Fisicoquímica de Córdoba (INFIQC), Facultad de Ciencias Químicas, Universidad Nacional de Córdoba, Ciudad Universitaria, Córdoba, Argentina
| | - Nadezhda Biziukova
- Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
| | - Vladimir Poroikov
- Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
| | - Olga Tarasova
- Institute of Biomedical Chemistry, Pogodinskaya Str., 10/8, Moscow, 119121, Russia
| | - Walter F de Azevedo Junior
- Pontifical Catholic University of Rio Grande do Sul - PUCRS, Porto Alegre-RS, Brazil
- Specialization Program in Bioinformatics, The Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga, 6681 Porto Alegre / RS 90619-900, Brazil
|
10
|
Staniscia L, Yu YW. Image-centric compression of protein structures improves space savings. BMC Bioinformatics 2023; 24:437. [PMID: 37990290 PMCID: PMC10664254 DOI: 10.1186/s12859-023-05570-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Accepted: 11/15/2023] [Indexed: 11/23/2023] Open
Abstract
BACKGROUND Because of the rapid generation of data, the study of compression algorithms to reduce storage and transmission costs is important to bioinformaticians. Much of the focus has been on sequence data, including both genomes and protein amino acid sequences stored in FASTA files. Current standard practice is to use an ordinary lossless compressor such as gzip on a sequential list of atomic coordinates, but this approach expends bits on saving an arbitrary ordering of atoms, and it also prevents reordering the atoms for compressibility. The standard MMTF and BCIF file formats extend this approach with custom encoding of the coordinates. However, the brand new Foldcomp tool introduces a new paradigm of compressing local angles, to great effect. In this article, we explore a different paradigm, showing for the first time that image-based compression using global angles can also significantly improve compression ratios. To this end, we implement a prototype compressor 'PIC', specialized for point clouds of atom coordinates contained in PDB and mmCIF files. PIC maps the 3D data to a 2D 8-bit greyscale image and leverages the well developed PNG image compressor to minimize the size of the resulting image, forming the compressed file. RESULTS PIC outperforms gzip in terms of compression ratio on proteins over 20,000 atoms in size, with a savings over gzip of up to 37.4% on the proteins compressed. In addition, PIC's compression ratio increases with protein size. CONCLUSION Image-centric compression as demonstrated by our prototype PIC provides a potential means of constructing 3D structure-aware protein compression software, though future work would be necessary to make this practical.
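To make the gzip baseline and the reported savings figure concrete, the following minimal Python sketch compresses a synthetic, fixed-width list of coordinates with gzip and computes a percentage saving relative to a baseline size. The coordinates are random placeholders, the function names `coordinate_text` and `savings_over_baseline` are hypothetical, and the definition of savings as 1 − candidate/baseline is an assumption rather than a formula quoted from the paper. None of this reproduces PIC itself.

```python
import gzip
import random


def coordinate_text(n_atoms, seed=0):
    """Build a synthetic sequential list of x, y, z values (placeholder data)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_atoms):
        x, y, z = (rng.uniform(-50.0, 50.0) for _ in range(3))
        # Fixed-width fields with three decimals, similar in spirit to PDB text.
        rows.append(f"{x:8.3f}{y:8.3f}{z:8.3f}")
    return "\n".join(rows).encode("ascii")


def savings_over_baseline(candidate_size, baseline_size):
    """Percentage saved by a candidate relative to a baseline (assumed definition)."""
    return 100.0 * (1.0 - candidate_size / baseline_size)


raw = coordinate_text(1000)
gz = gzip.compress(raw)  # the "ordinary lossless compressor" baseline
print(len(raw), len(gz), round(savings_over_baseline(len(gz), len(raw)), 1))
```

Under this assumed definition, a structure-aware compressor would be compared by passing its output size as `candidate_size` and the gzip size as `baseline_size`.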
Affiliation(s)
- Luke Staniscia
- Department of Mathematics, University of Toronto, Toronto, ON, Canada
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, ON, Canada.
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA.
|
11
|
Vallat B, Tauriello G, Bienert S, Haas J, Webb BM, Žídek A, Zheng W, Peisach E, Piehl DW, Anischanka I, Sillitoe I, Tolchard J, Varadi M, Baker D, Orengo C, Zhang Y, Hoch JC, Kurisu G, Patwardhan A, Velankar S, Burley SK, Sali A, Schwede T, Berman HM, Westbrook JD. ModelCIF: An Extension of PDBx/mmCIF Data Representation for Computed Structure Models. J Mol Biol 2023; 435:168021. [PMID: 36828268 PMCID: PMC10293049 DOI: 10.1016/j.jmb.2023.168021] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 11/29/2022] [Revised: 02/15/2023] [Accepted: 02/16/2023] [Indexed: 02/24/2023]
Abstract
ModelCIF (github.com/ihmwg/ModelCIF) is a data information framework developed for and by computational structural biologists to enable delivery of Findable, Accessible, Interoperable, and Reusable (FAIR) data to users worldwide. ModelCIF describes the specific set of attributes and metadata associated with macromolecular structures modeled by solely computational methods and provides an extensible data representation for deposition, archiving, and public dissemination of predicted three-dimensional (3D) models of macromolecules. It is an extension of the Protein Data Bank Exchange / macromolecular Crystallographic Information Framework (PDBx/mmCIF), which is the global data standard for representing experimentally-determined 3D structures of macromolecules and associated metadata. The PDBx/mmCIF framework and its extensions (e.g., ModelCIF) are managed by the Worldwide Protein Data Bank partnership (wwPDB, wwpdb.org) in collaboration with relevant community stakeholders such as the wwPDB ModelCIF Working Group (wwpdb.org/task/modelcif). This semantically rich and extensible data framework for representing computed structure models (CSMs) accelerates the pace of scientific discovery. Herein, we describe the architecture, contents, and governance of ModelCIF, and tools and processes for maintaining and extending the data standard. Community tools and software libraries that support ModelCIF are also described.
Affiliation(s)
- Brinda Vallat
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA.
| | - Gerardo Tauriello
- Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Stefan Bienert
- Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Juergen Haas
- Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Benjamin M Webb
- Department of Bioengineering and Therapeutic Sciences, the Quantitative Biosciences Institute (QBI), and the Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA 94157, USA
| | | | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Ezra Peisach
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Dennis W Piehl
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Ivan Anischanka
- Department of Biochemistry, and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, UCL, London, UK
| | - James Tolchard
- AlphaFold Protein Structure Database, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK; Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
| | - Mihaly Varadi
- AlphaFold Protein Structure Database, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK; Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
| | - David Baker
- Department of Biochemistry, and Institute for Protein Design, University of Washington, Seattle, WA 98195, USA; Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| | | | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Jeffrey C Hoch
- Biological Magnetic Resonance Data Bank, Department of Molecular Biology and Biophysics, University of Connecticut, Farmington, CT 06030, USA
| | - Genji Kurisu
- Protein Data Bank Japan, Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
| | - Ardan Patwardhan
- Electron Microscopy Data Bank, European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, UK
| | - Sameer Velankar
- AlphaFold Protein Structure Database, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK; Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Andrej Sali
- Department of Bioengineering and Therapeutic Sciences, the Quantitative Biosciences Institute (QBI), and the Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA 94157, USA. https://twitter.com/salilab_ucsf
| | - Torsten Schwede
- Biozentrum, University of Basel, Basel, Switzerland; Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Helen M Berman
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
|
12
|
Bittrich S, Bhikadiya C, Bi C, Chao H, Duarte JM, Dutta S, Fayazi M, Henry J, Khokhriakov I, Lowe R, Piehl DW, Segura J, Vallat B, Voigt M, Westbrook JD, Burley SK, Rose Y. RCSB Protein Data Bank: Efficient Searching and Simultaneous Access to One Million Computed Structure Models Alongside the PDB Structures Enabled by Architectural Advances. J Mol Biol 2023; 435:167994. [PMID: 36738985 PMCID: PMC11514064 DOI: 10.1016/j.jmb.2023.167994] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Received: 11/18/2022] [Revised: 01/27/2023] [Accepted: 01/28/2023] [Indexed: 02/05/2023]
Abstract
The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) provides open access to experimentally-determined three-dimensional (3D) structures of biomolecules. The RCSB PDB RCSB.org research-focused web portal is used annually by many millions of users around the world. They access biostructure information, run complex queries utilizing various search services (e.g., full-text, structural and chemical attribute, chemical, sequence, and structure similarity searches), and visualize macromolecules in 3D, all at no charge and with no limitations on data usage. Notwithstanding more than 24,000-fold growth of the PDB over the past five decades, experimentally-determined structures are only available for a small subset of the millions of proteins of known sequence. Recently developed machine learning software tools can predict 3D structures of proteins at accuracies comparable to lower-resolution experimental methods. The RCSB PDB now provides access to ∼1,000,000 Computed Structure Models (CSMs) of proteins coming from AlphaFold DB and the ModelArchive alongside ∼200,000 experimentally-determined PDB structures. Both CSMs and PDB structures are available on RCSB.org and via well-established RCSB PDB Data, Search, and 1D-Coordinates application programming interfaces (APIs). Simultaneous delivery of PDB data and CSMs provides users with access to complementary structural information across the human proteome and those of model organisms and selected pathogens. API enhancements are backwards-compatible and programmatic users can "opt in" to access CSMs with minimal effort. Herein, we describe modifications to RCSB PDB cyberinfrastructure required to support sixfold scaling of 3D biostructure data delivery and lay the groundwork for scaling to accommodate hundreds of millions of CSMs.
Affiliation(s)
- Sebastian Bittrich
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA.
| | - Charmi Bhikadiya
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Chunxiao Bi
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Henry Chao
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Jose M Duarte
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Shuchismita Dutta
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Maryam Fayazi
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Jeremy Henry
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Igor Khokhriakov
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Robert Lowe
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Dennis W Piehl
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Joan Segura
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Brinda Vallat
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Maria Voigt
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Yana Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
|
13
|
Choudhary P, Anyango S, Berrisford J, Tolchard J, Varadi M, Velankar S. Unified access to up-to-date residue-level annotations from UniProtKB and other biological databases for PDB data. Sci Data 2023; 10:204. [PMID: 37045837 PMCID: PMC10097656 DOI: 10.1038/s41597-023-02101-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 03/23/2023] [Indexed: 04/14/2023] Open
Abstract
More than 61,000 proteins have up-to-date correspondence between their amino acid sequence (UniProtKB) and their 3D structures (PDB), enabled by the Structure Integration with Function, Taxonomy and Sequences (SIFTS) resource. SIFTS incorporates residue-level annotations from many other biological resources. SIFTS data are available in various formats, such as XML, CSV and TSV, and are also accessible via the PDBe REST API, but have always been maintained separately from the structure data (PDBx/mmCIF file) in the PDB archive. Here, we extended the wwPDB PDBx/mmCIF data dictionary with additional categories to accommodate SIFTS data and added the UniProtKB, Pfam, SCOP2, and CATH residue-level annotations directly into the PDBx/mmCIF files from the PDB archive. With the integrated UniProtKB annotations, these files now provide consistent numbering of residues in different PDB entries, allowing easy comparison of structure models. The extended dictionary yields a more consistent, standardised metadata description without altering the core PDB information. This development enables up-to-date cross-reference information at the residue level, resulting in better data interoperability and supporting improved data analysis and visualisation.
Grants
- BB/V004247/1, PI:Sameer Velankar RCUK | Biotechnology and Biological Sciences Research Council (BBSRC)
- DBI-2019297, PI: S.K. Burley National Science Foundation (NSF)
- DBI-2019297, PI: S.K. Burley NSF | National Science Board (NSB)
Affiliation(s)
- Preeti Choudhary
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - Stephen Anyango
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - John Berrisford
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- AstraZeneca, Biomedical Campus, 1 Francis Crick Ave, Trumpington, Cambridge, CB2 0AA, UK
| | - James Tolchard
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Claude Bernard University, Villeurbanne, Lyon, 69100, France
| | - Mihaly Varadi
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
|
14
|
Zhang C, Pyle AM. PDC: a highly compact file format to store protein 3D coordinates. Database (Oxford) 2023; 2023:baad018. [PMID: 37010520 PMCID: PMC10069377 DOI: 10.1093/database/baad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/01/2023] [Accepted: 03/07/2023] [Indexed: 04/04/2023]
Abstract
Recent improvements in computational and experimental techniques for obtaining protein structures have resulted in an explosion of 3D coordinate data. To cope with the ever-increasing sizes of structure databases, this work proposes the Protein Data Compression (PDC) format, which compresses coordinates and temperature factors of full-atomic and Cα-only protein structures. Without loss of precision, PDC results in 69% to 78% smaller file sizes than Protein Data Bank (PDB) and macromolecular Crystallographic Information File (mmCIF) files with standard GZIP compression. It uses ∼60% less space than existing compression algorithms specific to macromolecular structures. PDC optionally performs lossy compression with minimal sacrifice of precision, which allows reduction of file sizes by another 79%. Conversion between PDC, mmCIF and PDB formats is typically achieved within 0.02 s. The compactness and fast reading/writing speed of PDC make it valuable for storage and analysis of large quantities of tertiary structural data. Database URL: https://github.com/kad-ecoli/pdc.
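One generic way to store three-decimal coordinates without losing precision is to convert them to integers and keep only the differences between consecutive values. The sketch below illustrates that idea in Python. It is an illustration under stated assumptions, not the PDC algorithm, and the names `encode_fixed_point` and `decode_fixed_point` and the scale of 1000 are hypothetical choices.

```python
def encode_fixed_point(values, scale=1000):
    """Turn decimal values into integer deltas (first entry is the absolute value)."""
    ints = [round(v * scale) for v in values]
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def decode_fixed_point(deltas, scale=1000):
    """Invert encode_fixed_point by accumulating the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total / scale)
    return out


xs = [12.345, 12.901, 13.456, 11.002]  # placeholder coordinates in angstroms
deltas = encode_fixed_point(xs)
print(deltas)
print(decode_fixed_point(deltas) == xs)
```

Small deltas need fewer bits than absolute values, which is why schemes of this general kind tend to compress well. A lossy variant would simply use a smaller `scale`.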
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Av, Ann Arbor, MI 48109, USA
- Howard Hughes Medical Institute, 4000 Jones Bridge Rd, Chevy Chase, MD 20815, USA
- Department of Molecular, Cellular, and Developmental Biology, Yale University, 266 Whitney Av, New Haven, CT 06511, USA
| | - Anna Marie Pyle
- Howard Hughes Medical Institute, 4000 Jones Bridge Rd, Chevy Chase, MD 20815, USA
- Department of Molecular, Cellular, and Developmental Biology, Yale University, 266 Whitney Av, New Haven, CT 06511, USA
- Department of Chemistry, Yale University, 225 Prospect St, New Haven, CT 06511, USA
|
15
|
Kim H, Mirdita M, Steinegger M. Foldcomp: a library and format for compressing and indexing large protein structure sets. Bioinformatics 2023; 39:btad153. [PMID: 36961332 PMCID: PMC10085514 DOI: 10.1093/bioinformatics/btad153] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/08/2022] [Revised: 02/17/2023] [Accepted: 03/19/2023] [Indexed: 03/25/2023] Open
Abstract
SUMMARY Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures. AVAILABILITY AND IMPLEMENTATION Foldcomp is free open-source software (GPLv3) and is available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) databases ready for download.
Affiliation(s)
- Hyunbin Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
| | - Milot Mirdita
- School of Biological Sciences, Seoul National University, Seoul 08826, South Korea
| | - Martin Steinegger
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
- School of Biological Sciences, Seoul National University, Seoul 08826, South Korea
- Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea
- Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea
|
16
|
Westbrook JD, Young JY, Shao C, Feng Z, Guranovic V, Lawson CL, Vallat B, Adams PD, Berrisford JM, Bricogne G, Diederichs K, Joosten RP, Keller P, Moriarty NW, Sobolev OV, Velankar S, Vonrhein C, Waterman DG, Kurisu G, Berman HM, Burley SK, Peisach E. PDBx/mmCIF Ecosystem: Foundational Semantic Tools for Structural Biology. J Mol Biol 2022; 434:167599. [PMID: 35460671 DOI: 10.1016/j.jmb.2022.167599] [Citation(s) in RCA: 37] [Impact Index Per Article: 18.5] [Received: 11/30/2021] [Revised: 03/31/2022] [Accepted: 04/13/2022] [Indexed: 02/07/2023]
Abstract
PDBx/mmCIF, Protein Data Bank Exchange (PDBx) macromolecular Crystallographic Information Framework (mmCIF), has become the data standard for structural biology. With its early roots in the domain of small-molecule crystallography, PDBx/mmCIF provides an extensible data representation that is used for deposition, archiving, remediation, and public dissemination of experimentally determined three-dimensional (3D) structures of biological macromolecules by the Worldwide Protein Data Bank (wwPDB, wwpdb.org). Extensions of PDBx/mmCIF are similarly used for computed structure models by ModelArchive (modelarchive.org), integrative/hybrid structures by PDB-Dev (pdb-dev.wwpdb.org), small angle scattering data by the Small Angle Scattering Biological Data Bank SASBDB (sasbdb.org), and for models generated with the AlphaFold 2.0 deep learning software suite (alphafold.ebi.ac.uk). Community-driven development of PDBx/mmCIF spans three decades, involving contributions from researchers, software and methods developers in structural sciences, data repository providers, scientific publishers, and professional societies. Having a semantically rich and extensible data framework for representing a wide range of structural biology experimental and computational results, combined with expertly curated 3D biostructure data sets in public repositories, accelerates the pace of scientific discovery. Herein, we describe the architecture of the PDBx/mmCIF data standard, tools used to maintain representations of the data standard, governance, and processes by which data content standards are extended, plus community tools/software libraries available for processing and checking the integrity of PDBx/mmCIF data. Use cases exemplify how the members of the Worldwide Protein Data Bank have used PDBx/mmCIF as the foundation for their pipeline for delivering Findable, Accessible, Interoperable, and Reusable (FAIR) data to many millions of users worldwide.
Affiliation(s)
- John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Jasmine Y Young
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Chenghua Shao
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Zukang Feng
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Vladimir Guranovic
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Catherine L Lawson
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Brinda Vallat
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Paul D Adams
- Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Department of Bioengineering, University of California at Berkeley, Berkeley, CA 94720, USA
| | - John M Berrisford
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Gerard Bricogne
- Global Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AK, UK
| | | | - Robbie P Joosten
- Department of Biochemistry, Netherlands Cancer Institute, Amsterdam, the Netherlands; Oncode Institute, 3521 AL Utrecht, the Netherlands. https://www.twitter.com/Robbie_Joosten
| | - Peter Keller
- Global Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AK, UK
| | - Nigel W Moriarty
- Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Oleg V Sobolev
- Molecular Biophysics and Integrated Bioimaging, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Clemens Vonrhein
- Global Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AK, UK
| | - David G Waterman
- UKRI-STFC Rutherford Appleton Laboratory, Didcot OX11 0FA, UK; CCP4, Research Complex at Harwell, Rutherford Appleton Laboratory, Didcot OX11 0FA, UK. https://www.twitter.com/upintheair
| | - Genji Kurisu
- Protein Data Bank Japan, Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
| | - Helen M Berman
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; The Bridge Institute, Michelson Center for Convergent Bioscience, University of Southern California, Los Angeles, CA, USA
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA.
| | - Ezra Peisach
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA.
| |
|
17
|
Sehnal D, Bittrich S, Deshpande M, Svobodová R, Berka K, Bazgier V, Velankar S, Burley SK, Koča J, Rose AS. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res 2021; 49:W431-W437. [PMID: 33956157 PMCID: PMC8262734 DOI: 10.1093/nar/gkab314] [Citation(s) in RCA: 516] [Impact Index Per Article: 172.0] [Received: 02/23/2021] [Revised: 04/12/2021] [Accepted: 04/26/2021] [Indexed: 12/31/2022] Open
Abstract
Large biomolecular structures are being determined experimentally on a daily basis using established techniques such as crystallography and electron microscopy. In addition, emerging integrative or hybrid methods (I/HM) are producing structural models of huge macromolecular machines and assemblies, sometimes containing hundreds of millions of non-hydrogen atoms. The performance requirements for visualization and analysis tools delivering these data are increasing rapidly. Significant progress in developing online, web-native three-dimensional (3D) visualization tools was previously accomplished with the introduction of the LiteMol suite and NGL Viewers. Thereafter, Mol* development was jointly initiated by PDBe and RCSB PDB to combine and build on the strengths of LiteMol (developed by PDBe) and NGL (developed by RCSB PDB). The web-native Mol* Viewer enables 3D visualization and streaming of macromolecular coordinate and experimental data, together with capabilities for displaying structure quality, functional, or biological context annotations. High-performance graphics and data management allow users to simultaneously visualise up to hundreds of (superimposed) protein structures, stream molecular dynamics simulation trajectories, render cell-level models, or display huge I/HM structures. It is the primary 3D structure viewer used by PDBe and RCSB PDB, and it can be easily integrated into third-party services. Mol* Viewer is open source and freely available at https://molstar.org/.
Affiliation(s)
- David Sehnal
- CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic; Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sebastian Bittrich
- Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA
| | - Mandar Deshpande
- Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Radka Svobodová
- CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
| | - Karel Berka
- Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
| | - Václav Bazgier
- Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
| | - Sameer Velankar
- Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854-8076, USA; Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08903-2681, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), San Diego Supercomputer Center and Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, CA 92093-0654, USA
| | - Jaroslav Koča
- CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
| | - Alexander S Rose
- Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA
| |
|
18
|
Rose Y, Duarte JM, Lowe R, Segura J, Bi C, Bhikadiya C, Chen L, Rose AS, Bittrich S, Burley SK, Westbrook JD. RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. J Mol Biol 2021; 433:166704. [PMID: 33186584 PMCID: PMC9093041 DOI: 10.1016/j.jmb.2020.11.003] [Citation(s) in RCA: 94] [Impact Index Per Article: 31.3] [Received: 09/30/2020] [Revised: 11/03/2020] [Accepted: 11/05/2020] [Indexed: 11/10/2022]
Abstract
The US Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB) serves many millions of unique users worldwide by delivering experimentally-determined 3D structures of biomolecules integrated with >40 external data resources via RCSB.org, application programming interfaces (APIs), and FTP downloads. Herein, we present the architectural redesign of RCSB PDB data delivery services that build on existing PDBx/mmCIF data schemas. New data access APIs (data.rcsb.org) enable efficient delivery of all PDB archive data. A novel GraphQL-based API provides flexible, declarative data retrieval along with a simple-to-use REST API. A powerful new search system (search.rcsb.org) seamlessly integrates heterogeneous types of searches across the PDB archive. Searches may combine text attributes, protein or nucleic acid sequences, small-molecule chemical descriptors, 3D macromolecular shapes, and sequence motifs. The new RCSB.org architecture adheres to the FAIR Principles, empowering users to address a wide array of research problems in fundamental biology, biomedicine, biotechnology, bioengineering, and bioenergy.
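As a rough illustration of the two access styles named in the abstract, the sketch below prepares a REST request and a GraphQL request for a single PDB entry using only the Python standard library. The host data.rcsb.org comes from the abstract; the URL paths, the `entry(entry_id: ...)` query shape, and the `struct { title }` fields are assumptions based on the publicly documented Data API and should be checked against the current documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed endpoint paths on the host named in the abstract (data.rcsb.org).
REST_ENTRY_BASE = "https://data.rcsb.org/rest/v1/core/entry"
GRAPHQL_URL = "https://data.rcsb.org/graphql"


def rest_entry_url(entry_id):
    """Build the REST URL for one entry; the ID is upper-cased and URL-quoted."""
    return f"{REST_ENTRY_BASE}/{quote(entry_id.upper())}"


def graphql_request(entry_id):
    """Build (but do not send) a GraphQL POST asking only for the entry title.

    The query shape is an assumption; GraphQL lets the client declare exactly
    which fields it wants, in contrast to the fixed REST document.
    """
    query = "{ entry(entry_id: %s) { struct { title } } }" % json.dumps(entry_id.upper())
    body = json.dumps({"query": query}).encode("utf-8")
    return Request(GRAPHQL_URL, data=body, headers={"Content-Type": "application/json"})


def fetch_json(target):
    """Send a URL string or a prepared Request and decode the JSON reply."""
    with urlopen(target, timeout=30) as response:
        return json.load(response)


# Preparing the requests needs no network access.
url = rest_entry_url("4hhb")
request = graphql_request("4hhb")
```

Calling `fetch_json(url)` or `fetch_json(request)` would perform the actual network round trip; it is left out here so the sketch runs offline.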
Affiliation(s)
- Yana Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Jose M Duarte
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Robert Lowe
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Joan Segura
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Chunxiao Bi
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Charmi Bhikadiya
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Li Chen
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Alexander S Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Sebastian Bittrich
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA; Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA.
| |
|
19
|
Burley SK, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow GV, Christie CH, Dalenberg K, Di Costanzo L, Duarte JM, Dutta S, Feng Z, Ganesan S, Goodsell DS, Ghosh S, Green RK, Guranović V, Guzenko D, Hudson BP, Lawson C, Liang Y, Lowe R, Namkoong H, Peisach E, Persikova I, Randle C, Rose A, Rose Y, Sali A, Segura J, Sekharan M, Shao C, Tao YP, Voigt M, Westbrook J, Young JY, Zardecki C, Zhuravleva M. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res 2021; 49:D437-D451. [PMID: 33211854 PMCID: PMC7779003 DOI: 10.1093/nar/gkaa1038] [Citation(s) in RCA: 834] [Impact Index Per Article: 278.0] [Received: 09/15/2020] [Revised: 10/14/2020] [Accepted: 11/17/2020] [Indexed: 12/14/2022] Open
Abstract
The Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), the US data center for the global PDB archive and a founding member of the Worldwide Protein Data Bank partnership, serves tens of thousands of data depositors in the Americas and Oceania and makes 3D macromolecular structure data available at no charge and without restrictions to millions of RCSB.org users around the world, including >660 000 educators, students and members of the curious public using PDB101.RCSB.org. PDB data depositors include structural biologists using macromolecular crystallography, nuclear magnetic resonance spectroscopy, 3D electron microscopy and micro-electron diffraction. PDB data consumers accessing our web portals include researchers, educators and students studying fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. During the past 2 years, the research-focused RCSB PDB web portal (RCSB.org) has undergone a complete redesign, enabling improved searching with full Boolean operator logic and more facile access to PDB data integrated with >40 external biodata resources. New features and resources are described in detail using examples that showcase recently released structures of SARS-CoV-2 proteins and host cell proteins relevant to understanding and addressing the COVID-19 global pandemic.
Affiliation(s)
- Stephen K Burley
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Charmi Bhikadiya
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Chunxiao Bi
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Sebastian Bittrich
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Li Chen
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Gregg V Crichlow
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Cole H Christie
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Kenneth Dalenberg
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Luigi Di Costanzo
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Jose M Duarte
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Shuchismita Dutta
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Zukang Feng
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Sai Ganesan
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Biotherapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
| | - David S Goodsell
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Center for Computational Structural Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
| | - Sutapa Ghosh
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Rachel Kramer Green
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Vladimir Guranović
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Dmytro Guzenko
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Brian P Hudson
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Catherine L Lawson
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Yuhe Liang
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Robert Lowe
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Harry Namkoong
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Ezra Peisach
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Irina Persikova
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Chris Randle
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Alexander Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Yana Rose
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Andrej Sali
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Department of Biotherapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Joan Segura
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, La Jolla, CA 92093, USA
| | - Monica Sekharan
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Chenghua Shao
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Yi-Ping Tao
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Maria Voigt
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - John D Westbrook
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08901, USA
| | - Jasmine Y Young
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Christine Zardecki
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Marina Zhuravleva
- Research Collaboratory for Structural Bioinformatics Protein Data Bank, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| |
|
20
|
Bittrich S, Burley SK, Rose AS. Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 2020; 16:e1008502. [PMID: 33284792 PMCID: PMC7746303 DOI: 10.1371/journal.pcbi.1008502] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/30/2020] [Revised: 12/17/2020] [Accepted: 11/09/2020] [Indexed: 12/30/2022] Open
Abstract
Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.
The Protein Data Bank (PDB) provides open access to more than 170,000 three-dimensional structures of proteins, nucleic acids, and biological complexes. Similarities between PDB structures give valuable functional and evolutionary insights but such resemblance may not be evident at sequence or global structure level. Throughout the database, there are recurring structural motifs: groups of modest numbers of residues in proximity that, for example, support catalytic activity. Identification of common structural motifs can reveal similarities between proteins and serve as fingerprints for spatial configurations of amino acids, such as the His-Asp-Ser catalytic triad found in serine proteases or the zinc coordination site found in Zinc Finger DNA-binding domains. We present a highly efficient yet flexible strategy that allows users for the first time to search for arbitrary structural motifs across the entire PDB archive in real-time. Our approach scales favorably with the increasing number and complexity of deposited structures, and, also, has the potential to be adapted for other applications in a macromolecular context.
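The general idea of an inverted index for motif lookup can be sketched in a few lines. The toy below is not the authors' implementation: the descriptor (a sorted residue-name pair plus a hard-binned distance), the bin width, the function names, and the three tiny "structures" are all assumptions made for illustration. It shows only the filtering step the abstract emphasizes, namely touching just the posting lists for the query's descriptors rather than scanning every structure.

```python
from collections import defaultdict
from itertools import combinations
from math import dist


def pair_keys(residues, bin_width=1.0):
    """Yield one descriptor per residue pair: sorted names plus a binned distance."""
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(residues, 2):
        first, second = sorted((name_a, name_b))
        yield (first, second, int(dist(xyz_a, xyz_b) // bin_width))


def build_index(structures, bin_width=1.0):
    """Map each descriptor to the set of structure IDs that contain it."""
    index = defaultdict(set)
    for structure_id, residues in structures.items():
        for key in pair_keys(residues, bin_width):
            index[key].add(structure_id)
    return index


def candidates(index, motif, bin_width=1.0):
    """Intersect the posting lists of every descriptor in the query motif."""
    postings = [index.get(key, set()) for key in pair_keys(motif, bin_width)]
    return set.intersection(*postings) if postings else set()


# Hypothetical toy data: (residue name, representative-atom coordinates).
structures = {
    "toy1": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (0.0, 4.5, 0.0))],
    "toy2": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (9.0, 0.0, 0.0)), ("SER", (0.0, 4.5, 0.0))],
    "toy3": [("GLY", (0.0, 0.0, 0.0)), ("ALA", (3.2, 0.0, 0.0))],
}
motif = [("HIS", (1.0, 1.0, 1.0)), ("ASP", (4.1, 1.0, 1.0)), ("SER", (1.0, 5.6, 1.0))]

index = build_index(structures)
hits = candidates(index, motif)
```

In this toy, hard distance bins would miss pairs that fall just across a bin boundary, and an ID surviving the intersection is only a candidate: a practical search would add a distance tolerance and then verify each candidate geometrically, for example by superposition.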
Affiliation(s)
- Sebastian Bittrich
- RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
| | - Stephen K. Burley
- RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
- Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
- Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA
- Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, California, USA
| | - Alexander S. Rose
- RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
| |
|