Reference Citation Analysis

For: Valasatava Y, Bradley AR, Rose AS, Duarte JM, Prlić A, Rose PW. Towards an efficient compression of 3D coordinates of macromolecular structures. PLoS One 2017;12:e0174846. [PMID: 28362865 PMCID: PMC5376293 DOI: 10.1371/journal.pone.0174846] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 02/03/2017] [Accepted: 03/16/2017] [Indexed: 11/18/2022] Open

Cited by Other Article(s)

van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 2024;42:243-246. [PMID: 37156916 PMCID: PMC10869269 DOI: 10.1038/s41587-023-01773-0] [Citation(s) in RCA: 330] [Impact Index Per Article: 330.0] [Received: 02/17/2022] [Accepted: 03/30/2023] [Indexed: 05/10/2023]

Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024;25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024] Open

Abstract

BACKGROUND

Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.

RESULTS

Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.

CONCLUSION

Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.

Staniscia L, Yu YW. Image-centric compression of protein structures improves space savings. BMC Bioinformatics 2023;24:437. [PMID: 37990290 PMCID: PMC10664254 DOI: 10.1186/s12859-023-05570-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Accepted: 11/15/2023] [Indexed: 11/23/2023] Open

Zhang C, Pyle AM. PDC: a highly compact file format to store protein 3D coordinates. Database (Oxford) 2023;2023:baad018. [PMID: 37010520 PMCID: PMC10069377 DOI: 10.1093/database/baad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/01/2023] [Accepted: 03/07/2023] [Indexed: 04/04/2023]

Kim H, Mirdita M, Steinegger M. Foldcomp: a library and format for compressing and indexing large protein structure sets. Bioinformatics 2023;39:btad153. [PMID: 36961332 PMCID: PMC10085514 DOI: 10.1093/bioinformatics/btad153] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/08/2022] [Revised: 02/17/2023] [Accepted: 03/19/2023] [Indexed: 03/25/2023] Open

Sehnal D, Bittrich S, Deshpande M, Svobodová R, Berka K, Bazgier V, Velankar S, Burley SK, Koča J, Rose AS. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res 2021;49:W431-W437. [PMID: 33956157 PMCID: PMC8262734 DOI: 10.1093/nar/gkab314] [Citation(s) in RCA: 501] [Impact Index Per Article: 167.0] [Received: 02/23/2021] [Revised: 04/12/2021] [Accepted: 04/26/2021] [Indexed: 12/31/2022] Open

Affiliation(s)

David Sehnal: CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic; Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
Sebastian Bittrich: Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA
Mandar Deshpande: Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
Radka Svobodová: CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
Karel Berka: Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
Václav Bazgier: Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
Sameer Velankar: Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
Stephen K Burley: Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854-8076, USA; Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08903-2681, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), San Diego Supercomputer Center and Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, CA 92093-0654, USA
Jaroslav Koča: CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
Alexander S Rose: Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA

Bittrich S, Burley SK, Rose AS. Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 2020;16:e1008502. [PMID: 33284792 PMCID: PMC7746303 DOI: 10.1371/journal.pcbi.1008502] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/30/2020] [Revised: 12/17/2020] [Accepted: 11/09/2020] [Indexed: 12/30/2022] Open

Abstract

Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.

The Protein Data Bank (PDB) provides open access to more than 170,000 three-dimensional structures of proteins, nucleic acids, and biological complexes. Similarities between PDB structures give valuable functional and evolutionary insights but such resemblance may not be evident at sequence or global structure level. Throughout the database, there are recurring structural motifs—groups of modest numbers of residues in proximity that, for example, support catalytic activity. Identification of common structural motifs can reveal similarities between proteins and serve as fingerprints for spatial configurations of amino acids, such as the His-Asp-Ser catalytic triad found in serine proteases or the zinc coordination site found in Zinc Finger DNA-binding domains. We present a highly efficient yet flexible strategy that allows users for the first time to search for arbitrary structural motifs across the entire PDB archive in real-time. Our approach scales favorably with the increasing number and complexity of deposited structures, and, also, has the potential to be adapted for other applications in a macromolecular context.

BinaryCIF and CIFTools-Lightweight, efficient and extensible macromolecular data management. PLoS Comput Biol 2020;16:e1008247. [PMID: 33075050 PMCID: PMC7595629 DOI: 10.1371/journal.pcbi.1008247] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/14/2020] [Revised: 10/29/2020] [Accepted: 08/14/2020] [Indexed: 02/07/2023] Open

Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlić A, Rose PW. NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 2018;34:3755-3758. [PMID: 29850778 DOI: 10.1093/bioinformatics/bty419] [Citation(s) in RCA: 323] [Impact Index Per Article: 64.6] [Received: 08/15/2017] [Accepted: 05/22/2018] [Indexed: 12/25/2022] Open

Sehnal D, Deshpande M, Vařeková RS, Mir S, Berka K, Midlik A, Pravda L, Velankar S, Koča J. LiteMol suite: interactive web-based visualization of large-scale macromolecular structure data. Nat Methods 2017;14:1121-1122. [PMID: 29190272 DOI: 10.1038/nmeth.4499] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Indexed: 12/19/2022]

Affiliation(s)

David Sehnal: CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic; Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Mandar Deshpande: Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Radka Svobodová Vařeková: CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
Saqib Mir: Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Karel Berka: Regional Centre of Advanced Technologies and Materials, Department of Physical Chemistry, Faculty of Science, Palacký University, Olomouc, Czech Republic
Adam Midlik: CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
Lukáš Pravda: CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
Sameer Velankar: Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
Jaroslav Koča: CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic; National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic

Lafita A, Bliven S, Prlić A, Guzenko D, Rose PW, Bradley A, Pavan P, Myers-Turnbull D, Valasatava Y, Heuer M, Larson M, Burley SK, Duarte JM. BioJava 5: A community driven open-source bioinformatics library. PLoS Comput Biol 2019;15:e1006791. [PMID: 30735498 PMCID: PMC6383946 DOI: 10.1371/journal.pcbi.1006791] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 10/26/2018] [Revised: 02/21/2019] [Accepted: 01/13/2019] [Indexed: 11/19/2022] Open

Affiliation(s)

Aleix Lafita: European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
Spencer Bliven: Zurich University of Applied Sciences (ZHAW), Zurich CH-8021, Switzerland
Andreas Prlić: San Diego Supercomputer Center, UCSD, San Diego, CA 92093, USA
Dmytro Guzenko: Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
Peter W. Rose: Structural Bioinformatics Laboratory, San Diego Supercomputer Center, UCSD, San Diego, CA 92093, USA
Anthony Bradley: Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK; Department of Chemistry, University of Oxford, Oxford OX1 3TA, UK; Diamond Light Source Ltd., Didcot OX11 0DE, UK
Paolo Pavan: Genomnia srl, Bresso, Milan, Italy
Douglas Myers-Turnbull: Institute for Neurodegenerative Diseases, University of California, San Francisco, CA 94143, USA
Yana Valasatava: Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
Michael Heuer: RISELab, University of California Berkeley, Berkeley, CA, USA
Matt Larson: DNASTAR, Madison, WI, USA
Stephen K. Burley: Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA; Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Rutgers Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, NJ 08903, USA
Jose M. Duarte: Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA

Kinjo AR, Bekker GJ, Wako H, Endo S, Tsuchiya Y, Sato H, Nishi H, Kinoshita K, Suzuki H, Kawabata T, Yokochi M, Iwata T, Kobayashi N, Fujiwara T, Kurisu G, Nakamura H. New tools and functions in data-out activities at Protein Data Bank Japan (PDBj). Protein Sci 2017;27:95-102. [PMID: 28815765 PMCID: PMC5734392 DOI: 10.1002/pro.3273] [Citation(s) in RCA: 61] [Impact Index Per Article: 8.7] [Received: 06/28/2017] [Accepted: 08/14/2017] [Indexed: 11/23/2022]

MMTF-An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLoS Comput Biol 2017;13:e1005575. [PMID: 28574982 PMCID: PMC5473584 DOI: 10.1371/journal.pcbi.1005575] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/06/2017] [Revised: 06/16/2017] [Accepted: 05/16/2017] [Indexed: 11/19/2022] Open