1. van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, Steinegger M. Fast and accurate protein structure search with Foldseek. Nat Biotechnol 2024; 42:243-246. [PMID: 37156916 PMCID: PMC10869269 DOI: 10.1038/s41587-023-01773-0] [Citation(s) in RCA: 330] [Impact Index Per Article: 330.0] [Received: 02/17/2022] [Accepted: 03/30/2023] [Indexed: 05/10/2023]
Abstract
As structure prediction methods are generating millions of publicly available protein structures, searching these databases is becoming a bottleneck. Foldseek aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet. Foldseek decreases computation times by four to five orders of magnitude with 86%, 88% and 133% of the sensitivities of Dali, TM-align and CE, respectively.
Affiliation(s)
- Michel van Kempen
  - Quantitative and Computational Biology Group, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
- Stephanie S Kim
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Milot Mirdita
  - Quantitative and Computational Biology Group, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Jeongjae Lee
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
- Johannes Söding
  - Quantitative and Computational Biology Group, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
  - Campus Institute Data Science (CIDAS), Göttingen, Germany
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
2. Draizen EJ, Readey J, Mura C, Bourne PE. Prop3D: A flexible, Python-based platform for machine learning with protein structural properties and biophysical data. BMC Bioinformatics 2024; 25:11. [PMID: 38177985 PMCID: PMC10768222 DOI: 10.1186/s12859-023-05586-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 11/27/2023] [Indexed: 01/06/2024] [Open Access]
Abstract
BACKGROUND: Machine learning (ML) has a rich history in structural bioinformatics, and modern approaches, such as deep learning, are revolutionizing our knowledge of the subtle relationships between biomolecular sequence, structure, function, dynamics and evolution. As with any advance that rests upon statistical learning approaches, the recent progress in biomolecular sciences is enabled by the availability of vast volumes of sufficiently-variable data. To be useful, such data must be well-structured, machine-readable, intelligible and manipulable. These and related requirements pose challenges that become especially acute at the computational scales typical in ML. Furthermore, in structural bioinformatics such data generally relate to protein three-dimensional (3D) structures, which are inherently more complex than sequence-based data. A significant and recurring challenge concerns the creation of large, high-quality, openly-accessible datasets that can be used for specific training and benchmarking tasks in ML pipelines for predictive modeling projects, along with reproducible splits for training and testing.
RESULTS: Here, we report 'Prop3D', a platform that allows for the creation, sharing and extensible reuse of libraries of protein domains, featurized with biophysical and evolutionary properties that can range from detailed, atomically-resolved physicochemical quantities (e.g., electrostatics) to coarser, residue-level features (e.g., phylogenetic conservation). As a community resource, we also supply a 'Prop3D-20sf' protein dataset, obtained by applying our approach to CATH. We have developed and deployed the Prop3D framework, both in the cloud and on local HPC resources, to systematically and reproducibly create comprehensive datasets via the Highly Scalable Data Service (HSDS). Our datasets are freely accessible via a public HSDS instance, or they can be used with accompanying Python wrappers for popular ML frameworks.
CONCLUSION: Prop3D and its associated Prop3D-20sf dataset can be of broad utility in at least three ways. Firstly, the Prop3D workflow code can be customized and deployed on various cloud-based compute platforms, with scalability achieved largely by saving the results to distributed HDF5 files via HSDS. Secondly, the linked Prop3D-20sf dataset provides a hand-crafted, already-featurized dataset of protein domains for 20 highly-populated CATH families; importantly, provision of this pre-computed resource can aid the more efficient development (and reproducible deployment) of ML pipelines. Thirdly, Prop3D-20sf's construction explicitly takes into account (in creating datasets and data-splits) the enigma of 'data leakage', stemming from the evolutionary relationships between proteins.
Affiliation(s)
- Eli J Draizen
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Cameron Mura
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Philip E Bourne
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
3. Staniscia L, Yu YW. Image-centric compression of protein structures improves space savings. BMC Bioinformatics 2023; 24:437. [PMID: 37990290 PMCID: PMC10664254 DOI: 10.1186/s12859-023-05570-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2023] [Accepted: 11/15/2023] [Indexed: 11/23/2023] [Open Access]
Abstract
BACKGROUND: Because of the rapid generation of data, the study of compression algorithms to reduce storage and transmission costs is important to bioinformaticians. Much of the focus has been on sequence data, including both genomes and protein amino acid sequences stored in FASTA files. Current standard practice is to use an ordinary lossless compressor such as gzip on a sequential list of atomic coordinates, but this approach expends bits on saving an arbitrary ordering of atoms, and it also prevents reordering the atoms for compressibility. The standard MMTF and BCIF file formats extend this approach with custom encoding of the coordinates. However, the brand new Foldcomp tool introduces a new paradigm of compressing local angles, to great effect. In this article, we explore a different paradigm, showing for the first time that image-based compression using global angles can also significantly improve compression ratios. To this end, we implement a prototype compressor 'PIC', specialized for point clouds of atom coordinates contained in PDB and mmCIF files. PIC maps the 3D data to a 2D 8-bit greyscale image and leverages the well-developed PNG image compressor to minimize the size of the resulting image, forming the compressed file.
RESULTS: PIC outperforms gzip in terms of compression ratio on proteins over 20,000 atoms in size, with a savings over gzip of up to 37.4% on the proteins compressed. In addition, PIC's compression ratio increases with protein size.
CONCLUSION: Image-centric compression as demonstrated by our prototype PIC provides a potential means of constructing 3D structure-aware protein compression software, though future work would be necessary to make this practical.
Affiliation(s)
- Luke Staniscia
  - Department of Mathematics, University of Toronto, Toronto, ON, Canada
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, ON, Canada
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
4. Zhang C, Pyle AM. PDC: a highly compact file format to store protein 3D coordinates. Database (Oxford) 2023; 2023:baad018. [PMID: 37010520 PMCID: PMC10069377 DOI: 10.1093/database/baad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2022] [Revised: 03/01/2023] [Accepted: 03/07/2023] [Indexed: 04/04/2023]
Abstract
Recent improvements in computational and experimental techniques for obtaining protein structures have resulted in an explosion of 3D coordinate data. To cope with the ever-increasing sizes of structure databases, this work proposes the Protein Data Compression (PDC) format, which compresses coordinates and temperature factors of full-atomic and Cα-only protein structures. Without loss of precision, PDC results in 69% to 78% smaller file sizes than Protein Data Bank (PDB) and macromolecular Crystallographic Information File (mmCIF) files with standard GZIP compression. It uses ∼60% less space than existing compression algorithms specific to macromolecular structures. PDC optionally performs lossy compression with minimal sacrifice of precision, which allows reduction of file sizes by another 79%. Conversion between PDC, mmCIF and PDB formats is typically achieved within 0.02 s. The compactness and fast reading/writing speed of PDC make it valuable for storage and analysis of large quantities of tertiary structural data. Database URL: https://github.com/kad-ecoli/pdc
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Av, Ann Arbor, MI 48109, USA
  - Howard Hughes Medical Institute, 4000 Jones Bridge Rd, Chevy Chase, MD 20815, USA
  - Department of Molecular, Cellular, and Developmental Biology, Yale University, 266 Whitney Av, New Haven, CT 06511, USA
- Anna Marie Pyle
  - Howard Hughes Medical Institute, 4000 Jones Bridge Rd, Chevy Chase, MD 20815, USA
  - Department of Molecular, Cellular, and Developmental Biology, Yale University, 266 Whitney Av, New Haven, CT 06511, USA
  - Department of Chemistry, Yale University, 225 Prospect St, New Haven, CT 06511, USA
5. Kim H, Mirdita M, Steinegger M. Foldcomp: a library and format for compressing and indexing large protein structure sets. Bioinformatics 2023; 39:btad153. [PMID: 36961332 PMCID: PMC10085514 DOI: 10.1093/bioinformatics/btad153] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 12/08/2022] [Revised: 02/17/2023] [Accepted: 03/19/2023] [Indexed: 03/25/2023] [Open Access]
Abstract
SUMMARY: Highly accurate protein structure predictors have generated hundreds of millions of protein structures; these pose a challenge in terms of storage and processing. Here, we present Foldcomp, a novel lossy structure compression algorithm and indexing system to address this challenge. By using a combination of internal and Cartesian coordinates and a bi-directional NeRF-based strategy, Foldcomp improves the compression ratio by a factor of three compared to the next best method. Its reconstruction error of 0.08 Å is comparable to the best lossy compressor. It is five times faster than the next fastest compressor and competes with the fastest decompressors. With its multi-threading implementation and a Python interface that allows for easy database downloads and efficient querying of protein structures by accession, Foldcomp is a powerful tool for managing and analysing large collections of protein structures.
AVAILABILITY AND IMPLEMENTATION: Foldcomp is free open-source software (GPLv3) and available for Linux, macOS, and Windows at https://foldcomp.foldseek.com. Foldcomp provides the AlphaFold Swiss-Prot (2.9GB), TrEMBL (1.1TB), and ESMatlas HQ (114GB) databases ready for download.
Affiliation(s)
- Hyunbin Kim
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
- Milot Mirdita
  - School of Biological Sciences, Seoul National University, Seoul 08826, South Korea
- Martin Steinegger
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, South Korea
  - School of Biological Sciences, Seoul National University, Seoul 08826, South Korea
  - Institute of Molecular Biology and Genetics, Seoul National University, Seoul 08826, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul 08826, South Korea
6. Sehnal D, Bittrich S, Deshpande M, Svobodová R, Berka K, Bazgier V, Velankar S, Burley SK, Koča J, Rose AS. Mol* Viewer: modern web app for 3D visualization and analysis of large biomolecular structures. Nucleic Acids Res 2021; 49:W431-W437. [PMID: 33956157 PMCID: PMC8262734 DOI: 10.1093/nar/gkab314] [Citation(s) in RCA: 501] [Impact Index Per Article: 167.0] [Received: 02/23/2021] [Revised: 04/12/2021] [Accepted: 04/26/2021] [Indexed: 12/31/2022] [Open Access]
Abstract
Large biomolecular structures are being determined experimentally on a daily basis using established techniques such as crystallography and electron microscopy. In addition, emerging integrative or hybrid methods (I/HM) are producing structural models of huge macromolecular machines and assemblies, sometimes containing 100s of millions of non-hydrogen atoms. The performance requirements for visualization and analysis tools delivering these data are increasing rapidly. Significant progress in developing online, web-native three-dimensional (3D) visualization tools was previously accomplished with the introduction of the LiteMol suite and NGL Viewers. Thereafter, Mol* development was jointly initiated by PDBe and RCSB PDB to combine and build on the strengths of LiteMol (developed by PDBe) and NGL (developed by RCSB PDB). The web-native Mol* Viewer enables 3D visualization and streaming of macromolecular coordinate and experimental data, together with capabilities for displaying structure quality, functional, or biological context annotations. High-performance graphics and data management allow users to simultaneously visualise up to hundreds of (superimposed) protein structures, stream molecular dynamics simulation trajectories, render cell-level models, or display huge I/HM structures. It is the primary 3D structure viewer used by PDBe and RCSB PDB. It can be easily integrated into third-party services. Mol* Viewer is open source and freely available at https://molstar.org/.
Affiliation(s)
- David Sehnal
  - CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Sebastian Bittrich
  - Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA
- Mandar Deshpande
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Radka Svobodová
  - CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
- Karel Berka
  - Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
- Václav Bazgier
  - Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
- Sameer Velankar
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Stephen K Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), Institute for Quantitative Biomedicine and Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, NJ 08854-8076, USA
  - Rutgers Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, NJ 08903-2681, USA
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB PDB), San Diego Supercomputer Center and Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, San Diego, CA 92093-0654, USA
- Jaroslav Koča
  - CEITEC - Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
- Alexander S Rose
  - Research Collaboratory for Structural Bioinformatics (RCSB), San Diego Supercomputer Center, University of California San Diego, San Diego, CA 92093-0743, USA
7. Bittrich S, Burley SK, Rose AS. Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 2020; 16:e1008502. [PMID: 33284792 PMCID: PMC7746303 DOI: 10.1371/journal.pcbi.1008502] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 04/30/2020] [Revised: 12/17/2020] [Accepted: 11/09/2020] [Indexed: 12/30/2022] [Open Access]
Abstract
Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins. Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. Existing methods depend on a clustering by sequence similarity and can be computationally slow. We have developed a new approach that uses an inverted index strategy capable of analyzing >170,000 PDB structures with unmatched speed. The efficiency of the inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach (implemented at motif.rcsb.org) enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user. Herein, we describe the method and present five case studies that exemplify its efficacy and speed for analyzing 3D structures of both proteins and nucleic acids.
The Protein Data Bank (PDB) provides open access to more than 170,000 three-dimensional structures of proteins, nucleic acids, and biological complexes. Similarities between PDB structures give valuable functional and evolutionary insights but such resemblance may not be evident at sequence or global structure level. Throughout the database, there are recurring structural motifs: groups of modest numbers of residues in proximity that, for example, support catalytic activity. Identification of common structural motifs can reveal similarities between proteins and serve as fingerprints for spatial configurations of amino acids, such as the His-Asp-Ser catalytic triad found in serine proteases or the zinc coordination site found in Zinc Finger DNA-binding domains. We present a highly efficient yet flexible strategy that allows users for the first time to search for arbitrary structural motifs across the entire PDB archive in real-time. Our approach scales favorably with the increasing number and complexity of deposited structures, and, also, has the potential to be adapted for other applications in a macromolecular context.
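The inverted index idea in the abstract above can be illustrated with a toy example. The sketch below is not the authors' implementation: the descriptor (a sorted pair of residue names plus a binned distance), the `bin_width` parameter, and the structure names `toy_A` and `toy_B` are all assumptions made for illustration. It only shows why a lookup touches the few structures that share the motif's descriptors and ignores the rest.

```python
from collections import defaultdict
from itertools import combinations
import math


def pair_descriptors(residues, bin_width=1.0):
    """Yield one hashable descriptor per residue pair.

    `residues` is a list of (residue_name, (x, y, z)) tuples. The descriptor
    layout is a hypothetical choice, not taken from the paper.
    """
    for (name_a, xyz_a), (name_b, xyz_b) in combinations(residues, 2):
        distance = math.dist(xyz_a, xyz_b)
        names = tuple(sorted((name_a, name_b)))
        yield names + (int(distance // bin_width),)


def build_index(structures):
    """Map each descriptor to the set of structure IDs that contain it."""
    index = defaultdict(set)
    for structure_id, residues in structures.items():
        for descriptor in pair_descriptors(residues):
            index[descriptor].add(structure_id)
    return index


def candidates(index, motif):
    """Return IDs of structures that contain every descriptor of the motif."""
    hits = None
    for descriptor in pair_descriptors(motif):
        ids = index.get(descriptor, set())
        hits = set(ids) if hits is None else hits & ids
        if not hits:
            return set()  # one missing descriptor rules out every structure
    return hits or set()


# Made-up coordinates, in angstroms, for two toy structures.
structures = {
    "toy_A": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (0.0, 4.5, 0.0))],
    "toy_B": [("HIS", (0.0, 0.0, 0.0)), ("ASP", (9.0, 0.0, 0.0)), ("GLY", (0.0, 2.0, 0.0))],
}
index = build_index(structures)

# The same His-Asp-Ser geometry as toy_A, shifted in space.
motif = [("HIS", (1.0, 1.0, 1.0)), ("ASP", (4.2, 1.0, 1.0)), ("SER", (1.0, 5.5, 1.0))]
print(candidates(index, motif))
```

Hard distance bins miss pairs that fall just across a bin boundary, so a practical search would need some tolerance and a geometric verification step on the surviving candidates; both are left out of this sketch.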
Affiliation(s)
- Sebastian Bittrich
  - RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
- Stephen K. Burley
  - RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
  - RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, Piscataway, New Jersey, USA
  - Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, New Brunswick, New Jersey, USA
  - Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, California, USA
- Alexander S. Rose
  - RCSB Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, California, USA
8. BinaryCIF and CIFTools-Lightweight, efficient and extensible macromolecular data management. PLoS Comput Biol 2020; 16:e1008247. [PMID: 33075050 PMCID: PMC7595629 DOI: 10.1371/journal.pcbi.1008247] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 05/14/2020] [Revised: 10/29/2020] [Accepted: 08/14/2020] [Indexed: 02/07/2023] [Open Access]
Abstract
3D macromolecular structural data is growing ever more complex and plentiful in the wake of substantive advances in experimental and computational structure determination methods including macromolecular crystallography, cryo-electron microscopy, and integrative methods. Efficient means of working with 3D macromolecular structural data for archiving, analyses, and visualization are central to facilitating interoperability and reusability in compliance with the FAIR Principles. We address two challenges posed by growth in data size and complexity. First, data size is reduced by bespoke compression techniques. Second, complexity is managed through improved software tooling and fully leveraging available data dictionary schemas. To this end, we introduce BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas, such as PDBx/mmCIF, while reducing file sizes by more than a factor of two versus gzip-compressed CIF files. Moreover, for the largest structures, BinaryCIF provides even better compression: factors of ten and four versus CIF files and gzipped CIF files, respectively. Herein, we describe CIFTools, a set of libraries in Java and TypeScript for generic and typed handling of CIF and BinaryCIF files. Together, BinaryCIF and CIFTools enable lightweight, efficient, and extensible handling of 3D macromolecular structural data.
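The abstract above does not spell out its "bespoke compression techniques". As a hedged illustration of the general kind of column-wise encoding a binary structure format can apply, the sketch below chains three generic steps: fixed-point conversion, delta encoding, and run-length encoding. The function names, the `factor` parameter, and the sample columns are assumptions for illustration and are not taken from the BinaryCIF specification.

```python
def fixed_point(values, factor=1000):
    """Store floats as integers, keeping three decimals when factor is 1000."""
    return [round(v * factor) for v in values]


def delta_encode(ints):
    """Keep the first value, then only the difference to the previous value."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    """Invert delta_encode by accumulating a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def run_length_encode(ints):
    """Collapse repeated neighbours into [value, count] pairs."""
    runs = []
    for v in ints:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def run_length_decode(runs):
    """Invert run_length_encode."""
    return [v for v, count in runs for _ in range(count)]


# Made-up columns of the kind found in an atom table.
atom_serials = [101, 102, 103, 104, 105]            # consecutive IDs
residue_numbers = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]    # many repeats
x_coords = [12.345, 12.351, 12.360]                 # angstroms, 3 decimals

packed_serials = run_length_encode(delta_encode(atom_serials))
packed_residues = run_length_encode(residue_numbers)
packed_x = delta_encode(fixed_point(x_coords))

print(packed_serials)
print(packed_residues)
print(packed_x)
```

Consecutive identifiers collapse to a start value plus one run of ones, repeated residue numbers collapse to a few runs, and neighbouring coordinates turn into small integers that a later byte-packing step can store compactly.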
9. Rose AS, Bradley AR, Valasatava Y, Duarte JM, Prlic A, Rose PW. NGL viewer: web-based molecular graphics for large complexes. Bioinformatics 2019; 34:3755-3758. [PMID: 29850778 DOI: 10.1093/bioinformatics/bty419] [Citation(s) in RCA: 323] [Impact Index Per Article: 64.6] [Received: 08/15/2017] [Accepted: 05/22/2018] [Indexed: 12/25/2022] [Open Access]
Abstract
Motivation: The interactive visualization of very large macromolecular complexes on the web is becoming a challenging problem as experimental techniques advance at an unprecedented rate and deliver structures of increasing size.
Results: We have tackled this problem by developing highly memory-efficient and scalable extensions for the NGL WebGL-based molecular viewer and by using the Macromolecular Transmission Format (MMTF), a binary and compressed file format. These enable NGL to download and render molecular complexes with millions of atoms interactively on desktop computers and smartphones alike, making it a tool of choice for web-based molecular visualization in research and education.
Availability and implementation: The source code is freely available under the MIT license at github.com/arose/ngl and distributed on NPM (npmjs.com/package/ngl). MMTF-JavaScript encoders and decoders are available at github.com/rcsb/mmtf-javascript.
Affiliation(s)
- Alexander S Rose
  - RCSB Protein Data Bank
  - San Diego Supercomputer Center, UC San Diego, CA, USA
- Anthony R Bradley
  - RCSB Protein Data Bank
  - San Diego Supercomputer Center, UC San Diego, CA, USA
- Jose M Duarte
  - RCSB Protein Data Bank
  - San Diego Supercomputer Center, UC San Diego, CA, USA
- Andreas Prlic
  - RCSB Protein Data Bank
  - San Diego Supercomputer Center, UC San Diego, CA, USA
- Peter W Rose
  - RCSB Protein Data Bank
  - San Diego Supercomputer Center, UC San Diego, CA, USA
10. Sehnal D, Deshpande M, Vařeková RS, Mir S, Berka K, Midlik A, Pravda L, Velankar S, Koča J. LiteMol suite: interactive web-based visualization of large-scale macromolecular structure data. Nat Methods 2019; 14:1121-1122. [PMID: 29190272 DOI: 10.1038/nmeth.4499] [Citation(s) in RCA: 96] [Impact Index Per Article: 19.2] [Indexed: 12/19/2022]
Affiliation(s)
- David Sehnal
  - CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Mandar Deshpande
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Radka Svobodová Vařeková
  - CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
- Saqib Mir
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Karel Berka
  - Regional Centre of Advanced Technologies and Materials, Department of Physical Chemistry, Faculty of Science, Palacký University, Olomouc, Czech Republic
- Adam Midlik
  - CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
- Lukáš Pravda
  - CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
- Sameer Velankar
  - Protein Data Bank in Europe (PDBe), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Jaroslav Koča
  - CEITEC, Central European Institute of Technology, Masaryk University, Brno, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, Czech Republic
11. Lafita A, Bliven S, Prlić A, Guzenko D, Rose PW, Bradley A, Pavan P, Myers-Turnbull D, Valasatava Y, Heuer M, Larson M, Burley SK, Duarte JM. BioJava 5: A community driven open-source bioinformatics library. PLoS Comput Biol 2019; 15:e1006791. [PMID: 30735498 PMCID: PMC6383946 DOI: 10.1371/journal.pcbi.1006791] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Received: 10/26/2018] [Revised: 02/21/2019] [Accepted: 01/13/2019] [Indexed: 11/19/2022] [Open Access]
Abstract
BioJava is an open-source project that provides a Java library for processing biological data. The project aims to simplify bioinformatic analyses by implementing parsers, data structures, and algorithms for common tasks in genomics, structural biology, ontologies, phylogenetics, and more. Since 2012, we have released two major versions of the library (4 and 5) that include many new features to tackle challenges with increasingly complex macromolecular structure data. BioJava requires Java 8 or higher and is freely available under the LGPL 2.1 license. The project is hosted on GitHub at https://github.com/biojava/biojava. More information and documentation can be found online on the BioJava website (http://www.biojava.org) and tutorial (https://github.com/biojava/biojava-tutorial). All inquiries should be directed to the GitHub page or the BioJava mailing list (http://lists.open-bio.org/mailman/listinfo/biojava-l).
Affiliation(s)
- Aleix Lafita
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Spencer Bliven
  - Zurich University of Applied Sciences (ZHAW), Zurich CH-8021, Switzerland
- Andreas Prlić
  - San Diego Supercomputer Center, UCSD, San Diego, CA 92093, USA
- Dmytro Guzenko
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
- Peter W. Rose
  - Structural Bioinformatics Laboratory, San Diego Supercomputer Center, UCSD, San Diego, CA 92093, USA
- Anthony Bradley
  - Structural Genomics Consortium, University of Oxford, Oxford OX3 7DQ, UK
  - Department of Chemistry, University of Oxford, Oxford OX1 3TA, UK
  - Diamond Light Source Ltd., Didcot OX11 0DE, UK
- Douglas Myers-Turnbull
  - Institute for Neurodegenerative Diseases, University of California, San Francisco, CA 94143, USA
- Yana Valasatava
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
- Michael Heuer
  - RISELab, University of California Berkeley, Berkeley, CA, USA
- Stephen K. Burley
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
  - Rutgers Cancer Institute of New Jersey, Robert Wood Johnson Medical School, New Brunswick, NJ 08903, USA
- Jose M. Duarte
  - Research Collaboratory for Structural Bioinformatics Protein Data Bank, San Diego Supercomputer Center, University of California, San Diego, La Jolla, CA 92093, USA
12. Kinjo AR, Bekker GJ, Wako H, Endo S, Tsuchiya Y, Sato H, Nishi H, Kinoshita K, Suzuki H, Kawabata T, Yokochi M, Iwata T, Kobayashi N, Fujiwara T, Kurisu G, Nakamura H. New tools and functions in data-out activities at Protein Data Bank Japan (PDBj). Protein Sci 2017; 27:95-102. [PMID: 28815765 PMCID: PMC5734392 DOI: 10.1002/pro.3273] [Citation(s) in RCA: 61] [Impact Index Per Article: 8.7] [Received: 06/28/2017] [Accepted: 08/14/2017] [Indexed: 11/23/2022]
Abstract
The Protein Data Bank Japan (PDBj), a member of the worldwide Protein Data Bank (wwPDB), accepts and processes the deposited data of experimentally determined biological macromolecular structures. In addition to archiving the PDB data in collaboration with the other wwPDB partners, PDBj also provides a wide range of original and unique services and tools, which are continuously improved and updated. Here, we report the new RDB PDBj Mine 2, the WebGL molecular viewer Molmil, the ProMode‐Elastic server for normal mode analysis, a virtual reality system for the eF‐site protein electrostatic molecular surfaces, the extensions of the Omokage search for molecular shape similarity, and the integration of PDBj and BMRB searches.
Affiliation(s)
- Akira R Kinjo
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Gert-Jan Bekker
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Hiroshi Wako
- School of Social Sciences, Waseda University, 1-6-1 Nishi-Waseda, Shinjuku-ku, Tokyo, 169-8050, Japan
| | - Shigeru Endo
- School of Science, Kitasato University, 1-15-1, Kitasato, Minami-ku, Sagamihara, Kanagawa, 252-0373, Japan
| | - Yuko Tsuchiya
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Hiromu Sato
- Graduate School of Information Sciences, Tohoku University, 6-3-09 Aoba, Aramaki-aza Aoba-ku, Sendai, 980-8579, Japan
| | - Hafumi Nishi
- Graduate School of Information Sciences, Tohoku University, 6-3-09 Aoba, Aramaki-aza Aoba-ku, Sendai, 980-8579, Japan
| | - Kengo Kinoshita
- Graduate School of Information Sciences, Tohoku University, 6-3-09 Aoba, Aramaki-aza Aoba-ku, Sendai, 980-8579, Japan
| | - Hirofumi Suzuki
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Takeshi Kawabata
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Masashi Yokochi
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Takeshi Iwata
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Naohiro Kobayashi
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Toshimichi Fujiwara
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Genji Kurisu
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| | - Haruki Nakamura
- Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka, 565-0871, Japan
| |
|
13
|
MMTF-An efficient file format for the transmission, visualization, and analysis of macromolecular structures. PLoS Comput Biol 2017; 13:e1005575. [PMID: 28574982 PMCID: PMC5473584 DOI: 10.1371/journal.pcbi.1005575] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 04/06/2017] [Revised: 06/16/2017] [Accepted: 05/16/2017] [Indexed: 11/19/2022]
Abstract
Recent advances in experimental techniques have led to a rapid growth in complexity, size, and number of macromolecular structures that are made available through the Protein Data Bank. This creates a challenge for macromolecular visualization and analysis. Macromolecular structure files, such as PDB or PDBx/mmCIF files can be slow to transfer, parse, and hard to incorporate into third-party software tools. Here, we present a new binary and compressed data representation, the MacroMolecular Transmission Format, MMTF, as well as software implementations in several languages that have been developed around it, which address these issues. We describe the new format and its APIs and demonstrate that it is several times faster to parse, and about a quarter of the file size of the current standard format, PDBx/mmCIF. As a consequence of the new data representation, it is now possible to visualize structures with millions of atoms in a web browser, keep the whole PDB archive in memory or parse it within few minutes on average computers, which opens up a new way of thinking how to design and implement efficient algorithms in structural bioinformatics. The PDB archive is available in MMTF file format through web services and data that are updated on a weekly basis.
|