1
Klukowski P, Damberger FF, Allain FHT, Iwai H, Kadavath H, Ramelot TA, Montelione GT, Riek R, Güntert P. The 100-protein NMR spectra dataset: A resource for biomolecular NMR data analysis. Sci Data 2024; 11:30. [PMID: 38177162 PMCID: PMC10767026 DOI: 10.1038/s41597-023-02879-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/22/2023] [Indexed: 01/06/2024]
Abstract
Multidimensional NMR spectra are the basis for studying proteins by NMR spectroscopy and crucial for the development and evaluation of methods for biomolecular NMR data analysis. Nevertheless, in contrast to derived data such as chemical shift assignments in the BMRB and protein structures in the PDB databases, this primary data is in general not publicly archived. To change this unsatisfactory situation, we present a standardized set of solution NMR data comprising 1329 2-4-dimensional NMR spectra and associated reference (chemical shift assignments, structures) and derived (peak lists, restraints for structure calculation, etc.) annotations. With the 100-protein NMR spectra dataset that was originally compiled for the development of the ARTINA deep learning-based spectra analysis method, 100 protein structures can be reproduced from their original experimental data. The 100-protein NMR spectra dataset is expected to help the development of computational methods for NMR spectroscopy, in particular machine learning approaches, and enable consistent and objective comparisons of these methods.
Affiliation(s)
- Piotr Klukowski
  - Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland
- Fred F Damberger
  - Institute of Biochemistry, ETH Zurich, 8093, Zurich, Switzerland
- Hideo Iwai
  - Institute of Biotechnology, University of Helsinki, 00100, Helsinki, Finland
- Theresa A Ramelot
  - Department of Chemistry and Chemical Biology, and Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
- Gaetano T Montelione
  - Department of Chemistry and Chemical Biology, and Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY, 12180, USA
- Roland Riek
  - Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland
- Peter Güntert
  - Institute of Molecular Physical Science, ETH Zurich, 8093, Zurich, Switzerland
  - Institute of Biophysical Chemistry, Goethe University, 60438, Frankfurt am Main, Germany
  - Department of Chemistry, Tokyo Metropolitan University, Hachioji, 192-0397, Tokyo, Japan
2
Fraga KJ, Huang YJ, Ramelot TA, Swapna GVT, Lashawn Anak Kendary A, Li E, Korf I, Montelione GT. SpecDB: A relational database for archiving biomolecular NMR spectral data. Journal of Magnetic Resonance 2022; 342:107268. [PMID: 35930941 PMCID: PMC9922030 DOI: 10.1016/j.jmr.2022.107268] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 01/25/2022] [Revised: 06/16/2022] [Accepted: 07/06/2022] [Indexed: 05/11/2023]
Abstract
NMR is a valuable experimental tool in the structural biologist's toolkit to elucidate the structures, functions, and motions of biomolecules. The progress of machine learning, particularly in structural biology, reveals the critical importance of large, diverse, and reliable datasets in developing new methods and understanding in structural biology and science more broadly. Biomolecular NMR research groups produce large amounts of data, and there is renewed interest in organizing these data to train new, sophisticated machine learning architectures and to improve biomolecular NMR analysis pipelines. The foundational data type in NMR is the free-induction decay (FID). There are opportunities to build sophisticated machine learning methods to tackle long-standing problems in NMR data processing, resonance assignment, dynamics analysis, and structure determination using NMR FIDs. Our goal in this study is to provide a lightweight, broadly available tool for archiving FID data as it is generated at the spectrometer, and grow a new resource of FID data and associated metadata. This study presents a relational schema for storing and organizing the metadata items that describe an NMR sample and FID data, which we call Spectral Database (SpecDB). SpecDB is implemented in SQLite and includes a Python software library providing a command-line application to create, organize, query, backup, share, and maintain the database. This set of software tools and database schema allow users to store, organize, share, and learn from NMR time domain data. SpecDB is freely available under an open source license at https://github.rpi.edu/RPIBioinformatics/SpecDB.
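To make the idea of a relational schema for sample and FID metadata concrete, the following is a minimal sketch using Python's built-in `sqlite3` module. The table names (`sample`, `fid`), their columns, the example values, and the helper functions are assumptions made for illustration only; they are not SpecDB's actual schema or API, which are documented in the SpecDB repository.

```python
import sqlite3

# Hypothetical two-table layout for illustration; NOT SpecDB's actual schema.
SCHEMA = """
CREATE TABLE sample (
    id           INTEGER PRIMARY KEY,
    protein_name TEXT NOT NULL,
    buffer       TEXT
);
CREATE TABLE fid (
    id         INTEGER PRIMARY KEY,
    sample_id  INTEGER NOT NULL REFERENCES sample(id),
    experiment TEXT NOT NULL,   -- e.g. an experiment label such as "HSQC"
    path       TEXT NOT NULL    -- location of the raw time-domain data
);
"""


def build_demo_db():
    """Create an in-memory database holding one sample and two FID records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the sample/fid link
    conn.executescript(SCHEMA)
    cur = conn.execute(
        "INSERT INTO sample (protein_name, buffer) VALUES (?, ?)",
        ("demo_protein", "phosphate pH 6.5"),
    )
    sample_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO fid (sample_id, experiment, path) VALUES (?, ?, ?)",
        [
            (sample_id, "HSQC", "fids/hsqc/ser"),
            (sample_id, "HNCA", "fids/hnca/ser"),
        ],
    )
    conn.commit()
    return conn


def fids_for_protein(conn, protein_name):
    """Return (experiment, path) rows for every FID recorded for a protein."""
    return conn.execute(
        "SELECT f.experiment, f.path "
        "FROM fid AS f JOIN sample AS s ON s.id = f.sample_id "
        "WHERE s.protein_name = ? "
        "ORDER BY f.experiment",
        (protein_name,),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for experiment, path in fids_for_protein(demo, "demo_protein"):
        print(experiment, path)
```

Keeping sample metadata in its own table and linking each FID record to it by foreign key means the sample description is stored once, and a single join retrieves every time-domain data set recorded for a given sample.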
Affiliation(s)
- Keith J Fraga
  - Department of Molecular and Cellular Biology, University of California, Davis, CA 95616, USA
- Yuanpeng J Huang
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- Theresa A Ramelot
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- G V T Swapna
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
  - Department of Pharmacology, Robert Wood Johnson Medical School, Rutgers The State University of New Jersey, Piscataway, NJ 08854, USA
- Ethan Li
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
- Ian Korf
  - Department of Molecular and Cellular Biology, University of California, Davis, CA 95616, USA
- Gaetano T Montelione
  - Department of Chemistry and Chemical Biology, Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
3
Gryk MR, Ludäscher B. Workflows and Provenance: Toward Information Science Solutions for the Natural Sciences. Library Trends 2018; 65:555-562. [PMID: 29375158 DOI: 10.1353/lib.2017.0018] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Abstract
The era of big data and ubiquitous computation has brought with it concerns about ensuring reproducibility in this new research environment. It is easy to assume computational methods self-document by their very nature of being exact, deterministic processes. However, similar to laboratory experiments, ensuring reproducibility in the computational realm requires the documentation of both the protocols used (workflows) as well as a detailed description of the computational environment: algorithms, implementations, software environments as well as the data ingested and execution logs of the computation. These two aspects of computational reproducibility (workflows and execution details) are discussed in the context of biomolecular Nuclear Magnetic Resonance spectroscopy (bioNMR) as well as the PRIMAD model for computational reproducibility.
Affiliation(s)
- Michael R Gryk
  - Department of Molecular Biology and Biophysics, UCONN Health, 263 Farmington Avenue, Farmington, CT 06030-3305, USA
  - School of Information Sciences, University of Illinois, Urbana-Champaign, 501 East Daniel Street, Champaign, IL 61820-6211, USA
- Bertram Ludäscher
  - School of Information Sciences, University of Illinois, Urbana-Champaign, 501 East Daniel Street, Champaign, IL 61820-6211, USA
4
Heintz D, Gryk MR. Curating Scientific Workflows for Biomolecular Nuclear Magnetic Resonance Spectroscopy. International Journal of Digital Curation 2018; 13:286-293. [PMID: 31061674 PMCID: PMC6499392 DOI: 10.2218/ijdc.v13i1.657] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/21/2022]
Abstract
This paper describes our recent and ongoing efforts for enhancing the curation of scientific workflows to improve reproducibility and reusability of biomolecular nuclear magnetic resonance (bioNMR) data. Our efforts have focused on both developing a workflow management system, called CONNJUR Workflow Builder (CWB), as well as refactoring our workflow data model to make use of the PREMIS model for digital preservation. This revised workflow management system will be available through the NMRbox cloud-computing platform for bioNMR. In addition, we are implementing a new file structure which bundles the original binary data files along with PREMIS XML records describing the provenance of the data. These are packaged together using a standardized file archive utility. In this manner, the provenance and data curation information is maintained together along with the scientific data. The benefits and limitations of these approaches as well as future directions are discussed.
5
Maciejewski MW, Schuyler AD, Gryk MR, Moraru II, Romero PR, Ulrich EL, Eghbalnia HR, Livny M, Delaglio F, Hoch JC. NMRbox: A Resource for Biomolecular NMR Computation. Biophys J 2017; 112:1529-1534. [PMID: 28445744 DOI: 10.1016/j.bpj.2017.03.011] [Citation(s) in RCA: 274] [Impact Index Per Article: 39.1] [Received: 01/06/2017] [Revised: 03/06/2017] [Accepted: 03/13/2017] [Indexed: 10/19/2022]
Abstract
Advances in computation have been enabling many recent advances in biomolecular applications of NMR. Due to the wide diversity of applications of NMR, the number and variety of software packages for processing and analyzing NMR data is quite large, with labs relying on dozens, if not hundreds of software packages. Discovery, acquisition, installation, and maintenance of all these packages is a burdensome task. Because the majority of software packages originate in academic labs, persistence of the software is compromised when developers graduate, funding ceases, or investigators turn to other projects. To simplify access to and use of biomolecular NMR software, foster persistence, and enhance reproducibility of computational workflows, we have developed NMRbox, a shared resource for NMR software and computation. NMRbox employs virtualization to provide a comprehensive software environment preconfigured with hundreds of software packages, available as a downloadable virtual machine or as a Platform-as-a-Service supported by a dedicated compute cloud. Ongoing development includes a metadata harvester to regularize, annotate, and preserve workflows and facilitate and enhance data depositions to BioMagResBank, and tools for Bayesian inference to enhance the robustness and extensibility of computational analyses. In addition to facilitating use and preservation of the rich and dynamic software environment for biomolecular NMR, NMRbox fosters the development and deployment of a new class of metasoftware packages. NMRbox is freely available to not-for-profit users.
Affiliation(s)
- Mark W Maciejewski
  - Department of Molecular Biology and Biophysics, UConn Health, Farmington, Connecticut
- Adam D Schuyler
  - Department of Molecular Biology and Biophysics, UConn Health, Farmington, Connecticut
- Michael R Gryk
  - Department of Molecular Biology and Biophysics, UConn Health, Farmington, Connecticut
- Ion I Moraru
  - Department of Cell Biology, UConn Health, Farmington, Connecticut
- Pedro R Romero
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin
- Eldon L Ulrich
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin
- Hamid R Eghbalnia
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, Wisconsin
- Miron Livny
  - Computer Sciences Department, University of Wisconsin-Madison, Madison, Wisconsin
- Frank Delaglio
  - Institute for Bioscience and Biotechnology Research, National Institute of Standards and Technology and the University of Maryland, Rockville, Maryland
- Jeffrey C Hoch
  - Department of Molecular Biology and Biophysics, UConn Health, Farmington, Connecticut
6
Lee W, Cornilescu G, Dashti H, Eghbalnia HR, Tonelli M, Westler WM, Butcher SE, Henzler-Wildman KA, Markley JL. Integrative NMR for biomolecular research. Journal of Biomolecular NMR 2016; 64:307-32. [PMID: 27023095 PMCID: PMC4861749 DOI: 10.1007/s10858-016-0029-x] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 02/15/2016] [Accepted: 03/21/2016] [Indexed: 05/05/2023]
Abstract
NMR spectroscopy is a powerful technique for determining structural and functional features of biomolecules in physiological solution as well as for observing their intermolecular interactions in real-time. However, complex steps associated with its practice have made the approach daunting for non-specialists. We introduce an NMR platform that makes biomolecular NMR spectroscopy much more accessible by integrating tools, databases, web services, and video tutorials that can be launched by simple installation of NMRFAM software packages or using a cross-platform virtual machine that can be run on any standard laptop or desktop computer. The software package can be downloaded freely from the NMRFAM software download page (http://pine.nmrfam.wisc.edu/download_packages.html), and detailed instructions are available from the Integrative NMR Video Tutorial page (http://pine.nmrfam.wisc.edu/integrative.html).
Affiliation(s)
- Woonghee Lee
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Gabriel Cornilescu
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Hesam Dashti
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Hamid R Eghbalnia
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Marco Tonelli
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- William M Westler
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Samuel E Butcher
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- Katherine A Henzler-Wildman
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA
- John L Markley
  - National Magnetic Resonance Facility at Madison and Biochemistry Department, University of Wisconsin-Madison, Madison, WI, 53706, USA