1
Abstract
Background: Single-cell RNA sequencing has become a powerful tool for identifying cell states, reconstructing developmental trajectories, and deconvoluting spatial expression. The rapid development of computational methods has deepened insight into heterogeneous single-cell data, and an increasing number of tools are available to biological analysts, most written in one of two widely used programming languages, R and Python. The two ecosystems are complementary, as many methods are implemented only in R or only in Python. However, working across platforms creates problems of data sharing and conversion, especially among Scanpy, Seurat, and SingleCellExperiment. Currently, there is no efficient and user-friendly software for converting single-cell omics data between platforms, so users spend excessive time on data input and output (IO), significantly reducing the efficiency of analysis.
Results: We developed scDIOR for single-cell data conversion between the R and Python platforms, based on Hierarchical Data Format version 5 (HDF5). It establishes a data IO ecosystem connecting three R packages (Seurat, SingleCellExperiment, Monocle) and a Python package (Scanpy). Importantly, scDIOR accommodates a variety of data types across languages and platforms in an ultrafast way, including single-cell RNA-seq and spatially resolved transcriptomics data, using only a few lines of code in an IDE or on the command line. For large-scale datasets, users can partially load only the needed information, e.g., cell annotations without the gene expression matrices. By connecting analytical tasks on different platforms, scDIOR also makes it easy to compare the performance of algorithms between them.
Conclusions: scDIOR contains two modules, dior in R and diopy in Python. It is a versatile and user-friendly tool that converts single-cell data between R and Python rapidly and stably.
The software is freely accessible at https://github.com/JiekaiLab/scDIOR.
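The partial-loading behaviour described in the abstract rests on HDF5's ability to read individual datasets without touching the rest of the file. As a minimal sketch (the group and dataset names below are hypothetical, not scDIOR's actual on-disk layout), plain h5py can pull the cell annotations out of a toy file while leaving the expression matrix on disk:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.h5")

# Write a toy single-cell file. The names "X" and "obs" are
# illustrative only, not the real scDIOR schema.
with h5py.File(path, "w") as f:
    f.create_dataset("X", data=np.zeros((1000, 2000)))  # expression matrix
    obs = f.create_group("obs")                         # per-cell annotations
    obs.create_dataset("cell_type", data=np.array([b"T cell", b"B cell"] * 500))

# Partial loading: read only the annotations, leaving the
# (potentially huge) expression matrix untouched on disk.
with h5py.File(path, "r") as f:
    cell_types = f["obs/cell_type"][:]

print(len(cell_types))  # 1000
```

Because HDF5 datasets are addressed individually, the cost of this read scales with the annotation column, not with the matrix.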
Affiliation(s)
- Huijian Feng
- Center for Cell Lineage and Atlas, Bioland Laboratory (Guangzhou Regenerative Medicine and Health Guangdong Laboratory), Guangzhou, 510005, People's Republic of China; CAS Key Laboratory of Regenerative Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, People's Republic of China
- Lihui Lin
- CAS Key Laboratory of Regenerative Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, People's Republic of China
- Jiekai Chen
- Center for Cell Lineage and Atlas, Bioland Laboratory (Guangzhou Regenerative Medicine and Health Guangdong Laboratory), Guangzhou, 510005, People's Republic of China; CAS Key Laboratory of Regenerative Biology, Guangdong Provincial Key Laboratory of Stem Cell and Regenerative Medicine, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, 510530, People's Republic of China; Joint School of Life Sciences, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou Medical University, Guangzhou, 511436, People's Republic of China; Centre for Regenerative Medicine and Health, Hong Kong Institute of Science and Innovation, Chinese Academy of Sciences, Hong Kong SAR, People's Republic of China
2
Bhamber RS, Jankevics A, Deutsch EW, Jones AR, Dowsey AW. mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. J Proteome Res 2021; 20:172-183. [PMID: 32864978 PMCID: PMC7871438 DOI: 10.1021/acs.jproteome.0c00192]
Abstract
With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise extensible markup language (XML) representation for data interchange, mzML, which has received substantial uptake; nevertheless, storage and file access efficiency have not been its main focus. We propose an HDF5 file format, "mzMLb", that is optimized for both read/write speed and storage of the raw mass spectrometry data. We provide an extensive validation of write speed, random read speed, and storage size, demonstrating a flexible format that, with or without compression, is faster than all existing approaches in virtually all cases, and that with compression is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb's design adheres to both the HDF5 and NetCDF4 standard implementations, which allows it to be easily utilized by third parties thanks to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.
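The core idea described here — keeping the mzML XML metadata verbatim while moving the bulky numeric arrays into chunked, compressed binary datasets — can be sketched with plain h5py. The dataset names below mirror the spirit of mzMLb but are not the published schema:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "toy_mzmlb.h5")

# Hypothetical miniature: the XML metadata is stored byte-for-byte as a
# uint8 dataset, while m/z and intensity arrays become chunked,
# gzip-compressed binary datasets that support fast random access.
xml_metadata = b'<mzML><run id="r1"><spectrumList count="1"/></run></mzML>'
mz = np.linspace(100.0, 2000.0, 10000)
intensity = np.abs(np.sin(mz))

with h5py.File(path, "w") as f:
    f.create_dataset("mzML", data=np.frombuffer(xml_metadata, dtype=np.uint8))
    f.create_dataset("spectrum_mz", data=mz, chunks=True, compression="gzip")
    f.create_dataset("spectrum_intensity", data=intensity,
                     chunks=True, compression="gzip")

with h5py.File(path, "r") as f:
    xml_back = f["mzML"][:].tobytes()     # XML survives byte-for-byte
    mz_slice = f["spectrum_mz"][100:200]  # random access without a full read

print(xml_back == xml_metadata, len(mz_slice))  # True 100
```

Storing the XML unmodified is what lets such a container track future mzML revisions without a schema change.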
Affiliation(s)
- Ranjeet S. Bhamber
- Department of Population Health Sciences and Bristol Veterinary School, University of Bristol, Bristol BS8 2BN, United Kingdom
- Andris Jankevics
- School of Biosciences and Phenome Centre Birmingham, University of Birmingham, Birmingham B15 2TT, United Kingdom
- Eric W. Deutsch
- Institute for Systems Biology, Seattle, Washington 98109, United States
- Andrew R. Jones
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, United Kingdom
- Andrew W. Dowsey
- Department of Population Health Sciences and Bristol Veterinary School, University of Bristol, Bristol BS8 2BN, United Kingdom
3
Ježek P, Teeters JL, Sommer FT. NWB Query Engines: Tools to Search Data Stored in Neurodata Without Borders Format. Front Neuroinform 2020; 14:27. [PMID: 33041776 PMCID: PMC7526650 DOI: 10.3389/fninf.2020.00027]
Abstract
The Neurodata Without Borders (NWB) format is a current technology for storing neurophysiology data along with the associated metadata. Data stored in the format are organized into separate HDF5 files, each usually storing the data associated with a single recording session. While the NWB format provides a structured method for storing data, until now there have been no tools for searching a collection of NWB files to find data of interest for a particular purpose. We describe here three tools that enable searching NWB files; each has different features making it most useful for a particular task. The first tool, the NWB Query Engine, is written in Java and allows searching the complete content of NWB files. It was designed for the first version of NWB (NWB 1) and supports most (but not all) features of the most recent version (NWB 2); for some searches, it is the fastest tool. The second tool, "search_nwb", is written in Python and also allows searching the complete contents of NWB files; it works with both NWB 1 and NWB 2, as does the third tool. The third tool, "nwbindexer", enables searching a collection of NWB files using a two-step process. In the first step, a utility creates an SQLite database containing the metadata from a collection of NWB files; in the second step, this database is searched using another utility. Once the index is built, this two-step process allows faster searches than the other tools, but does not support as complete a search. All three tools use a simple query language developed for this project. Software integrating the three tools into a web interface is provided, enabling NWB files to be searched by submitting a web form.
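The two-step scheme behind "nwbindexer" — index once into SQLite, then query the index — can be illustrated with the standard library alone. The table schema and the example metadata rows below are invented for illustration; the real tool extracts them from NWB/HDF5 files:

```python
import sqlite3

# Step 1: an indexing utility walks a collection of files and records
# (file, group path, attribute, value) rows in an SQLite database.
# These rows are made up; the real tool reads them out of NWB files.
extracted = [
    ("session1.nwb", "/general/subject", "species", "mouse"),
    ("session1.nwb", "/acquisition/lfp", "rate", "30000"),
    ("session2.nwb", "/general/subject", "species", "rat"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta (file TEXT, path TEXT, attr TEXT, value TEXT)")
db.executemany("INSERT INTO meta VALUES (?, ?, ?, ?)", extracted)

# Step 2: searches hit the small index instead of reopening every HDF5
# file, which is why repeated queries are fast once the index is built.
rows = db.execute(
    "SELECT file FROM meta WHERE attr = 'species' AND value = 'mouse'"
).fetchall()
print(rows)  # [('session1.nwb',)]
```

The trade-off the abstract notes follows directly: only what the indexer chose to extract is searchable, so the index is faster but less complete than scanning the files themselves.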
Affiliation(s)
- Petr Ježek
- Faculty of Applied Sciences, New Technologies for the Information Society, University of West Bohemia, Plzeň, Czechia
- Jeffery L Teeters
- Redwood Center for Theoretical Neuroscience & Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA, United States
- Friedrich T Sommer
- Redwood Center for Theoretical Neuroscience & Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA, United States
4
Bernstein HJ, Förster A, Bhowmick A, Brewster AS, Brockhauser S, Gelisio L, Hall DR, Leonarski F, Mariani V, Santoni G, Vonrhein C, Winter G. Gold Standard for macromolecular crystallography diffraction data. IUCrJ 2020; 7:784-792. [PMID: 32939270 PMCID: PMC7467160 DOI: 10.1107/s2052252520008672]
Abstract
Macromolecular crystallography (MX) is the dominant means of determining the three-dimensional structures of biological macromolecules. Over the last few decades, most MX data have been collected at synchrotron beamlines using a large number of different detectors produced by various manufacturers and taking advantage of various protocols and goniometries. These data came in their own formats: sometimes proprietary, sometimes open. The associated metadata rarely reached the degree of completeness required for data management according to Findability, Accessibility, Interoperability and Reusability (FAIR) principles. Efforts to reuse old data by other investigators or even by the original investigators some time later were often frustrated. In the culmination of an effort dating back more than two decades, a large portion of the research community concerned with high data-rate macromolecular crystallography (HDRMX) has now agreed to an updated specification of data and metadata for diffraction images produced at synchrotron light sources and X-ray free-electron lasers (XFELs). This 'Gold Standard' will facilitate the processing of data sets independent of the facility at which they were collected and enable data archiving according to FAIR principles, with a particular focus on interoperability and reusability. This agreed standard builds on the NeXus/HDF5 NXmx application definition and the International Union of Crystallography (IUCr) imgCIF/CBF dictionary, and it is compatible with major data-processing programs and pipelines. Just as with the IUCr CBF/imgCIF standard from which it arose and to which it is tied, the NeXus/HDF5 NXmx Gold Standard application definition is intended to be applicable to all detectors used for crystallography, and all hardware and software developers in the field are encouraged to adopt and contribute to the standard.
Affiliation(s)
- Herbert J. Bernstein
- Ronin Institute for Independent Scholarship, c/o NSLS II, Brookhaven National Laboratory, Upton, New York, USA
- Asmit Bhowmick
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Aaron S. Brewster
- Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Sandor Brockhauser
- European XFEL GmbH, Holzkoppel 4, 22869 Schenefeld, Germany
- Biological Research Centre Szeged (BRC), Temesvári krt. 62, 6726 Szeged, Hungary
- University of Szeged, Arpad ter 2, 6720 Szeged, Hungary
- Luca Gelisio
- Center for Free-Electron Laser Science, Notkestrasse 85, 22607 Hamburg, Germany
- David R. Hall
- Diamond Light Source Ltd, Harwell Science and Innovation Campus, Didcot OX11 0DE, United Kingdom
- Filip Leonarski
- Swiss Light Source, Paul Scherrer Institut, Forschungsstrasse 111, 5232 Villigen PSI, Switzerland
- Valerio Mariani
- Center for Free-Electron Laser Science, Notkestrasse 85, 22607 Hamburg, Germany
- Gianluca Santoni
- Structural Biology Group, European Synchrotron Radiation Facility, 71 Avenue des Martyrs, 38000 Grenoble, France
- Clemens Vonrhein
- Global Phasing Ltd, Sheraton House, Castle Park, Cambridge CB3 0AX, United Kingdom
- Graeme Winter
- Diamond Light Source Ltd, Harwell Science and Innovation Campus, Didcot OX11 0DE, United Kingdom
5
Knopp T, Szwargulski P, Griese F, Gräser M. OpenMPIData: An initiative for freely accessible magnetic particle imaging data. Data Brief 2019; 28:104971. [PMID: 31890809 PMCID: PMC6928334 DOI: 10.1016/j.dib.2019.104971]
Abstract
Magnetic particle imaging (MPI) is a tomographic imaging technique capable of measuring the local concentration of magnetic nanoparticles, which can be used as tracers in biomedical applications. Since MPI is still at a very early stage of development, there are only a few MPI systems worldwide, primarily operated by the technical research groups that developed them. It is therefore difficult for researchers without direct access to an MPI system to obtain experimental MPI data. The purpose of the OpenMPIData initiative is to make experimental MPI data freely accessible via a web platform. Measurements are performed with multiple phantoms and different image sequences from 1D to 3D. The datasets are stored in the magnetic particle imaging data format (MDF), an open document standard for storing MPI data. The open data are mainly intended for mathematicians and algorithm developers working on new reconstruction algorithms; each dataset is designed to pose a specific challenge to image reconstruction. In addition to the measurement data, computer-aided design (CAD) drawings of the phantoms are provided so that the exact dimensions of the particle concentrations are known. Thus, the phantoms can be reproduced by other research groups using additive manufacturing, and the reproduced phantoms can be used to compare different MPI systems.
Affiliation(s)
- Tobias Knopp
- Section for Biomedical Imaging, University Medical Center Hamburg-Eppendorf, Germany; Institute for Biomedical Imaging, Hamburg University of Technology, Germany
- Patryk Szwargulski
- Section for Biomedical Imaging, University Medical Center Hamburg-Eppendorf, Germany; Institute for Biomedical Imaging, Hamburg University of Technology, Germany
- Florian Griese
- Section for Biomedical Imaging, University Medical Center Hamburg-Eppendorf, Germany; Institute for Biomedical Imaging, Hamburg University of Technology, Germany
- Matthias Gräser
- Section for Biomedical Imaging, University Medical Center Hamburg-Eppendorf, Germany; Institute for Biomedical Imaging, Hamburg University of Technology, Germany
6
Tritt AJ, Rübel O, Dichter B, Ly R, Kang D, Chang EF, Frank LM, Bouchard K. HDMF: Hierarchical Data Modeling Framework for Modern Science Data Standards. Proc IEEE Int Conf Big Data 2019; 2019:165-179. [PMID: 34632466 PMCID: PMC8500680 DOI: 10.1109/bigdata47090.2019.9005648]
Abstract
A ubiquitous problem in aggregating data across different experimental and observational data sources is a lack of software infrastructure that enables flexible and extensible standardization of data and metadata. To address this challenge, we developed HDMF, a hierarchical data modeling framework for modern science data standards. With HDMF, we separate the process of data standardization into three main components: (1) data modeling and specification, (2) data I/O and storage, and (3) data interaction and data APIs. To enable standards to support the complex requirements and varying use cases throughout the data life cycle, HDMF provides object mapping infrastructure to insulate and integrate these various components. This approach supports the flexible development of data standards and extensions, optimized storage backends, and data APIs, while allowing the other components of the data standards ecosystem to remain stable. To meet the demands of modern, large-scale science data, HDMF provides advanced data I/O functionality for iterative data write, lazy data load, and parallel I/O. It also supports optimization of data storage via support for chunking, compression, linking, and modular data storage. We demonstrate the application of HDMF in practice to design NWB 2.0 [13], a modern data standard for collaborative science across the neurophysiology community.
Affiliation(s)
- Andrew J Tritt
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Oliver Rübel
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Benjamin Dichter
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Ryan Ly
- Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Donghe Kang
- Computer Science and Engineering, Ohio State University, Columbus, OH, USA
- Edward F Chang
- Department of Neurological Surgery and the Center for Integrative Neuroscience, University of California, San Francisco, San Francisco, CA, USA
- Loren M Frank
- Howard Hughes Medical Institute, Kavli Institute for Fundamental Neuroscience, Department of Physiology, University of California, San Francisco, San Francisco, CA
- Kristofer Bouchard
- Biological Systems and Engineering, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
7
Gopaulakrishnan S, Pollack S, Stubbs BJ, Pagès H, Readey J, Davis S, Waldron L, Morgan M, Carey V. restfulSE: A semantically rich interface for cloud-scale genomics with Bioconductor. F1000Res 2019; 8:21. [PMID: 30828438 PMCID: PMC6392152 DOI: 10.12688/f1000research.17518.1]
Abstract
Bioconductor's SummarizedExperiment class unites numerical assay quantifications with sample- and experiment-level metadata. SummarizedExperiment is the standard Bioconductor class for assays that produce matrix-like data, used by over 200 packages. We describe the restfulSE package, a deployment of this data model that supports remote storage. We illustrate use of SummarizedExperiment with remote HDF5 and Google BigQuery back ends, with two applications in cancer genomics. Our intent is to allow the use of familiar and semantically meaningful programmatic idioms to query genomic data, while abstracting the remote interface from end users and developers.
Affiliation(s)
- Shweta Gopaulakrishnan
- Channing Division of Network Medicine, Harvard Medical School, Boston, Massachusetts, 02115, USA
- Samuela Pollack
- Channing Division of Network Medicine, Harvard Medical School, Boston, Massachusetts, 02115, USA; Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, 02115, USA
- B J Stubbs
- Channing Division of Network Medicine, Harvard Medical School, Boston, Massachusetts, 02115, USA
- Hervé Pagès
- Fred Hutchinson Cancer Research Center, Seattle, Washington, 98109, USA
- John Readey
- Tools and Cloud Technology, HDF Group, Seattle, WA, 98109, USA
- Sean Davis
- Center for Cancer Research, National Cancer Institute, Bethesda, Maryland, 20892, USA
- Levi Waldron
- Epidemiology and Biostatistics, CUNY School of Public Health, New York, New York, 10027, USA
- Martin Morgan
- Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, New York, 14203, USA
- Vincent Carey
- Channing Division of Network Medicine, Harvard Medical School, Boston, Massachusetts, 02115, USA
8
Cabeleira M, Ercole A, Smielewski P. HDF5-Based Data Format for Archiving Complex Neuro-monitoring Data in Traumatic Brain Injury Patients. Acta Neurochir Suppl 2018; 126:121-125. [PMID: 29492546 DOI: 10.1007/978-3-319-65798-1_26]
Abstract
OBJECTIVES: Modern neuro-critical care units generate high volumes of data. These data originate from a multitude of devices in various formats and at various levels of granularity. We present a new data format intended to store these data in an ordered and homogeneous way.
MATERIAL AND METHODS: The adopted data format is based on the hierarchical model HDF5, which handles a mixture of small and very large datasets with equal ease. Individual data elements can be accessed and manipulated directly within a single file, and the format is extensible and versatile.
RESULTS: The agreed file structure divides the patient data into four groups: 'Annotations' for clinical events and sporadic observations, 'Numerics' for all the low-frequency data, 'Waves' for all the high-frequency data, and 'Summaries' for the trend data and calculated parameters. The addition of attributes to every group and dataset makes the file self-describing. More than 200 files have been successfully collected and stored using this format.
CONCLUSION: The new file format was implemented in the ICM+ software and validated as part of a collaboration with participating centres across Europe.
9
Abstract
A major challenge in experimental data analysis is the validation of analytical methods in a fully controlled scenario, where the justification of the interpretation can be made directly and not just by plausibility. In some sciences this could be a mathematical proof, yet biological systems usually do not satisfy the assumptions of mathematical theorems. One solution is to use simulations of realistic models to generate ground-truth data. In neuroscience, creating such data requires plausible models of neural activity, access to high-performance computers, and the expertise and time to prepare and run the simulations and to process the output. To facilitate such validation tests of analytical methods, we provide rich datasets including intracellular voltage traces, transmembrane currents, morphologies, and spike times. Moreover, these data can be used to study the effects of different tissue models on the measurement. The data were generated using the largest publicly available multicompartmental model of the thalamocortical network (Traub et al., Journal of Neurophysiology, 93(4), 2194-2232, 2005), with activity evoked by different thalamic stimuli.
Affiliation(s)
- Helena Głąbska
- Department of Neurophysiology, Nencki Institute of Experimental Biology of Polish Academy of Sciences, Warsaw, Poland
- Chaitanya Chintaluri
- Department of Neurophysiology, Nencki Institute of Experimental Biology of Polish Academy of Sciences, Warsaw, Poland
- Daniel K Wójcik
- Department of Neurophysiology, Nencki Institute of Experimental Biology of Polish Academy of Sciences, Warsaw, Poland
10
Askenazi M, Ben Hamidane H, Graumann J. The arc of Mass Spectrometry Exchange Formats is long, but it bends toward HDF5. Mass Spectrom Rev 2017; 36:668-673. [PMID: 27741559 PMCID: PMC6088231 DOI: 10.1002/mas.21522]
Abstract
The evolution of data exchange in Mass Spectrometry spans decades and has ranged from human-readable text files representing individual scans or collections thereof (McDonald et al., 2004), through the official XML-based (Harold, Means, & Udemadu, 2005) data interchange standard (Deutsch, 2012), to increasingly compressed (Teleman et al., 2014) variants of this standard, sometimes requiring purely binary adjunct files (Römpp et al., 2011). While the desire to maintain even partial human readability is understandable, the inherent mismatch between XML's textual and irregular format and the numeric, highly regular nature of actual spectral data, along with the explosive growth in dataset scales and the resulting need for efficient (binary and indexed) access, has led to a phenomenon referred to as "technical drift" (Davis, 2013). While the drift is being continuously corrected using adjunct formats, compression schemes, and programs (Röst et al., 2015), we propose that the future of Mass Spectrometry Exchange Formats lies in continued reliance on and development of the PSI-MS (Mayer et al., 2014) controlled vocabulary, along with an expedited shift to an alternative, thriving and well-supported ecosystem for scientific data exchange, storage, and access in binary form, namely that of HDF5 (Koranne, 2011). Indeed, pioneering efforts to leverage this universal, binary, and hierarchical data format have already been published (Wilhelm et al., 2012; Rübel et al., 2013), though they have under-utilized self-description, a key property shared by HDF5 and XML. We demonstrate that a straightforward usage of plain ("vanilla") HDF5 yields immediate returns including, but not limited to, highly efficient data access, platform-independent data viewers, a variety of libraries (Collette, 2014) for data retrieval and manipulation in many programming languages, and remote data access through comprehensive RESTful data servers.
11
Vincent RD, Neelin P, Khalili-Mahani N, Janke AL, Fonov VS, Robbins SM, Baghdadi L, Lerch J, Sled JG, Adalat R, MacDonald D, Zijdenbos AP, Collins DL, Evans AC. MINC 2.0: A Flexible Format for Multi-Modal Images. Front Neuroinform 2016; 10:35. [PMID: 27563289 PMCID: PMC4980430 DOI: 10.3389/fninf.2016.00035]
Abstract
It is often useful for an imaging data format to afford rich metadata, be flexible, scale to very large file sizes, support multi-modal data, and have strong inbuilt mechanisms for data provenance. Beginning in 1992, MINC was developed as a system for flexible, self-documenting representation of neuroscientific imaging data with arbitrary orientation and dimensionality. The MINC system incorporates three broad components: a file format specification, a programming library, and a growing set of tools. In the early 2000s, the MINC developers created MINC 2.0, which added support for 64-bit file sizes, internal compression, and a number of other modern features. Because of its extensible design, it has been easy to incorporate details of provenance in the header metadata, including an explicit processing history, unique identifiers, and vendor-specific scanner settings. This makes MINC ideal for use in large-scale imaging studies and databases. It also makes it easy to adapt to new scanning sequences and modalities.
Affiliation(s)
- Robert D Vincent
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
- Najmeh Khalili-Mahani
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
- Andrew L Janke
- Center for Advanced Imaging, The University of Queensland, Brisbane, QLD, Australia
- Vladimir S Fonov
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
- Steven M Robbins
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
- Leila Baghdadi
- Mouse Imaging Centre, The Hospital for Sick Children, Toronto, ON, Canada
- Jason Lerch
- Mouse Imaging Centre, The Hospital for Sick Children, Toronto, ON, Canada; Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- John G Sled
- Mouse Imaging Centre, The Hospital for Sick Children, Toronto, ON, Canada; Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Reza Adalat
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
- D Louis Collins
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada; Department of Biomedical Engineering, McGill University, Montreal, QC, Canada
- Alan C Evans
- McConnell Brain Imaging Centre, Montreal Neurological Institute, McGill University, Montreal, QC, Canada
12
Schmitz GJ, Böttger B, Apel M, Eiken J, Laschet G, Altenfeld R, Berger R, Boussinot G, Viardin A. Towards a metadata scheme for the description of materials - the description of microstructures. Sci Technol Adv Mater 2016; 17:410-430. [PMID: 27877892 PMCID: PMC5111567 DOI: 10.1080/14686996.2016.1194166]
Abstract
The property of any material is essentially determined by its microstructure. Numerical models are increasingly the focus of modern engineering as helpful tools for tailoring and optimization of custom-designed microstructures by suitable processing and alloy design. A huge variety of software tools is available to predict various microstructural aspects for different materials. In the general frame of an integrated computational materials engineering (ICME) approach, these microstructure models provide the link between models operating at the atomistic or electronic scales, and models operating on the macroscopic scale of the component and its processing. In view of an improved interoperability of all these different tools it is highly desirable to establish a standardized nomenclature and methodology for the exchange of microstructure data. The scope of this article is to provide a comprehensive system of metadata descriptors for the description of a 3D microstructure. The presented descriptors are limited to a mere geometric description of a static microstructure and have to be complemented by further descriptors, e.g. for properties, numerical representations, kinetic data, and others in the future. Further attributes to each descriptor, e.g. on data origin, data uncertainty, and data validity range are being defined in ongoing work. The proposed descriptors are intended to be independent of any specific numerical representation. The descriptors defined in this article may serve as a first basis for standardization and will simplify the data exchange between different numerical models, as well as promote the integration of experimental data into numerical models of microstructures. An HDF5 template data file for a simple, three phase Al-Cu microstructure being based on the defined descriptors complements this article.
Affiliation(s)
- Markus Apel: MICRESS® group at Access e.V., Aachen, Germany
- Janin Eiken: MICRESS® group at Access e.V., Aachen, Germany
- Ralf Berger: MICRESS® group at Access e.V., Aachen, Germany
13
Ingargiola A, Laurence T, Boutelle R, Weiss S, Michalet X. Photon-HDF5: Open Data Format and Computational Tools for Timestamp-based Single-Molecule Experiments. Proc SPIE Int Soc Opt Eng 2016; 9714:971405. PMID: 28649160; PMCID: PMC5479430; DOI: 10.1117/12.2212085.
Abstract
Archival of experimental data in public databases has increasingly become a requirement for most funding agencies and journals. These data-sharing policies have the potential to maximize data reuse, and to enable confirmatory as well as novel studies. However, the lack of standard data formats can severely hinder data reuse. In photon-counting-based single-molecule fluorescence experiments, data is stored in a variety of vendor-specific or even setup-specific (custom) file formats, making data interchange prohibitively laborious unless the same hardware-software combination is used. Moreover, the number of available techniques and setup configurations makes it difficult to find a common standard. To address this problem, we developed Photon-HDF5 (www.photon-hdf5.org), an open data format for timestamp-based single-molecule fluorescence experiments. Building on the solid foundation of HDF5, Photon-HDF5 provides a platform- and language-independent, easy-to-use file format that is self-describing and supports rich metadata. Photon-HDF5 supports different types of measurements by separating raw data (e.g. photon timestamps, detectors, etc.) from measurement metadata. This approach allows representing several measurement types and setup configurations within the same core structure, and makes it possible to extend the format in a backward-compatible way. Complementing the format specifications, we provide open-source software to create and convert Photon-HDF5 files, together with code examples in multiple languages showing how to read Photon-HDF5 files. Photon-HDF5 allows sharing data in a format suitable for long-term archival, avoiding the effort of documenting custom binary formats and increasing interoperability with different analysis software. We encourage participation of the single-molecule community to extend interoperability and to help define future versions of Photon-HDF5.
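The core design point of the abstract above, raw per-photon arrays kept separate from measurement metadata, can be mirrored in a small in-memory sketch. The group and field names below follow the Photon-HDF5 layout as commonly documented (photon_data, setup), but treat the details as illustrative rather than normative.

```python
# In-memory mirror of a Photon-HDF5-style layout: raw per-photon arrays
# under one group, measurement metadata under others.

photon_hdf5 = {
    "description": "example timestamp-based measurement (made-up values)",
    "photon_data": {
        "timestamps": [100, 250, 260, 480],  # raw arrival times, in clock ticks
        "detectors":  [0, 1, 1, 0],          # detector channel for each photon
        "timestamps_specs": {"timestamps_unit": 12.5e-9},  # seconds per tick
    },
    "setup": {
        "num_spots": 1,
        "excitation_wavelengths": [532e-9],  # metadata, not raw data
    },
}

def timestamps_seconds(data):
    """Convert raw clock ticks to seconds using the stored unit.

    Because the unit lives next to the raw data, any reader can interpret
    the timestamps without knowing which hardware produced them.
    """
    unit = data["photon_data"]["timestamps_specs"]["timestamps_unit"]
    return [t * unit for t in data["photon_data"]["timestamps"]]
```

Keeping the raw arrays and their interpretation metadata in separate, well-named groups is exactly what lets one core structure represent many measurement types.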
Affiliation(s)
- Antonino Ingargiola: Dept. of Chemistry and Biochemistry, University of California Los Angeles, Los Angeles, California, USA
- Ted Laurence: Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, USA
- Robert Boutelle: Dept. of Chemistry and Biochemistry, University of California Los Angeles, Los Angeles, California, USA
- Shimon Weiss: Dept. of Chemistry and Biochemistry, University of California Los Angeles, Los Angeles, California, USA
- Xavier Michalet: Dept. of Chemistry and Biochemistry, University of California Los Angeles, Los Angeles, California, USA
14
Könnecke M, Akeroyd FA, Bernstein HJ, Brewster AS, Campbell SI, Clausen B, Cottrell S, Hoffmann JU, Jemian PR, Männicke D, Osborn R, Peterson PF, Richter T, Suzuki J, Watts B, Wintersberger E, Wuttke J. The NeXus data format. J Appl Crystallogr 2015; 48:301-305. PMID: 26089752; PMCID: PMC4453170; DOI: 10.1107/s1600576714027575.
Abstract
NeXus is an effort by an international group of scientists to define a common data exchange and archival format for neutron, X-ray and muon experiments. NeXus is built on top of the scientific data format HDF5 and adds domain-specific rules for organizing data within HDF5 files, in addition to a dictionary of well defined domain-specific field names. The NeXus data format has two purposes. First, it defines a format that can serve as a container for all relevant data associated with a beamline. This is a very important use case. Second, it defines standards in the form of application definitions for the exchange of data between applications. NeXus provides structures for raw experimental data as well as for processed data.
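The NeXus idea of "HDF5 plus domain-specific rules" can be illustrated with a toy tree: each group carries an NX_class attribute, and an NXdata group names its plottable dataset via a signal attribute, following the NeXus conventions. The tree contents below are invented for illustration.

```python
# Toy illustration of NeXus conventions on top of plain hierarchical data:
# groups tagged with an NX_class attribute, and the NXdata 'signal'
# attribute answering the question "which dataset do I plot?".
# Keys prefixed with "@" stand in for HDF5 attributes.

nexus_file = {
    "entry": {
        "@NX_class": "NXentry",
        "data": {
            "@NX_class": "NXdata",
            "@signal": "counts",
            "counts": [5, 9, 12, 7],            # the plottable signal
            "two_theta": [10.0, 10.5, 11.0, 11.5],  # an axis dataset
        },
    },
}

def default_signal(tree):
    """Walk the tree and return the dataset named by the first NXdata
    group's 'signal' attribute; None if no NXdata group is found."""
    for key, node in tree.items():
        if not isinstance(node, dict):
            continue
        if node.get("@NX_class") == "NXdata":
            return node[node["@signal"]]
        found = default_signal(node)
        if found is not None:
            return found
    return None
```

This is what the application-definition layer buys: a generic reader needs no beamline-specific knowledge to locate the data of interest.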
Affiliation(s)
- Mark Könnecke: Laboratory for Development and Methods, Paul Scherrer Institute, 5232 Villigen-PSI, Switzerland
- Frederick A Akeroyd: ISIS Facility, STFC, Rutherford Appleton Laboratory, Didcot, Oxfordshire OX11 0QX, England
- Stuart I Campbell: Spallation Neutron Source, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Björn Clausen: Los Alamos National Laboratory, Los Alamos, NM 87545, USA
- Stephen Cottrell: ISIS Facility, STFC, Rutherford Appleton Laboratory, Didcot, Oxfordshire OX11 0QX, England
- Jens Uwe Hoffmann: Helmholtz-Zentrum Berlin für Materialien und Energie GmbH, 14109 Berlin, Germany
- Pete R Jemian: Advanced Photon Source, Argonne National Laboratory, Argonne, IL 60439, USA
- Peter F Peterson: Spallation Neutron Source, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Tobias Richter: Diamond Light Source, Didcot, Oxfordshire OX11 0DE, England
- Benjamin Watts: Swiss Light Source, Paul Scherrer Institute, 5232 Villigen-PSI, Switzerland
- Joachim Wuttke: Forschungszentrum Jülich, JCNS at MLZ, 85747 Garching, Germany
15
De Carlo F, Gürsoy D, Marone F, Rivers M, Parkinson DY, Khan F, Schwarz N, Vine DJ, Vogt S, Gleber SC, Narayanan S, Newville M, Lanzirotti T, Sun Y, Hong YP, Jacobsen C. Scientific data exchange: a schema for HDF5-based storage of raw and analyzed data. J Synchrotron Radiat 2014; 21:1224-30. PMID: 25343788; DOI: 10.1107/s160057751401604x.
Abstract
Data Exchange is a simple data model designed to interface, or 'exchange', data among different instruments, and to enable sharing of data analysis tools. Data Exchange focuses on technique rather than instrument descriptions, and on provenance tracking of analysis steps and results. In this paper the successful application of the Data Exchange model to a variety of X-ray techniques, including tomography, fluorescence spectroscopy, fluorescence tomography and photon correlation spectroscopy, is described.
Affiliation(s)
- Francesco De Carlo: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Doga Gürsoy: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Federica Marone: Swiss Light Source, Paul Scherrer Institut, Villigen, Switzerland
- Mark Rivers: The University of Chicago, Center for Advanced Radiation Sources, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Faisal Khan: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Nicholas Schwarz: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- David J Vine: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Stefan Vogt: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Sophie-Charlotte Gleber: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Suresh Narayanan: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Matt Newville: The University of Chicago, Center for Advanced Radiation Sources, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Tony Lanzirotti: The University of Chicago, Center for Advanced Radiation Sources, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
- Yue Sun: Department of Physics and Astronomy, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208, USA
- Young Pyo Hong: Department of Physics and Astronomy, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208, USA
- Chris Jacobsen: Advanced Photon Source, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439, USA
16
Mayer G, Jones AR, Binz PA, Deutsch EW, Orchard S, Montecchi-Palazzi L, Vizcaíno JA, Hermjakob H, Oveillero D, Julian R, Stephan C, Meyer HE, Eisenacher M. Controlled vocabularies and ontologies in proteomics: overview, principles and practice. Biochim Biophys Acta 2014; 1844:98-107. PMID: 23429179; PMCID: PMC3898906; DOI: 10.1016/j.bbapap.2013.02.017.
Abstract
This paper focuses on the use of controlled vocabularies (CVs) and ontologies, especially in the area of proteomics, primarily related to the work of the Proteomics Standards Initiative (PSI). It describes the relevant proteomics standard formats and the ontologies used within them. Software and tools for working with these ontology files are also discussed. The article also examines the "mapping files" used to ensure that correct controlled vocabulary terms are placed within PSI standards, as well as the fulfillment of the MIAPE (Minimum Information about a Proteomics Experiment) requirements. This article is part of a Special Issue entitled: Computational Proteomics in the Post-Identification Era. Guest Editors: Martin Eisenacher and Christian Stephan.
Affiliation(s)
- Gerhard Mayer: Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
- Andrew R. Jones: Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK
- Pierre-Alain Binz: SIB Swiss Institute of Bioinformatics, Swiss-Prot group, Rue Michel-Servet 1, CH-1211 Geneva 4, Switzerland
- Eric W. Deutsch: Institute for Systems Biology, 401 Terry Avenue North, Seattle, WA 98109, USA
- Sandra Orchard: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Henning Hermjakob: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- David Oveillero: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Christian Stephan: Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany; Kairos GmbH, Universitätsstraße 136, D-44799 Bochum, Germany
- Helmut E. Meyer: Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany
- Martin Eisenacher: Medizinisches Proteom Center (MPC), Ruhr-Universität Bochum, D-44801 Bochum, Germany