Hanau F, Röst H, Ochoa I. mspack: efficient lossless and lossy mass spectrometry data compression. Bioinformatics 2021;37:3923-3925. [PMID: 34478503 DOI: 10.1093/bioinformatics/btab636]
[Received: 05/03/2021] [Revised: 08/16/2021] [Accepted: 09/01/2021] [Indexed: 11/14/2022]
Abstract
MOTIVATION
Mass spectrometry data, used for proteomics and metabolomics analyses, have grown considerably in recent years. To reduce the associated storage costs, dedicated compression algorithms for mass spectrometry (MS) data have been proposed, such as MassComp and MSNumpress. However, these algorithms focus on lossless and lossy compression, respectively, and do not exploit the additional redundancy that exists across the scans contained in a single file. We introduce mspack, a compression algorithm for MS data that exploits this additional redundancy and supports both lossless and lossy compression, as well as the mzML and the legacy mzXML formats. mspack applies several lossless preprocessing transforms and optional lossy transforms with a configurable error, followed by the general-purpose compressors gzip or bsc, to achieve a higher compression ratio.
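The general pattern of "transform, then general-purpose compressor" can be illustrated with a minimal Python sketch. This is not mspack's implementation: the abstract does not specify which transforms mspack uses, so the quantization and delta-encoding steps, the function names (quantize, delta_encode, pack, unpack) and the max_abs_error parameter are all assumptions chosen for illustration. Python's zlib module stands in for the gzip backend (both use DEFLATE).

```python
import struct
import zlib


def quantize(values, max_abs_error):
    """Hypothetical lossy transform: snap values to a grid of width
    2 * max_abs_error, so each value moves by at most max_abs_error."""
    step = 2.0 * max_abs_error
    return [round(v / step) for v in values]


def delta_encode(ints):
    """Hypothetical lossless transform: keep the first value, then store
    successive differences, which are small for sorted m/z values."""
    if not ints:
        return []
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]


def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values, max_abs_error):
    """Quantize, delta-encode, serialize as 64-bit ints, then DEFLATE."""
    deltas = delta_encode(quantize(values, max_abs_error))
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(raw, 9)


def unpack(blob, max_abs_error):
    """Reverse pack(); the result matches the input up to max_abs_error."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    step = 2.0 * max_abs_error
    return [q * step for q in delta_decode(deltas)]


# Synthetic, evenly spaced m/z values (made-up data, for illustration only).
mz = [400.0 + 0.0123 * i for i in range(1000)]
blob = pack(mz, max_abs_error=0.0005)
restored = unpack(blob, max_abs_error=0.0005)
```

The transforms do not shrink the data themselves; they make it more repetitive, so the backend compressor finds shorter encodings. Skipping the quantization step and delta-encoding an exact integer representation would give the lossless variant of the same idea.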
RESULTS
We tested mspack on several datasets generated by commonly used mass spectrometry instruments. When used with the bsc compression backend, mspack achieves on average 76% smaller file sizes for lossless compression and 94% smaller file sizes for lossy compression, compared to the original files. Lossless mspack produces files 10-60% smaller than MassComp, and lossy mspack compresses 36-60% better than the lossy MSNumpress for the same error, while exhibiting comparable accuracy and running time.
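For readers who think in compression ratios rather than size reductions, the two reported averages can be converted as follows. The symbols r (fractional size reduction) and CR (compression ratio, original size divided by compressed size) are introduced here for illustration, and the conversion applies to a single file at exactly the stated reduction; an average of per-file ratios may differ.

```latex
\mathrm{CR} = \frac{1}{1 - r}, \qquad
r = 0.76 \;\Rightarrow\; \mathrm{CR} = \frac{1}{0.24} \approx 4.2, \qquad
r = 0.94 \;\Rightarrow\; \mathrm{CR} = \frac{1}{0.06} \approx 16.7
```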
AVAILABILITY
mspack is implemented in C++ and freely available at https://github.com/fhanau/mspack under the Apache license.
SUPPLEMENTARY INFORMATION
Supplementary data are available at Bioinformatics online.