1
Park SJ, Kim S, Jeong J, No A, No JS, Park H. Reducing cost in DNA-based data storage by sequence analysis-aided soft information decoding of variable-length reads. Bioinformatics 2023; 39:btad548. PMID: 37669160; PMCID: PMC10500082; DOI: 10.1093/bioinformatics/btad548.
Abstract
MOTIVATION: DNA-based data storage is one of the most attractive research areas for future archival storage. However, its high writing and reading costs remain an obstacle to practical use. There have been many efforts to resolve this problem, but existing schemes are not fully suited to DNA-based data storage, and further cost reduction is needed.
RESULTS: We propose complete encoding and decoding procedures for DNA storage. The encoding procedure consists of a carefully designed single low-density parity-check (LDPC) code as an inter-oligo code, which corrects errors and dropouts efficiently. We apply new clustering and alignment methods that operate on variable-length reads to aid decoding performance. We use edit distance and quality scores during the sequence analysis-aided decoding procedure, which can discard abnormal reads and utilize high-quality soft information. We store 548.83 KB of an image file in DNA oligos and achieve a writing cost reduction of 7.46% and significant reading cost reductions of 26.57% and 19.41% compared with two previous works.
AVAILABILITY AND IMPLEMENTATION: Data and code for all the algorithms proposed in this study are available at: https://github.com/sjpark0905/DNA-LDPC-codes.
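The edit-distance read filtering mentioned in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's pipeline: the threshold, function names, and the use of a single reference oligo are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein dynamic program, kept to two rows of the table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution/match
        prev = cur
    return prev[-1]

def filter_reads(reads, reference, max_dist=3):
    # Discard "abnormal" reads that lie too far from the reference oligo.
    return [r for r in reads if edit_distance(r, reference) <= max_dist]

reads = ["ACGTACGTAC", "ACGTTCGTAC", "TTTTTTTTTT"]
print(filter_reads(reads, "ACGTACGTAC"))  # the all-T read is discarded
```

Edit distance (rather than Hamming distance) matters here because synthesis and sequencing introduce insertions and deletions, which produce variable-length reads.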
Affiliation(s)
- Seong-Joon Park
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
- Sunghwan Kim
- Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan 44610, South Korea
- Jaeho Jeong
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, South Korea
- Jong-Seon No
- Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea
- Hosung Park
- Department of Computer Engineering, Chonnam National University, Gwangju 61186, South Korea
- Department of ICT Convergence System Engineering, Chonnam National University, Gwangju 61186, South Korea
2
Abstract
BACKGROUND: Advances in sequencing technology have drastically reduced sequencing costs, and as a result the amount of sequencing data is growing explosively. Since FASTQ files (the standard sequencing data format) are huge, there is a need for efficient compression of FASTQ files, especially of their quality scores. Several quality score compression algorithms have recently been proposed, mainly focused on lossy compression to further boost the compression rate. However, for clinical applications and archiving purposes, lossy compression cannot replace lossless compression. One of the main challenges for lossless compression is time complexity: compressing a 1 GB file can take thousands of seconds. There are also desirable features for compression algorithms, such as random access. Therefore, there is a need for a fast lossless compressor with a reasonable compression rate and random access functionality.
RESULTS: This paper proposes FCLQC (Fast and Concurrent Lossless Quality scores Compressor), which supports random access and achieves a low running time through concurrent programming. Experimental results reveal that FCLQC is significantly faster than the baseline compressors at both compression and decompression, at the expense of compression ratio. Compared with LCQS (a baseline quality score compression algorithm), FCLQC shows at least a 31x improvement in compression speed in all settings, with a degradation in compression ratio of at most 13.58% (8.26% on average). Compared with general-purpose compressors (such as 7-zip), FCLQC compresses 3x faster while achieving compression ratios that are better by at least 2.08% (4.69% on average). Moreover, its random access decompression speed also outperforms the others. The concurrency of FCLQC is implemented in Rust; the performance gain increases near-linearly with the number of threads.
CONCLUSION: Its superior compression and decompression speed makes FCLQC a practical candidate lossless quality score compressor for speed-sensitive applications of DNA sequencing data. FCLQC is available at https://github.com/Minhyeok01/FCLQC and is free for non-commercial use.
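The combination of concurrency and random access described above follows from compressing independent blocks. A generic sketch, using `zlib` and a thread pool as stand-ins (FCLQC's actual format, algorithm, and block size differ; `BLOCK` is an assumption for this demo):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed block size for this demo

def compress_blocks(data: bytes, workers: int = 4):
    # Independent per-block compression enables both concurrency and
    # random access: any block decodes without touching its neighbours.
    blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))

def decompress_block(packed, k):
    # Random access: only the k-th block is decompressed.
    return zlib.decompress(packed[k])

quals = b"IIIIFFFF:::::IIII" * 2000  # toy Phred+33 quality stream
packed = compress_blocks(quals)
print(decompress_block(packed, 1) == quals[BLOCK:2 * BLOCK])  # True
```

Per-block compression costs some ratio (each block loses cross-block context), which mirrors the speed-for-ratio trade-off the abstract reports.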
Affiliation(s)
- Minhyeok Cho
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Republic of Korea
3
Jeong J, Park SJ, Kim JW, No JS, Jeon HH, Lee JW, No A, Kim S, Park H. Cooperative Sequence Clustering and Decoding for DNA Storage System with Fountain Codes. Bioinformatics 2021; 37:3136-3143. PMID: 33904574; DOI: 10.1093/bioinformatics/btab246.
Abstract
MOTIVATION: In DNA storage systems, there is a tradeoff between writing and reading costs. Increasing the code rate of error-correcting codes may save writing cost, but it requires more sequence reads for data retrieval. There is potentially a way to improve the sequencing and decoding processes so that the reading cost induced by this tradeoff is reduced without increasing the writing cost. In past research, clustering, alignment, and decoding were treated as separate stages, but we believe that using information from all of these processes together can improve decoding performance. Actual DNA synthesis and sequencing experiments should be performed, because simulations cannot be relied on to cover all error possibilities in practical circumstances.
RESULTS: For DNA storage systems using a fountain code and a Reed-Solomon (RS) code, we introduce several techniques to improve decoding performance. We designed the decoding process around the cooperation of key components: Hamming-distance based clustering, discarding of abnormal sequence reads, RS error correction as well as detection, and quality score-based ordering of sequences. We synthesized 513.6 KB of data into DNA oligo pools and sequenced the data successfully with an Illumina MiSeq instrument. Compared with Erlich's research, the proposed decoding method additionally incorporates sequence reads with minor errors that had previously been discarded, and was thus able to use 10.6-11.9% more sequence reads from the same sequencing environment; this resulted in a 6.5-8.9% reduction in reading cost. Channel characteristics, including sequence coverage and read-length distributions, are provided as well.
AVAILABILITY: The raw data files and the source code of our experiments are available at: https://github.com/jhjeong0702/dna-storage.
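Hamming-distance based clustering of equal-length reads can be sketched as a greedy single pass. This is an illustrative toy under assumed parameters (threshold, representative choice), not the paper's implementation:

```python
def hamming(a: str, b: str) -> int:
    # Positional mismatch count for equal-length strings.
    return sum(x != y for x, y in zip(a, b))

def cluster_reads(reads, threshold=1):
    # Greedy single-pass clustering: join the first cluster whose
    # representative is within the threshold, else open a new cluster.
    clusters = []  # list of (representative, members) pairs
    for r in reads:
        for rep, members in clusters:
            if hamming(rep, r) <= threshold:
                members.append(r)
                break
        else:
            clusters.append((r, [r]))
    return clusters

reads = ["ACGTACGT", "ACGAACGT", "TTGTACGA", "TTGTACGT"]
print(len(cluster_reads(reads)))  # 2 clusters
```

Each cluster then ideally collects all noisy copies of one synthesized oligo, so a consensus (or soft information) can be formed per oligo before RS decoding.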
Affiliation(s)
- Jaeho Jeong
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Seong-Joon Park
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Jae-Won Kim
- Department of Electronic Engineering, Gyeongsang National University, Jinju, Korea
- Jong-Seon No
- Department of Electrical and Computer Engineering, Seoul National University, Seoul, Korea
- Ha Hyeon Jeon
- Department of Chemical Engineering, POSTECH, Pohang, Korea
- Jeong Wook Lee
- Department of Chemical Engineering, POSTECH, Pohang, Korea
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul, Korea
- Sunghwan Kim
- School of Electrical Engineering, University of Ulsan, Ulsan, Korea
- Hosung Park
- Department of Computer Engineering and Department of ICT Convergence System Engineering, Chonnam National University, Gwangju, Korea
4
No A, Hernaez M, Ochoa I. CROMqs: An infinitesimal successive refinement lossy compressor for the quality scores. J Bioinform Comput Biol 2020; 18:2050031. PMID: 32938284; DOI: 10.1142/s0219720020500316.
Abstract
The amount of sequencing data is growing at a fast pace due to a rapid revolution in sequencing technologies. Quality scores, which indicate the reliability of each called nucleotide, take up a significant portion of the sequencing data. In addition, quality scores are more challenging to compress than nucleotides, and they are often noisy. Hence, a natural solution for further decreasing the size of sequencing data is to apply lossy compression to the quality scores. Lossy compression may result in a loss of precision; however, it has been shown that, when operating at certain rates, lossy compression can achieve variant calling performance similar to that achieved with losslessly compressed data (i.e. the original data). We propose Coding with Random Orthogonal Matrices for quality scores (CROMqs), the first lossy compressor designed for quality scores with the "infinitesimal successive refinability" property. With this property, the encoder needs to compress the data only once, at a high rate, while the decoder can decompress it iteratively, reconstructing the set of quality scores at each step with reduced distortion. This characteristic is especially useful in sequencing data compression, since the encoder generally does not know the most appropriate rate of compression, e.g. for not degrading variant calling accuracy. CROMqs avoids the need to compress the data at multiple rates, hence incurring time savings. In addition to this property, we show that CROMqs obtains rate-distortion performance comparable to the state-of-the-art lossy compressors. Moreover, we also show that it achieves variant calling performance comparable to that of the losslessly compressed data while achieving more than a 50% reduction in size.
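The successive-refinement property, that is, compress once at a high rate and let the decoder stop anywhere with steadily decreasing distortion, can be illustrated with a bitplane toy. CROMqs itself uses random orthogonal matrices; the bitplane scheme, the 6-bit range, and the midpoint fill below are illustrative assumptions only:

```python
def encode_bitplanes(scores, nbits=6):
    # Encode once: emit bitplanes from most to least significant.
    return [[(s >> b) & 1 for s in scores] for b in range(nbits - 1, -1, -1)]

def decode_prefix(planes, k, nbits=6):
    # Reconstruct from only the first k planes; unknown low bits are
    # filled with the midpoint of the remaining range.
    vals = [0] * len(planes[0])
    for b, plane in zip(range(nbits - 1, nbits - 1 - k, -1), planes[:k]):
        for i, bit in enumerate(plane):
            vals[i] |= bit << b
    if k < nbits:
        half = (1 << (nbits - k)) // 2
        vals = [v + half for v in vals]
    return vals

scores = [40, 37, 12, 2, 30]
planes = encode_bitplanes(scores)
mse = lambda xs: sum((x - s) ** 2 for x, s in zip(xs, scores)) / len(scores)
# Distortion shrinks as the decoder reads more planes from the same stream.
print(mse(decode_prefix(planes, 2)), mse(decode_prefix(planes, 4)), mse(decode_prefix(planes, 6)))
```

The point of the toy is the interface, not the codec: one encoded stream, many stopping points, monotonically improving reconstructions.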
Affiliation(s)
- Albert No
- Electronic and Electrical Engineering, Hongik University, 94 Wausan-ro, Mapo-gu, Seoul 04066, Korea
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 W Gregory Dr, Urbana, IL 61801, USA
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, 1308 W Main Street, Urbana, IL 61801, USA
5
No A. Universality of Logarithmic Loss in Fixed-Length Lossy Compression. Entropy (Basel) 2019; 21:e21060580. PMID: 33267294; PMCID: PMC7515068; DOI: 10.3390/e21060580.
Abstract
We establish the universality of logarithmic loss over a finite alphabet as a distortion criterion in fixed-length lossy compression. For any fixed-length lossy compression problem under an arbitrary distortion criterion, we show that there is an equivalent lossy compression problem under logarithmic loss. The equivalence is strong in the sense that finding good schemes in the corresponding lossy compression problem under logarithmic loss is essentially equivalent to finding good schemes in the original problem. This equivalence relation also provides an algebraic structure on the reconstruction alphabet, which allows us to use known techniques from the clustering literature. Furthermore, our result naturally suggests a new clustering algorithm for the categorical data-clustering problem.
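To make the distortion criterion concrete: under logarithmic loss the "reconstruction" is a probability distribution q over the source alphabet, the loss for a symbol x is log(1/q(x)), and its expectation under the true distribution p is the cross-entropy H(p, q). A minimal sketch under these standard definitions:

```python
import math

def log_loss(x, q):
    # Logarithmic loss in bits: penalizes assigning low probability to x.
    return math.log2(1.0 / q[x])

def expected_log_loss(p, q):
    # Cross-entropy H(p, q) = sum_x p(x) * log(1/q(x)).
    return sum(px * log_loss(x, q) for x, px in p.items())

q = {"a": 0.5, "b": 0.25, "c": 0.25}
print(log_loss("a", q), log_loss("b", q))  # 1.0 2.0
p = {"a": 0.5, "b": 0.25, "c": 0.25}
print(expected_log_loss(p, q))  # 1.5 bits: equals H(p) when q == p
```

Because reconstructions are distributions, minimizing average log loss over a partition of the data is a soft-assignment objective, which is what connects this criterion to the clustering literature mentioned above.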
Affiliation(s)
- Albert No
- Department of Electronic and Electrical Engineering, Hongik University, Seoul 04066, Korea
6
Ochoa I, No A, Hernaez M, Weissman T. CROMqs: an infinitesimal successive refinement lossy compressor for the quality scores. Proc Inf Theory Workshop 2016; 2016:121-125. PMID: 29806047; DOI: 10.1109/itw.2016.7606808.
Abstract
Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Much of the data consists of nucleotides and the corresponding quality scores that indicate their reliability. The latter are more difficult to compress and are themselves noisy. As a result, lossy compression of the quality scores has recently been proposed to alleviate storage costs. Further, it has been shown that lossy compression, at certain rates, can achieve variant calling performance similar to that achieved with losslessly compressed data. We propose CROMqs, a new lossy compressor for quality scores with the property of "infinitesimal successive refinability". This property allows the decoder to decompress the data iteratively without having to agree with the encoder on a specific rate prior to compression. This characteristic is particularly appealing in practice, as in most cases the appropriate rate at which the lossy compressor should operate cannot be established prior to compression. Further, this property can be of interest in scenarios involving streaming of genomic data. CROMqs is the first infinitesimal successive refinement lossy compressor for quality scores in the literature, and we show that it obtains rate-distortion performance comparable to previously proposed algorithms. Moreover, we also show that CROMqs achieves variant calling performance comparable to that of the losslessly compressed data.
Affiliation(s)
- I Ochoa
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- A No
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- M Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- T Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
7
Abstract
We begin by presenting a simple lossy compressor operating at near-zero rate: the encoder merely describes the indices of the few maximal source components, while the decoder's reconstruction is a natural estimate of the source components based on this information. This scheme turns out to be near optimal for the memoryless Gaussian source in the sense of achieving the zero-rate slope of its distortion-rate function. Motivated by this finding, we then propose a scheme that iterates the above lossy compressor on an appropriately transformed version of the difference between the source and its reconstruction from the previous iteration. The proposed scheme achieves the rate-distortion function of the memoryless Gaussian source (under squared error distortion) when employed on any finite-variance ergodic source. It further possesses desirable properties that we refer to, respectively, as infinitesimal successive refinability, ratelessness, and complete separability. Its storage and computation requirements are of order no more than n²/log^β n per source symbol for β > 0 at both the encoder and the decoder. Though the details of its derivation, construction, and analysis differ considerably, we discuss similarities between the proposed scheme and the recently introduced Sparse Regression Codes of Venkataramanan et al.
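The near-zero-rate compressor in the first sentence can be sketched directly. This toy assumes an i.i.d. N(0,1) source and uses sqrt(2 ln n), the typical magnitude of the maximum of n standard Gaussians, as the decoder's estimate; that constant is a simplification of the paper's estimator:

```python
import math
import random

def encode(x, k):
    # The encoder describes only the indices of the k largest components.
    return sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:k]

def decode(indices, n):
    # Natural estimate for the maxima of n i.i.d. N(0,1) samples; zero elsewhere.
    est = math.sqrt(2 * math.log(n))
    idx = set(indices)
    return [est if i in idx else 0.0 for i in range(n)]

random.seed(1)
n = 10000
x = [random.gauss(0, 1) for _ in range(n)]
xhat = decode(encode(x, k=10), n)
mse_coded = sum((a - b) ** 2 for a, b in zip(x, xhat)) / n
mse_zero = sum(a * a for a in x) / n  # zero-rate baseline: all-zero output
print(mse_coded < mse_zero)  # describing a few maxima already lowers distortion
```

The full scheme in the abstract then iterates this primitive on a transformed residual, which is what turns a near-zero-rate building block into a rateless code that sweeps out the whole distortion-rate curve.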
Affiliation(s)
- Albert No
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA