Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Deorowicz S, Grabowski S. Data compression for sequencing data. Algorithms Mol Biol 2013;8:25. [PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2013] [Accepted: 09/25/2013] [Indexed: 12/12/2022] Open

For:	Deorowicz S, Grabowski S. Data compression for sequencing data. Algorithms Mol Biol 2013;8:25. [PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2013] [Accepted: 09/25/2013] [Indexed: 12/12/2022] Open

Number

Cited by Other Article(s)

Roy S, Mukhopadhyay A. A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression. Gene 2024;907:148235. [PMID: 38342250 DOI: 10.1016/j.gene.2024.148235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 01/13/2024] [Accepted: 01/30/2024] [Indexed: 02/13/2024]

Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.04.15.536996. [PMID: 37131636 PMCID: PMC10153118 DOI: 10.1101/2023.04.15.536996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]

Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023;24:bbad320. [PMID: 37668049 DOI: 10.1093/bib/bbad320] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Revised: 08/16/2023] [Accepted: 08/18/2023] [Indexed: 09/06/2023] Open

Kryukov K, Jin L, Nakagawa S. Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. PATTERNS (NEW YORK, N.Y.) 2022;3:100562. [PMID: 35818472 PMCID: PMC9259476 DOI: 10.1016/j.patter.2022.100562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]

A Hybrid Data-Differencing and Compression Algorithm for the Automotive Industry. ENTROPY 2022;24:e24050574. [PMID: 35626459 PMCID: PMC9140898 DOI: 10.3390/e24050574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/07/2022] [Revised: 04/08/2022] [Accepted: 04/14/2022] [Indexed: 11/29/2022]

Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Sequence Compression Benchmark (SCB) database-A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. Gigascience 2021;9:5867695. [PMID: 32627830 PMCID: PMC7336184 DOI: 10.1093/gigascience/giaa072] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2020] [Revised: 06/01/2020] [Accepted: 06/15/2020] [Indexed: 01/22/2023] Open

Sardaraz M, Tahir M. SCA-NGS: Secure compression algorithm for next generation sequencing data using genetic operators and block sorting. Sci Prog 2021;104:368504211023276. [PMID: 34143692 PMCID: PMC10454964 DOI: 10.1177/00368504211023276] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]

Karcioglu AA, Bulut H. Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences. Comput Biol Med 2021;131:104292. [PMID: 33662682 DOI: 10.1016/j.compbiomed.2021.104292] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2020] [Revised: 02/16/2021] [Accepted: 02/16/2021] [Indexed: 12/22/2022]

Liu Y, Wong L, Li J. Allowing mutations in maximal matches boosts genome compression performance. Bioinformatics 2020;36:4675-4681. [PMID: 33118018 DOI: 10.1093/bioinformatics/btaa572] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2020] [Revised: 05/05/2020] [Accepted: 06/10/2020] [Indexed: 01/23/2023] Open

Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020;35:2066-2074. [PMID: 30407482 DOI: 10.1093/bioinformatics/bty936] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2018] [Revised: 11/04/2018] [Accepted: 11/07/2018] [Indexed: 01/23/2023] Open

Abstract

MOTIVATION

Advanced high-throughput sequencing technologies have produced massive amount of reads data, and algorithms have been specially designed to contract the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective reads compression approach. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs.

RESULTS

We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. These excellent performances are mainly attributed to the exploit of the redundancy of the repetitive substrings in the long contigs.

AVAILABILITY AND IMPLEMENTATION

https://github.com/yuansliu/minicom.

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One 2020;15:e0232942. [PMID: 32453750 PMCID: PMC7250429 DOI: 10.1371/journal.pone.0232942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 04/25/2020] [Indexed: 11/19/2022] Open

Clinical application of genomic high-throughput data: Infrastructural, ethical, legal and psychosocial aspects. Eur Neuropsychopharmacol 2020;31:1-15. [PMID: 31866110 DOI: 10.1016/j.euroneuro.2019.09.008] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/22/2017] [Revised: 11/03/2018] [Accepted: 09/20/2019] [Indexed: 12/28/2022]

FQSqueezer: k-mer-based compression of sequencing data. Sci Rep 2020;10:578. [PMID: 31953467 PMCID: PMC6969201 DOI: 10.1038/s41598-020-57452-6] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2019] [Accepted: 12/19/2019] [Indexed: 12/26/2022] Open

Buchmann J, Geihs M, Hamacher K, Katzenbeisser S, Stammler S. Long-term integrity protection of genomic data. EURASIP JOURNAL ON INFORMATION SECURITY 2019. [DOI: 10.1186/s13635-019-0099-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open

Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2019;34:2748-2756. [PMID: 29617939 DOI: 10.1093/bioinformatics/bty205] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2017] [Accepted: 03/27/2018] [Indexed: 12/29/2022] Open

Hernaez M, Pavlichin D, Weissman T, Ochoa I. Genomic Data Compression. Annu Rev Biomed Data Sci 2019. [DOI: 10.1146/annurev-biodatasci-072018-021229] [Citation(s) in RCA: 29] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]

Danek A, Deorowicz S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics 2019;34:1834-1840. [PMID: 29351600 DOI: 10.1093/bioinformatics/bty023] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2017] [Accepted: 01/15/2018] [Indexed: 12/20/2022] Open

Deorowicz S, Debudaj-Grabysz A, Gudyś A, Grabowski S. Whisper: read sorting allows robust mapping of DNA sequencing data. Bioinformatics 2019;35:2043-2050. [PMID: 30407485 DOI: 10.1093/bioinformatics/bty927] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2017] [Revised: 10/16/2018] [Accepted: 11/06/2018] [Indexed: 11/13/2022] Open

Najam M, Rasool RU, Ahmad HF, Ashraf U, Malik AW. Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters. BIOMED RESEARCH INTERNATIONAL 2019;2019:7074387. [PMID: 31111064 PMCID: PMC6487161 DOI: 10.1155/2019/7074387] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/22/2019] [Revised: 03/29/2019] [Accepted: 03/31/2019] [Indexed: 12/24/2022]

Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the Challenges of FASTQ Referential Compression. Bioinform Biol Insights 2019;13:1177932218821373. [PMID: 30792576 PMCID: PMC6376532 DOI: 10.1177/1177932218821373] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2018] [Accepted: 11/26/2018] [Indexed: 11/16/2022] Open

Sardaraz M, Tahir M. FCompress: An Algorithm for FASTQ Sequence Data Compression. Curr Bioinform 2019. [DOI: 10.2174/1574893613666180322125337] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]

Cheng KO, Law NF, Siu WC. Clustering-Based Compression for Population DNA Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019;16:208-221. [PMID: 29028207 DOI: 10.1109/tcbb.2017.2762302] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]

Wasik S, Szostak N, Kudla M, Wachowiak M, Krawiec K, Blazewicz J. Detecting life signatures with RNA sequence similarity measures. J Theor Biol 2018;463:110-120. [PMID: 30562502 DOI: 10.1016/j.jtbi.2018.12.018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2017] [Revised: 10/25/2018] [Accepted: 12/14/2018] [Indexed: 12/20/2022]

Wandelt S, Sun X, Leser U. Column-wise compression of open relational data. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.04.074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/16/2022]

Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. J Comput Biol 2018;25:825-836. [PMID: 30011247 DOI: 10.1089/cmb.2018.0068] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open

Deorowicz S, Walczyszyn J, Debudaj-Grabysz A. CoMSA: compression of protein multiple sequence alignment files. Bioinformatics 2018;35:227-234. [DOI: 10.1093/bioinformatics/bty619] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2017] [Accepted: 07/10/2018] [Indexed: 11/13/2022] Open

Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018;34:1682-1689. [PMID: 29253072 PMCID: PMC6355110 DOI: 10.1093/bioinformatics/btx809] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2017] [Revised: 12/11/2017] [Accepted: 12/14/2017] [Indexed: 11/13/2022] Open

Cheng H, Wu M, Xu Y. FMtree: a fast locating algorithm of FM-indexes for genomic data. Bioinformatics 2018;34:416-424. [PMID: 28968761 DOI: 10.1093/bioinformatics/btx596] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2017] [Accepted: 09/16/2017] [Indexed: 11/15/2022] Open

Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. LECTURE NOTES IN COMPUTER SCIENCE 2017. [DOI: 10.1007/978-3-319-56970-3_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]

Comparison of high-throughput sequencing data compression tools. Nat Methods 2016;13:1005-1008. [PMID: 27776113 DOI: 10.1038/nmeth.4037] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022]

Ke R, Mignardi M, Hauling T, Nilsson M. Fourth Generation of Next-Generation Sequencing Technologies: Promise and Consequences. Hum Mutat 2016;37:1363-1367. [PMID: 27406789 PMCID: PMC5111608 DOI: 10.1002/humu.23051] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2016] [Revised: 06/08/2016] [Accepted: 07/07/2016] [Indexed: 01/02/2023]

Wu TD. Bitpacking techniques for indexing genomes: I. Hash tables. Algorithms Mol Biol 2016;11:5. [PMID: 27095998 PMCID: PMC4835851 DOI: 10.1186/s13015-016-0069-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2015] [Accepted: 04/01/2016] [Indexed: 11/20/2022] Open

Kim M, Zhang X, Ligo JG, Farnoud F, Veeravalli VV, Milenkovic O. MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinformatics 2016;17:94. [PMID: 26895947 PMCID: PMC4759986 DOI: 10.1186/s12859-016-0932-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2015] [Accepted: 02/02/2016] [Indexed: 12/15/2022] Open

Abstract

BACKGROUND

Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1-10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression.

RESULTS

MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes.

CONCLUSIONS

We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline.

AVAILABILITY

The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that to run the code one needs a minimum of 16 GB of RAM. In addition, virtual box is set up on a 4GB RAM machine for users to run a simple demonstration.

Collapse

Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 2015;14:1630002. [PMID: 26846812 DOI: 10.1142/s0219720016300021] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]

Cheng KO, Wu P, Law NF, Siu WC. Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015;12:1322-1332. [PMID: 26671804 DOI: 10.1109/tcbb.2015.2403370] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]

Wandelt S, Leser U. Sequence Factorization with Multiple References. PLoS One 2015;10:e0139000. [PMID: 26422374 PMCID: PMC4589410 DOI: 10.1371/journal.pone.0139000] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2014] [Accepted: 09/07/2015] [Indexed: 11/29/2022] Open

Abstract

The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

Collapse

Loeffelholz M, Fofanov Y. The main challenges that remain in applying high-throughput sequencing to clinical diagnostics. Expert Rev Mol Diagn 2015;15:1405-8. [DOI: 10.1586/14737159.2015.1088385] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Ochoa I, Hernaez M, Weissman T. Aligned genomic data compression via improved modeling. J Bioinform Comput Biol 2015;12:1442002. [PMID: 25395305 DOI: 10.1142/s0219720014420025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]

Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 2015;31:2770-7. [PMID: 25910696 PMCID: PMC4547610 DOI: 10.1093/bioinformatics/btv248] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2014] [Revised: 04/11/2015] [Accepted: 04/20/2015] [Indexed: 01/24/2023] Open

Alves F, Cogo V, Wandelt S, Leser U, Bessani A. On-Demand Indexing for Referential Compression of DNA Sequences. PLoS One 2015;10:e0132460. [PMID: 26146838 PMCID: PMC4493149 DOI: 10.1371/journal.pone.0132460] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2015] [Accepted: 06/15/2015] [Indexed: 12/15/2022] Open

Deorowicz S, Danek A, Niemiec M. GDC 2: Compression of large collections of genomes. Sci Rep 2015;5:11565. [PMID: 26108279 PMCID: PMC4479802 DOI: 10.1038/srep11565] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2015] [Accepted: 05/28/2015] [Indexed: 01/18/2023] Open

Pratas D, Silva RM, Pinho AJ, Ferreira PJ. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 2015;5:10203. [PMID: 25984837 PMCID: PMC4434998 DOI: 10.1038/srep10203] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2014] [Accepted: 04/07/2015] [Indexed: 12/19/2022] Open

Matos LMO, Neves AJR, Pratas D, Pinho AJ. MAFCO: a compression tool for MAF files. PLoS One 2015;10:e0116082. [PMID: 25816229 PMCID: PMC4376647 DOI: 10.1371/journal.pone.0116082] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2014] [Accepted: 12/05/2014] [Indexed: 01/03/2023] Open

Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. ACTA ACUST UNITED AC 2014;31:1389-95. [PMID: 25536966 DOI: 10.1093/bioinformatics/btu844] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 12/17/2014] [Indexed: 11/14/2022]

Ochoa I, Hernaez M, Weissman T. iDoComp: a compression scheme for assembled genomes. Bioinformatics 2014;31:626-33. [PMID: 25344501 DOI: 10.1093/bioinformatics/btu698] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Danek A, Deorowicz S, Grabowski S. Indexes of large genome collections on a PC. PLoS One 2014;9:e109384. [PMID: 25289699 PMCID: PMC4188820 DOI: 10.1371/journal.pone.0109384] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Accepted: 09/07/2014] [Indexed: 02/02/2023] Open

Reference-free SNP detection: dealing with the data deluge. BMC Genomics 2014;15 Suppl 4:S10. [PMID: 25056481 PMCID: PMC4083407 DOI: 10.1186/1471-2164-15-s4-s10] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open

Roguski Ł, Deorowicz S. DSRC 2—Industry-oriented compression of FASTQ files. Bioinformatics 2014;30:2213-5. [DOI: 10.1093/bioinformatics/btu208] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open