1. Müntefering F, Adhisantoso YG, Chandak S, Ostermann J, Hernaez M, Voges J. Genie: the first open-source ISO/IEC encoder for genomic data. Commun Biol 2024; 7:553. PMID: 38724695; PMCID: PMC11082222; DOI: 10.1038/s42003-024-06249-8.
Abstract
For the last two decades, the amount of genomic data produced by scientific and medical applications has been growing at a rapid pace. To enable software solutions that analyze, process, and transmit these data efficiently and interoperably, ISO and IEC released the first version of the MPEG-G compression standard in 2019. However, no non-proprietary implementation of the standard has been openly available so far, limiting fair scientific assessment of the standard and therefore hindering its broad adoption. In this paper, we present Genie, to the best of our knowledge the first open-source encoder that compresses genomic data according to the MPEG-G standard. We demonstrate that Genie reaches state-of-the-art compression ratios while offering interoperability with any standard-compliant decoder, independent of its manufacturer. Finally, the ISO/IEC ecosystem ensures the long-term sustainability and decodability of the compressed data through the ISO/IEC-supported reference decoder.
Affiliation(s)
- Fabian Müntefering: Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Yeremia Gunawan Adhisantoso: Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Shubham Chandak: Department of Electrical Engineering, Stanford University, 350 Jane Stanford Way, Stanford, CA, 94305, USA
- Jörn Ostermann: Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
- Mikel Hernaez: Center for Applied Medical Research (CIMA), University of Navarra, Av. de Pío XII, 55, Pamplona, 31008, Navarra, Spain
- Jan Voges: Institut für Informationsverarbeitung (TNT), Leibniz University Hannover, Appelstraße 9a, Hannover, 30167, Germany
2. Sun H, Zheng Y, Xie H, Ma H, Liu X, Wang G. PMFFRC: a large-scale genomic short reads compression optimizer via memory modeling and redundant clustering. BMC Bioinformatics 2023; 24:454. PMID: 38036969; PMCID: PMC10691058; DOI: 10.1186/s12859-023-05566-9.
Abstract
BACKGROUND Genomic sequencing reads compressors are essential for balancing the generation speed of high-throughput short reads, large-scale genomic data sharing, and infrastructure storage expenditure. However, most existing short-read compressors rarely exploit big-memory systems or the duplicative information between diverse sequencing files to achieve a higher compression ratio and conserve storage space. RESULTS Using compression ratio as the optimization objective, we propose PMFFRC, a large-scale compression optimizer for genomic sequencing short reads, built on novel memory modeling and redundant-reads clustering. By cascading PMFFRC on 982 GB of FASTQ-format sequencing data (274 GB and 3.3 billion short reads), the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve average maximum compression ratio gains of 77.89%, 77.56%, 73.51%, and 29.36%, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS PMFFRC makes rational use of the compression server's large memory, effectively reducing the storage footprint of sequencing reads and thereby relieving storage infrastructure costs and community data-sharing overhead. Our work furnishes a novel solution for improving sequencing reads compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, freely available at https://github.com/fahaihi/PMFFRC.
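The clustering idea behind this optimizer can be illustrated with a toy sketch: group sequencing files by k-mer profile similarity so that redundant files are handed to a reads compressor as one batch. All function names here are illustrative, and this is a naive stand-in for PMFFRC's actual memory modeling and clustering.

```python
from collections import Counter

def kmer_profile(reads, k=4):
    """Count k-mer frequencies over a collection of reads."""
    prof = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            prof[read[i:i + k]] += 1
    return prof

def cosine_similarity(p, q):
    """Cosine similarity between two sparse k-mer count profiles."""
    dot = sum(c * q.get(km, 0) for km, c in p.items())
    norm = lambda r: sum(c * c for c in r.values()) ** 0.5
    return dot / (norm(p) * norm(q)) if p and q else 0.0

def greedy_cluster(profiles, threshold=0.5):
    """Greedily group files whose profiles are similar, so each
    cluster can be compressed together and share redundancy."""
    clusters = []
    for name, prof in profiles.items():
        for cluster in clusters:
            if cosine_similarity(prof, profiles[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters
```

Files with overlapping read content end up in the same cluster, while dissimilar files stay apart, which is the precondition for the cross-file redundancy gains reported in the abstract.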
Affiliation(s)
- Hui Sun: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Yingfeng Zheng: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Haonan Xie: Institute of Artificial Intelligence, School of Electrical Engineering, Guangxi University, Nanning, China
- Huidong Ma: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Xiaoguang Liu: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
- Gang Wang: Nankai-Baidu Joint Laboratory, College of Computer Science, Nankai University, Tianjin, China
3. Cracco A, Tomescu AI. Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. Genome Res 2023; 33:1198-1207. PMID: 37253540; PMCID: PMC10538363; DOI: 10.1101/gr.277615.122.
Abstract
Compacted de Bruijn graphs are one of the most fundamental data structures in computational genomics. Colored compacted de Bruijn graphs are a variant built on a collection of sequences, associating each k-mer with the sequences in which it appears. We present GGCAT, a tool for constructing both types of graphs, based on a new approach merging the k-mer counting step with the unitig construction step, as well as on numerous practical optimizations. For compacted de Bruijn graph construction, GGCAT achieves speed-ups of 3× to 21× compared with the state-of-the-art tool Cuttlefish 2. When constructing the colored variant, GGCAT achieves speed-ups of 5× to 39× compared with the state-of-the-art tool BiFrost. Additionally, GGCAT is up to 480× faster than BiFrost for batch sequence queries on colored graphs.
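The two steps GGCAT merges, k-mer collection and unitig construction, can be sketched minimally in Python. This is a naive illustration of what a compacted de Bruijn graph is, not GGCAT's algorithm (which is parallel, disk-based, and handles reverse complements and cycles); the function name is illustrative.

```python
from collections import defaultdict

def compacted_unitigs(reads, k=4):
    """Toy compacted de Bruijn graph construction: collect k-mers,
    then merge each maximal non-branching path into one unitig.
    Cycles and reverse complements are ignored for brevity."""
    kmers = set()
    for r in reads:
        for i in range(len(r) - k + 1):
            kmers.add(r[i:i + k])
    # successors/predecessors restricted to k-mers present in the set
    succ = {km: [km[1:] + b for b in "ACGT" if km[1:] + b in kmers]
            for km in kmers}
    pred = defaultdict(list)
    for km, nxts in succ.items():
        for nxt in nxts:
            pred[nxt].append(km)

    def is_start(km):
        # a k-mer starts a unitig unless it merely continues a unique path
        return not (len(pred[km]) == 1 and len(succ[pred[km][0]]) == 1)

    unitigs = []
    for km in sorted(kmers):
        if not is_start(km):
            continue
        unitig, cur = km, km
        while len(succ[cur]) == 1 and len(pred[succ[cur][0]]) == 1:
            cur = succ[cur][0]
            unitig += cur[-1]  # extend by the new k-mer's last base
        unitigs.append(unitig)
    return sorted(unitigs)
```

A single read collapses to one unitig, while a branch point (two reads diverging after a shared prefix) splits the graph into three unitigs.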
Affiliation(s)
- Andrea Cracco: Department of Computer Science, University of Verona, 37134 Verona, Italy
- Alexandru I Tomescu: Department of Computer Science, University of Helsinki, Helsinki 00560, Finland
4. Gibney D, Thankachan SV, Aluru S. On the Hardness of Sequence Alignment on De Bruijn Graphs. J Comput Biol 2022; 29:1377-1396. DOI: 10.1089/cmb.2022.0411.
Affiliation(s)
- Daniel Gibney: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
- Sharma V. Thankachan: Department of Computer Science, North Carolina State University, Raleigh, North Carolina, USA
- Srinivas Aluru: School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
5. Khan J, Kokot M, Deorowicz S, Patro R. Scalable, ultra-fast, and low-memory construction of compacted de Bruijn graphs with Cuttlefish 2. Genome Biol 2022; 23:190. PMID: 36076275; PMCID: PMC9454175; DOI: 10.1186/s13059-022-02743-6.
Abstract
The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state of the art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58 Tbp, from 4.5 days to 17-23 h; and it constructs the graph for 1.52 Tbp of white spruce reads in approximately 10 h, while the closest competitor requires 54-58 h and considerably more memory.
Affiliation(s)
- Jamshed Khan: Department of Computer Science, University of Maryland, College Park, USA; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
- Marek Kokot: Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Sebastian Deorowicz: Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Rob Patro: Department of Computer Science, University of Maryland, College Park, USA; Center for Bioinformatics and Computational Biology, University of Maryland, College Park, USA
6. Kryukov K, Jin L, Nakagawa S. Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. Patterns 2022; 3:100562. PMID: 35818472; PMCID: PMC9259476; DOI: 10.1016/j.patter.2022.100562.
7. Lee D, Song G. FastqCLS: a FASTQ compressor for long-read sequencing via read reordering using a novel scoring model. Bioinformatics 2022; 38:351-356. PMID: 34623374; DOI: 10.1093/bioinformatics/btab696.
Abstract
MOTIVATION Over the past decades, vast amounts of genome sequencing data have been produced, requiring an enormous level of storage capacity. The time and resources needed to store and transfer such data cause bottlenecks in genome sequencing analysis. To resolve this issue, various compression techniques have been proposed to reduce the size of original FASTQ raw sequencing data, but these remain suboptimal. Moreover, long-read sequencing has become dominant in genomics, whereas most existing compression methods focus on short-read sequencing only. RESULTS We designed a compression algorithm based on read reordering using a novel scoring model to reduce FASTQ file size with no information loss. We integrated all data processing steps into a software package called FastqCLS and provide it as a Docker image for easy installation and execution. We compared our method with major existing FASTQ compression tools using benchmark datasets, including new long-read sequencing data in this validation. FastqCLS outperformed the other tools in compression ratio for long-read sequencing data. AVAILABILITY AND IMPLEMENTATION FastqCLS can be downloaded from https://github.com/krlucete/FastqCLS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dohyeon Lee: School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
- Giltae Song: School of Computer Science and Engineering, Pusan National University, Busan 46241, South Korea
8. Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Sequence Compression Benchmark (SCB) database-A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. Gigascience 2020; 9:giaa072. PMID: 32627830; PMCID: PMC7336184; DOI: 10.1093/gigascience/giaa072.
Abstract
Background Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results. Conclusion We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
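The core measurement loop of such a benchmark is easy to reproduce with Python's standard-library compressors. This is a minimal sketch of the methodology (compressed size, compression time, lossless round-trip check); the actual SCB harness covers 48 tools, 430 settings, and 17 performance measures.

```python
import bz2
import lzma
import time
import zlib

def benchmark(data: bytes):
    """Measure compression ratio and compression time for several
    general-purpose compressors on one input, verifying that each
    round-trips losslessly."""
    codecs = {
        "zlib (gzip)": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bz2":         (lambda d: bz2.compress(d, 9), bz2.decompress),
        "lzma (xz)":   (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (comp, decomp) in codecs.items():
        t0 = time.perf_counter()
        blob = comp(data)
        t1 = time.perf_counter()
        assert decomp(blob) == data          # round-trip must be lossless
        results[name] = {
            "ratio": len(data) / len(blob),  # higher is better
            "compress_s": t1 - t0,
        }
    return results
```

Running this on a FASTA file read as bytes gives exactly the kind of ratio/speed trade-off table the SCB database lets users build interactively.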
Affiliation(s)
- Kirill Kryukov (corresponding author): Department of Genomics and Evolutionary Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Mahoko Takahashi Ueda: Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan; current address: Department of Genomic Function and Diversity, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo, Tokyo 113-8510, Japan
- So Nakagawa: Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
- Tadashi Imanishi: Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
9. Morales VS, Houghten S. Lossy Compression of Quality Values in Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2021; 18:1958-1969. PMID: 31869798; DOI: 10.1109/tcbb.2019.2959273.
Abstract
The dropping cost of sequencing human DNA has allowed for fast development of several projects around the world generating huge amounts of DNA sequencing data. This deluge of data has run up against limited storage space, a problem that researchers are trying to solve through compression techniques. In this study we address the compression of SAM files, the standard output files for DNA alignment. We specifically study lossy compression techniques used for quality values reported in the SAM file and analyze the impact of such lossy techniques on the CRAM format. We present a series of experiments using a data set corresponding to individual NA12878 with three different fold coverages. We introduce a new lossy model, dynamic binning, and compare its performance to other lossy techniques, namely Illumina binning, LEON and QVZ. We analyze the compression ratio when using CRAM and also study the impact of the lossy techniques on SNP calling. Our results show that lossy techniques allow a better CRAM compression ratio. Furthermore, we show that SNP calling performance is not negatively affected and may even be boosted.
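Illumina binning, one of the lossy techniques compared above, replaces each Phred quality score with a bin representative so that the quality string uses only 8 distinct symbols. A minimal sketch follows; the bin edges are taken from one commonly published version of Illumina's 8-level table and may differ between instrument generations, and the function names are illustrative.

```python
# One published version of Illumina's 8-level quality binning table:
# (low, high, representative) over the Phred scale.
ILLUMINA_BINS = [
    (0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
    (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40),
]

def bin_quality(q: int) -> int:
    """Map a Phred quality value to its bin representative."""
    for lo, hi, rep in ILLUMINA_BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"quality {q} out of range")

def bin_quality_string(qual: str, offset: int = 33) -> str:
    """Apply binning to a FASTQ quality string (Phred+33 encoding).
    Fewer distinct symbols means better downstream compression."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)
```

The dynamic binning proposed in the paper differs in that it derives the bin boundaries from the data rather than using a fixed table.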
10. Danciu D, Karasikov M, Mustafa H, Kahles A, Rätsch G. Topology-based sparsification of graph annotations. Bioinformatics 2021; 37:i169-i176. PMID: 34252940; PMCID: PMC8346655; DOI: 10.1093/bioinformatics/btab330.
Abstract
Motivation Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing these data are needed more than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. Results In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. Availability and implementation RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
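The core RowDiff idea, storing only the difference between a vertex's annotation and its successor's, can be sketched in a few lines. This toy works on Python sets and a single successor map; the real implementation operates on compressed bit matrices and chooses anchor vertices carefully. Function names are illustrative.

```python
def rowdiff_encode(labels, successor):
    """Toy RowDiff-style sparsification: each node stores only the
    symmetric difference between its label set and its successor's.
    Anchor nodes (no successor) keep their labels verbatim."""
    diffs = {}
    for node, labs in labels.items():
        nxt = successor.get(node)
        if nxt is None:
            diffs[node] = set(labs)          # anchor row, stored as-is
        else:
            diffs[node] = set(labs) ^ set(labels[nxt])
    return diffs

def rowdiff_decode(node, diffs, successor):
    """Recover a node's labels by following successors to an anchor,
    then re-applying the stored differences on the way back."""
    path = []
    while successor.get(node) is not None:
        path.append(node)
        node = successor[node]
    labs = set(diffs[node])                  # anchor labels
    for n in reversed(path):
        labs ^= diffs[n]
    return labs
```

When adjacent vertices carry near-identical annotations, as expected in labeled de Bruijn graphs, most stored difference rows are empty or tiny, which is exactly what makes the downstream compression effective.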
Affiliation(s)
- Daniel Danciu: Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- Mikhail Karasikov: Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
- Harun Mustafa: Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
- André Kahles: Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland
- Gunnar Rätsch: Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; Swiss Institute of Bioinformatics, Zurich, Switzerland; Department of Biology, ETH Zurich, Zurich, Switzerland
11. Ghosh Dasgupta M, Dev SA, Muneera Parveen AB, Sarath P, Sreekumar VB. Draft genome of Korthalsia laciniosa (Griff.) Mart., a climbing rattan, elucidates its phylogenetic position. Genomics 2021; 113:2010-2022. PMID: 33862180; DOI: 10.1016/j.ygeno.2021.04.023.
Abstract
Korthalsia laciniosa (Griff.) Mart. is a climbing rattan used as a source of durable and flexible cane. In the present study, the draft genome of K. laciniosa was sequenced, de novo assembled, and annotated. Genome-wide identification of MADS-box transcription factors revealed loss of the Mβ and Mγ genes belonging to the Type I subclass in the rattan lineage. Mining of the genome revealed the presence of 13 families of lignin biosynthetic pathway genes, and expression profiling of nine major genes documented relatively lower expression in the cirrus compared to the leaflet and petiole. The chloroplast genome was reconstructed, and analysis revealed the phylogenetic relatedness of this genus to Eugeissona, in contrast with its present taxonomic position. The genomic resource generated in the present study will accelerate population structure analysis, genetic resource conservation, and phylogenomics, and will facilitate understanding of unique developmental processes such as gender expression at the molecular level.
Affiliation(s)
- Modhumita Ghosh Dasgupta: Institute of Forest Genetics and Tree Breeding, Forest Campus, R.S. Puram, Coimbatore 641002, India
- Suma Arun Dev: Forest Genetics and Biotechnology Division, Kerala Forest Research Institute, Peechi P.O., Thrissur, Kerala 680653, India
- Abdul Bari Muneera Parveen: Institute of Forest Genetics and Tree Breeding, Forest Campus, R.S. Puram, Coimbatore 641002, India
- Paremmal Sarath: Forest Genetics and Biotechnology Division, Kerala Forest Research Institute, Peechi P.O., Thrissur, Kerala 680653, India; Forest Research Institute Deemed to be University, Dehradun, Uttarakhand, India
- V B Sreekumar: Forest Genetics and Biotechnology Division, Kerala Forest Research Institute, Peechi P.O., Thrissur, Kerala 680653, India
12. Nafees S, Rice SH, Wakeman CA. Analyzing genomic data using tensor-based orthogonal polynomials with application to synthetic RNAs. NAR Genom Bioinform 2020; 2:lqaa101. PMID: 33575645; PMCID: PMC7731874; DOI: 10.1093/nargab/lqaa101.
Abstract
An important goal in molecular biology is to quantify both the patterns across a genomic sequence and the relationship between phenotype and underlying sequence. We propose a multivariate tensor-based orthogonal polynomial approach to characterize nucleotides or amino acids in a given sequence and map corresponding phenotypes onto the sequence space. We have applied this method to a previously published case of small transcription activating RNAs. Covariance patterns along the sequence showed strong correlations between nucleotides at the ends of the sequence. However, when the phenotype is projected onto the sequence space, this pattern does not emerge. In a second-order analysis quantifying the functional relationship between the phenotype and pairs of sites along the sequence, we identified sites with high regressions spread across the sequence, indicating potential intramolecular binding. In addition to quantifying interactions between different parts of a sequence, the method quantifies sequence-phenotype interactions at first and higher orders. We discuss the strengths and constraints of the method and compare it to computational methods such as machine learning approaches. An accompanying command-line tool to compute these polynomials is provided. We show proof of concept of this approach and demonstrate its potential application to other biological systems.
Affiliation(s)
- Saba Nafees: Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX 79409, USA
- Sean H Rice: Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX 79409, USA
- Catherine A Wakeman: Department of Biological Sciences, Texas Tech University, 2901 Main St, Lubbock, TX 79409, USA
13. Yu R, Yang W, Wang S. Performance evaluation of lossy quality compression algorithms for RNA-seq data. BMC Bioinformatics 2020; 21:321. PMID: 32689929; PMCID: PMC7372835; DOI: 10.1186/s12859-020-03658-4.
Abstract
Background Recent advancements in high-throughput sequencing technologies have generated an unprecedented amount of genomic data that must be stored, processed, and transmitted over the network for sharing. Lossy genomic data compression, especially of the base quality values of sequencing data, is emerging as an efficient way to handle this challenge due to its superior compression performance compared to lossless compression methods. Many lossy compression algorithms have been developed for and evaluated using DNA sequencing data. However, whether these algorithms can be used on RNA sequencing (RNA-seq) data remains unclear. Results In this study, we evaluated the impacts of lossy quality value compression on common RNA-seq data analysis pipelines, including expression quantification, transcriptome assembly, and short variant detection, using RNA-seq data from different species and sequencing platforms. Our study shows that lossy quality value compression could effectively improve RNA-seq data compression. In some cases, lossy algorithms achieved up to 1.2-3 times further reduction in overall RNA-seq data size compared to existing lossless algorithms. However, lossy quality value compression could affect the results of some RNA-seq data processing pipelines, and hence its impacts on RNA-seq studies cannot be ignored in some cases. Pipelines using HISAT2 for alignment were most significantly affected by lossy quality value compression, while no effects were observed on pipelines that do not depend on quality values, e.g., STAR-based expression quantification and transcriptome assembly pipelines. Moreover, regardless of whether STAR or HISAT2 was used as the aligner, variant detection results were affected by lossy quality value compression, albeit to a lesser extent when the STAR-based pipeline was used. Our results also show that the impacts of lossy quality value compression depend on the compression algorithm being used and, where the algorithm supports multiple compression levels, on the level chosen. Conclusions Lossy quality value compression can be incorporated into existing RNA-seq analysis pipelines to alleviate the data storage and transmission burdens. However, care should be taken in selecting compression tools and levels, based on the requirements of the downstream analysis pipelines, to avoid introducing undesirable adverse effects on the analysis results.
14. Yu R, Yang W. ScaleQC: a scalable lossy to lossless solution for NGS data compression. Bioinformatics 2020; 36:4551-4559. DOI: 10.1093/bioinformatics/btaa543.
Abstract
Motivation
Per-base quality values in Next Generation Sequencing data take a significant portion of storage even after compression. Lossy compression technologies could further reduce the space used by quality values. However, in many applications, lossless compression is still desired. Hence, sequencing data in multiple file formats have to be prepared for different applications.
Results
We developed a scalable lossy to lossless compression solution for quality values named ScaleQC (Scalable Quality value Compression). ScaleQC is able to provide the so-called bit-stream level scalability that the losslessly compressed bit-stream by ScaleQC can be further truncated to lower data rates without incurring an expensive transcoding operation. Despite its scalability, ScaleQC still achieves comparable compression performance at both lossless and lossy data rates compared to the existing lossless or lossy compressors.
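Bit-stream level scalability can be illustrated with naive bit-plane coding. This is a sketch of the scalability property only, not ScaleQC's codec: the encoder emits quality values one bit plane at a time, most significant first, so truncating the stream after any plane yields a coarser reconstruction without any transcoding. Function names are illustrative.

```python
def encode_bitplanes(values, nbits=6):
    """Emit integer quality values as bit planes, most significant
    plane first, so the output can be truncated at any plane."""
    planes = []
    for b in range(nbits - 1, -1, -1):
        planes.append([(v >> b) & 1 for v in values])
    return planes

def decode_bitplanes(planes, nbits=6):
    """Rebuild values from however many planes survived truncation;
    missing low-order bits are simply left as zeros."""
    n = len(planes[0])
    vals = [0] * n
    for i, plane in enumerate(planes):
        b = nbits - 1 - i
        for j, bit in enumerate(plane):
            vals[j] |= bit << b
    return vals
```

Decoding all planes is lossless; decoding a prefix of the planes gives a lower-rate lossy approximation, which is the "truncate without transcoding" behavior the abstract describes.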
Availability and implementation
ScaleQC has been integrated with SAMtools as a special quality value encoding mode for CRAM. Its source codes can be obtained from our integrated SAMtools (https://github.com/xmuyulab/samtools) with dependency on integrated HTSlib (https://github.com/xmuyulab/htslib).
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rongshan Yu: Digital Fujian Institute of Healthcare and Biomedical Big Data, School of Informatics, Xiamen University, Xiamen 316005, China; Aginome Scientific, Xiamen 316005, China
15. Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020; 21:249. PMID: 32943081; PMCID: PMC7499882; DOI: 10.1186/s13059-020-02135-8.
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in. Availability: https://github.com/pmelsted/bifrost.
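The coloring described in the last sentence amounts to a map from each k-mer to the set of input genomes containing it. A minimal sketch of that idea and of color-based querying follows (ignoring reverse complements and Bifrost's compacted, succinct representation; function names are illustrative):

```python
from collections import defaultdict

def color_kmers(genomes, k=4):
    """Toy graph coloring: map every k-mer to the set of input
    genomes (colors) it occurs in."""
    colors = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return colors

def query(colors, pattern, k=4):
    """Return the genomes containing every k-mer of the query."""
    kmer_sets = [colors.get(pattern[i:i + k], set())
                 for i in range(len(pattern) - k + 1)]
    return set.intersection(*kmer_sets) if kmer_sets else set()
```

A query sequence is reported for a genome only when all of its k-mers carry that genome's color, which is the basic primitive behind colored-graph sequence queries.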
Affiliation(s)
- Guillaume Holley: Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
- Páll Melsted: Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
16. Kowalski TM, Grabowski S. PgRC: pseudogenome-based read compressor. Bioinformatics 2020; 36:2082-2089. PMID: 31893286; DOI: 10.1093/bioinformatics/btz919.
Abstract
MOTIVATION The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. RESULTS We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression. AVAILABILITY AND IMPLEMENTATION PgRC can be downloaded from https://github.com/kowallus/PgRC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
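The pseudogenome idea, an approximation of the shortest common superstring over the reads, can be illustrated with the classic greedy merge. This is a toy quadratic sketch; PgRC's actual construction is engineered for billions of reads and tolerates imperfect overlaps. Function names are illustrative.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_scs(reads):
    """Greedy shortest-common-superstring approximation: repeatedly
    merge the pair of reads with the largest overlap until one
    'pseudogenome' string remains."""
    reads = list(dict.fromkeys(reads))       # dedupe, keep order
    while len(reads) > 1:
        best = (-1, 0, 1)                    # (overlap, i, j)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    o = overlap(a, b)
                    if o > best[0]:
                        best = (o, i, j)
        o, i, j = best
        merged = reads[i] + reads[j][o:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads[0]
```

Each read then needs only a position (and possibly a few mismatches) relative to the pseudogenome, which is far cheaper to store than the read itself.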
Affiliation(s)
- Tomasz M Kowalski
- Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
- Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Lodz 90-924, Poland
17
Langa J, Estonba A, Conklin D. EXFI: Exon and splice graph prediction without a reference genome. Ecol Evol 2020; 10:8880-8893. [PMID: 32884664 PMCID: PMC7452765 DOI: 10.1002/ece3.6587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2020] [Revised: 06/03/2020] [Accepted: 06/08/2020] [Indexed: 11/19/2022] Open
Abstract
For population genetic studies in nonmodel organisms, it is important to use every available source of genomic information. This paper presents EXFI, a Python pipeline that predicts the splice graph and exon sequences using an assembled transcriptome and raw whole-genome sequencing reads. The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, predicts the intron-exon boundaries, calls exons from the assembly, and generates the underlying splice graph. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts. EXFI is written in Python, tested on Linux platforms, and the source code is available under the MIT License at https://github.com/jlanga/exfi.
Affiliation(s)
- Jorge Langa
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country, Leioa, Spain
- Andone Estonba
- Department of Genetics, Physical Anthropology and Animal Physiology, Faculty of Science and Technology, University of the Basque Country, Leioa, Spain
- Darrell Conklin
- Department of Computer Science and Artificial Intelligence, Faculty of Computer Science, University of the Basque Country UPV/EHU, San Sebastián, Spain
- IKERBASQUE, Basque Foundation for Science, Bilbao, Spain
18
Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2020; 35:2066-2074. [PMID: 30407482 DOI: 10.1093/bioinformatics/bty936] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2018] [Revised: 11/04/2018] [Accepted: 11/07/2018] [Indexed: 01/23/2023] Open
Abstract
MOTIVATION Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to contract the size of these datasets for efficient storage and transmission. Reordering reads with regard to their positions in de novo assembled contigs or in explicit reference sequences has been proven to be one of the most effective read compression approaches. As there is usually no good prior knowledge about the reference sequence, current focus is on the novel construction of de novo assembled contigs. RESULTS We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and subgroup those that have the same minimizer. Within each subgroup, a contig is constructed. Then some pairs of the contigs derived from the subgroups are merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity between two contigs. This merging process is repeated after the longer contigs are formed until no pair of contigs can be merged. We compare the performance of minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq datasets and 5 whole genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size for 22 of 34 cases with significant improvement. In the compression of paired-end reads, minicom achieved 20-80% compression gain over the best state-of-the-art algorithm. Our method also achieved a 10% size reduction of compressed files in comparison with the best algorithm under the reads-order preserving mode. This excellent performance is mainly attributed to the exploitation of the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
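The (w, k)-minimizer indexing that minicom builds on can be illustrated in a few lines: a (w, k)-minimizer is the smallest k-mer in a window of w consecutive k-mers, and reads sharing a minimizer are bucketed together. This is a generic sketch of the idea, not minicom's implementation:

```python
from collections import defaultdict

def minimizers(seq, w, k):
    """Set of (w,k)-minimizers of seq: the lexicographically smallest k-mer
    in every window of w consecutive k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}

def bucket_reads(reads, k):
    """Group reads by their smallest k-mer: a simple bucketing key that
    places overlapping or repeated reads in the same subgroup."""
    buckets = defaultdict(list)
    for r in reads:
        buckets[min(r[i:i + k] for i in range(len(r) - k + 1))].append(r)
    return dict(buckets)
```

Because overlapping reads tend to share their minimum k-mer, each bucket is a good candidate for building one contig.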
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China; School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia; St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
19
Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences. Bioinformatics 2020; 35:3826-3828. [PMID: 30799504 PMCID: PMC6761962 DOI: 10.1093/bioinformatics/btz144] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2018] [Revised: 02/13/2019] [Accepted: 02/22/2019] [Indexed: 11/13/2022] Open
Abstract
Summary DNA sequence databases use compression such as gzip to reduce the required storage space and network transmission time. We describe Nucleotide Archival Format (NAF), a new file format for lossless reference-free compression of FASTA- and FASTQ-formatted nucleotide sequences. NAF's compression ratio is comparable to that of the best DNA compressors, while providing dramatically faster decompression. We compared our format with the DNA compressors DELIMINATE and MFCompress, and with the general-purpose compressors gzip, bzip2, xz, brotli and zstd. Availability and implementation The NAF compressor and decompressor, as well as the format specification, are available at https://github.com/KirillKryukov/naf. The format specification is in the public domain. The compressor and decompressor are open source under the zlib/libpng license, free for nearly any use. Supplementary information Supplementary data are available at Bioinformatics online.
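The core idea of a specialized nucleotide compressor, packing bases at a few bits each and then handing the packed stream to a strong general-purpose backend, can be sketched as follows. This is a minimal 2-bit ACGT packer with zlib as a stand-in backend; NAF itself uses a 4-bit IUPAC-aware encoding with zstd:

```python
import zlib

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"

def pack(seq):
    """2-bit pack an ACGT-only sequence, then compress the packed bytes
    with zlib (a stand-in for NAF's zstd backend)."""
    bits = 0
    for b in seq:
        bits = (bits << 2) | _CODE[b]
    raw = bits.to_bytes((2 * len(seq) + 7) // 8, "big")
    return len(seq), zlib.compress(raw)

def unpack(n, blob):
    """Invert pack(): decompress, then read bases back two bits at a time."""
    bits = int.from_bytes(zlib.decompress(blob), "big")
    return "".join(_BASE[(bits >> (2 * (n - 1 - i))) & 3] for i in range(n))
```

Packing before general-purpose compression already quarters the input, and the backend then exploits whatever repeat structure survives in the packed bytes.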
Affiliation(s)
- Kirill Kryukov
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Japan
- So Nakagawa
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Japan; Micro/Nano Technology Center, Tokai University, Hiratsuka, Japan
| | - Tadashi Imanishi
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Japan
20
Shibuya Y, Comin M. Indexing k-mers in linear space for quality value compression. J Bioinform Comput Biol 2019; 17:1940011. [DOI: 10.1142/s0219720019400110] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Many bioinformatics tools heavily rely on k-mer dictionaries to describe the composition of sequences and allow for faster reference-free algorithms or look-ups. Unfortunately, naive k-mer dictionaries are very memory-inefficient, requiring large amounts of storage space to save each k-mer. This problem is generally worsened by the necessity of an index for fast queries. In this work, we discuss how to build an indexed linear reference containing a set of input k-mers, and its application to the compression of quality scores in FASTQ files. Most of the entropy of sequencing data lies in the quality scores, which are thus difficult to compress. Here, we present an application to improve the compressibility of quality values while preserving the information for SNP calling. We show how a dictionary of significant k-mers, obtained from SNP databases and multiple genomes, can be indexed in linear space and used to improve the compression of quality values. Availability: The software is freely available at https://github.com/yhhshb/yalff.
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padua, via Gradenigo 6B, Padua, Italy
21
Shibuya Y, Comin M. Better quality score compression through sequence-based quality smoothing. BMC Bioinformatics 2019; 20:302. [PMID: 31757199 PMCID: PMC6873394 DOI: 10.1186/s12859-019-2883-5] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2019] [Accepted: 05/07/2019] [Indexed: 11/10/2022] Open
Abstract
MOTIVATION Current NGS techniques are becoming exponentially cheaper. As a result, there is an exponential growth of genomic data that is unfortunately not matched by an exponential growth of storage, leading to the necessity of compression. Most of the entropy of NGS data lies in the quality values associated with each read. Those values are often more diversified than necessary. Because of that, many tools, such as Quartz or GeneCodeq, try to change (smooth) quality scores in order to improve compressibility without altering the important information they carry for downstream analyses like SNP calling. RESULTS We use the FM-index, a type of compressed suffix array, to reduce the storage requirements of a dictionary of k-mers, and an effective smoothing algorithm to maintain high precision for SNP calling pipelines while reducing quality score entropy. We present YALFF (Yet Another Lossy Fastq Filter), a tool for quality score compression by smoothing, leading to improved compressibility of FASTQ files. The succinct k-mer dictionary allows YALFF to run on consumer computers with only 5.7 GB of free RAM. YALFF's smoothing algorithm can improve genotyping accuracy while using fewer resources. AVAILABILITY https://github.com/yhhshb/yalff.
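The smoothing idea, flattening the quality of every base covered by a dictionary k-mer while leaving the rest untouched, can be sketched as follows. Here `trusted` is a hypothetical Python set of known k-mers standing in for YALFF's succinct FM-index dictionary:

```python
def smooth_qualities(seq, qual, trusted, k, high="I"):
    """Replace the quality of every base covered by at least one trusted
    k-mer with a single high value; uncovered bases keep their original
    scores, so potential variant positions stay informative."""
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in trusted:
            for j in range(i, i + k):
                covered[j] = True
    return "".join(high if c else q for c, q in zip(covered, qual))
```

After smoothing, long runs of the same quality character compress extremely well, which is where the FASTQ size reduction comes from.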
Affiliation(s)
- Yoshihiro Shibuya
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
- Laboratoire d’Informatique Gaspard-Monge (LIGM), University Paris-Est Marne-la-Vallée, Bâtiment Copernic - 5, bd Descartes, Champs sur Marne, France
- Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
22
Mustafa H, Schilken I, Karasikov M, Eickhoff C, Rätsch G, Kahles A. Dynamic compression schemes for graph coloring. Bioinformatics 2019; 35:407-414. [PMID: 30020403 PMCID: PMC6530811 DOI: 10.1093/bioinformatics/bty632] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2018] [Accepted: 07/16/2018] [Indexed: 11/30/2022] Open
Abstract
Motivation Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information Supplementary data are available at Bioinformatics online.
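The lossy Bloom-filter coloring can be illustrated with one tiny filter per color: a membership query returns a color set that may contain false positives but never drops a true color. A minimal sketch under those assumptions, not the paper's implementation:

```python
import hashlib

class Bloom:
    """Tiny Bloom filter backed by one integer bitset and h hash functions
    derived from blake2b."""
    def __init__(self, m=256, h=3):
        self.m, self.h, self.bits = m, h, 0

    def _idx(self, item):
        for i in range(self.h):
            d = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(d, "big") % self.m

    def add(self, item):
        for i in self._idx(item):
            self.bits |= 1 << i

    def __contains__(self, item):
        return all((self.bits >> i) & 1 for i in self._idx(item))

def colors(kmer, filters):
    """Color set of a k-mer: one Bloom filter per sample/color. False
    positives may add spurious colors, but true colors are never lost."""
    return {name for name, bf in filters.items() if kmer in bf}
```

Storing each color as an independent filter is what makes the representation modular: colors can be added, merged, or queried without touching the graph topology.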
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Ingo Schilken
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Carsten Eickhoff
- Brown Center for Biomedical Informatics, Brown University, Providence, RI, USA
- Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- André Kahles
- Department of Computer Science, ETH Zurich, Zurich, Switzerland; Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
23
Bonfield JK, McCarthy SA, Durbin R. Crumble: reference free lossy compression of sequence quality values. Bioinformatics 2019; 35:337-339. [PMID: 29992288 PMCID: PMC6330002 DOI: 10.1093/bioinformatics/bty608] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2018] [Accepted: 07/09/2018] [Indexed: 02/01/2023] Open
Abstract
Motivation The bulk of space taken up by NGS sequencing CRAM files consists of per-base quality values. Most of these are unnecessary for variant calling, offering an opportunity for space saving. Results On the Syndip test set, a 17-fold reduction in the quality storage portion of a CRAM file can be achieved while maintaining variant calling accuracy. The size reduction of an entire CRAM file varied from 2.2- to 7.4-fold, depending on the non-quality content of the original file (see Supplementary Material S6 for details). Availability and implementation Crumble is open source and can be obtained from https://github.com/jkbonfield/crumble. Supplementary information Supplementary data are available at Bioinformatics online.
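Crumble's premise, that qualities matter only where the evidence is ambiguous, can be caricatured per pileup column: flatten qualities where all reads agree, preserve them where they disagree. A toy sketch (Crumble's actual consensus heuristics are considerably richer):

```python
def crumble_like(column_bases, column_quals, high=40):
    """Per pileup column: if every read agrees on the base, the individual
    qualities carry little information for variant calling, so flatten
    them to one value; if reads disagree, keep the original scores."""
    if len(set(column_bases)) == 1:
        return [high] * len(column_quals)
    return list(column_quals)
```

Flattened columns then encode as long constant runs, which is where the quality-storage reduction comes from.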
Affiliation(s)
- James K Bonfield
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
- Shane A McCarthy
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK; Department of Genetics, University of Cambridge, Cambridge, UK
- Richard Durbin
- DNA Pipelines, Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK; Department of Genetics, University of Cambridge, Cambridge, UK
24
Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2019; 34:2748-2756. [PMID: 29617939 DOI: 10.1093/bioinformatics/bty205] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2017] [Accepted: 03/27/2018] [Indexed: 12/29/2022] Open
Abstract
Motivation The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Lukasz Roguski
- Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain; Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
25
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, USA
26
El Allali A, Arshad M. MZPAQ: a FASTQ data compression tool. SOURCE CODE FOR BIOLOGY AND MEDICINE 2019; 14:3. [PMID: 31171931 PMCID: PMC6547476 DOI: 10.1186/s13029-019-0073-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/16/2017] [Accepted: 05/23/2019] [Indexed: 11/10/2022]
Abstract
Background Due to technological progress in Next Generation Sequencing (NGS), the amount of genomic data produced daily has seen a tremendous increase. This increase has shifted the bottleneck of genomic projects from sequencing to computation, specifically to storing, managing and analyzing large amounts of NGS data. Compression tools can reduce the physical storage needed to save large amounts of genomic data as well as the bandwidth needed to transfer them. Recently, DNA sequence compression has gained much attention among researchers. Results In this paper, we study different techniques and algorithms used to compress genomic data. Most of these techniques take advantage of properties unique to DNA sequences in order to improve the compression rate, and usually perform better than general-purpose compressors. By exploring the performance of available algorithms, we produce a powerful compression tool for NGS data called MZPAQ. Results show that MZPAQ outperforms state-of-the-art tools on all benchmark datasets obtained from a recent survey in terms of compression ratio. MZPAQ offers the best compression ratios regardless of the sequencing platform or the size of the data. Conclusions Currently, MZPAQ's strength is its higher compression ratio as well as its compatibility with all major sequencing platforms. MZPAQ is most suitable when the size of the compressed data is crucial, such as for long-term storage and data transfer. More efforts will be made in the future to target other aspects such as compression speed and memory utilization.
Affiliation(s)
- Achraf El Allali
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Mariam Arshad
- Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
27
Abstract
Background: Biological sequence data have increased at a rapid rate due to advancements in sequencing technologies and the falling cost of sequencing. This growth presents significant challenges: besides meaningful analysis, storage is a problem, as the increase in data production is outpacing storage capacity. Data compression reduces the size of the data, lowering both storage requirements and transmission cost over the internet.
Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format.
Method: The proposed algorithm uses bit manipulation and dictionary-based compression for the bases. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding.
Results: The proposed algorithm is validated with experimental results on real datasets. The results are compared with both general-purpose and specialized compression programs.
Conclusion: The proposed algorithm achieves a better compression ratio in a time comparable to the other algorithms.
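Huffman coding of quality scores, the one concretely named coding step, can be sketched with a generic Huffman code builder (our own illustration, not FCompress's code):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the symbols of `text`; frequent symbols
    get shorter codewords. Returns {symbol: bitstring}."""
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate one-symbol input
        return {s: "0" for s in heap[0][2]}
    tick = len(heap)                   # tie-breaker so dicts never compare
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]
```

Because quality strings are dominated by a few frequent scores, a Huffman code assigns them one- or two-bit codewords and shrinks the stream well below the 8 bits per character of raw FASTQ.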
Affiliation(s)
- Muhammad Sardaraz
- Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
- Muhammad Tahir
- Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
28
Chandak S, Tatwawadi K, Weissman T. Compression of genomic sequencing reads via hash-based reordering: algorithm and analysis. Bioinformatics 2018; 34:558-567. [PMID: 29444237 DOI: 10.1093/bioinformatics/btx639] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 10/06/2017] [Indexed: 12/30/2022] Open
Abstract
Motivation Next-Generation Sequencing (NGS) technologies for genome sequencing produce large amounts of short genomic reads per experiment, which are highly redundant and compressible. However, general-purpose compressors are unable to exploit this redundancy due to the special structure present in the data. Results We present a new algorithm for compressing reads both with and without preserving the read order. In both cases, it achieves 1.4×-2× compression gain over state-of-the-art read compression tools for datasets containing as many as 3 billion Illumina reads. Our tool is based on the idea of approximately reordering the reads according to their position in the genome using hashed substring indices. We also present a systematic analysis of the read compression problem and compute bounds on fundamental limits of read compression. This analysis sheds light on the dynamics of the proposed algorithm (and read compression algorithms in general) and helps explain its performance in practice. The algorithm compresses only the read sequence, works with unaligned FASTQ files, and does not require a reference. Contact schandak@stanford.edu. Supplementary information Supplementary materials are available at Bioinformatics online. The proposed algorithm is available for download at https://github.com/shubhamchandak94/HARC.
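The reordering idea, placing reads that share hashed substrings next to each other so a downstream compressor sees long runs of near-identical strings, can be sketched greedily. This is illustrative only; HARC's actual index and chaining are more involved:

```python
from collections import defaultdict

def reorder_reads(reads, k):
    """Greedy approximation of hash-based reordering: chain together reads
    that share a k-mer, so overlapping reads end up adjacent and a
    downstream general-purpose compressor sees similar strings in a row."""
    index = defaultdict(set)           # k-mer -> ids of reads containing it
    for rid, r in enumerate(reads):
        for i in range(len(r) - k + 1):
            index[r[i:i + k]].add(rid)
    used, order = set(), []
    for start in range(len(reads)):
        if start in used:
            continue
        cur = start
        while cur is not None:
            used.add(cur)
            order.append(cur)
            nxt = None
            r = reads[cur]
            for i in range(len(r) - k + 1):
                cand = index[r[i:i + k]] - used
                if cand:               # follow any unused read sharing a k-mer
                    nxt = min(cand)
                    break
            cur = nxt
    return [reads[i] for i in order]
```

After reordering, consecutive reads differ by small shifts and substitutions, which delta or context-mixing coders exploit far better than the original random order.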
Affiliation(s)
- Shubham Chandak
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Kedar Tatwawadi
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
29
Wang R, Li J, Bai Y, Zang T, Wang Y. BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs. PeerJ 2018; 6:e5611. [PMID: 30364599 PMCID: PMC6197042 DOI: 10.7717/peerj.5611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2018] [Accepted: 09/13/2018] [Indexed: 02/01/2023] Open
Abstract
Dramatic increases in the data produced by next-generation sequencing (NGS) technologies demand data compression tools for saving storage space. However, effective and efficient data compression for genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, which is the first to compress genome sequencing data with dynamic de Bruijn graphs after bucketing the data. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, which effectively reduces storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG.
Affiliation(s)
- Rongjie Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
- Yang Bai
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Tianyi Zang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang, China
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, HeiLongJiang, China
30
Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. [PMID: 29554215 PMCID: PMC6061703 DOI: 10.1093/bioinformatics/bty157] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2017] [Revised: 11/25/2017] [Accepted: 03/14/2018] [Indexed: 12/27/2022] Open
Abstract
Motivation The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Isaac Turner
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Kiran V Garimella
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
- Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
- Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
31
Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. J Comput Biol 2018; 25:825-836. [PMID: 30011247 DOI: 10.1089/cmb.2018.0068] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
The advent of high throughput sequencing (HTS) technologies raises a major concern about storage and transmission of data produced by these technologies. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present dynamic alignment-free and reference-free read compression (DARRC), a new alignment-free and reference-free compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa data set, our method outperforms all other tested tools. It provides a 30% compression ratio improvement in single-end mode compared with the best performing state-of-the-art HTS-specific compression method in our experiments.
Collapse
Affiliation(s)
- Guillaume Holley
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
| | - Roland Wittler
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany; International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Faraz Hach
- School of Computing Science, Simon Fraser University, Burnaby, Canada; Department of Urologic Sciences, University of British Columbia, Vancouver, Canada; Vancouver Prostate Centre, Vancouver, Canada
| |
Collapse
|
32
|
Abstract
Codon usage depends on mutation bias, tRNA-mediated selection, and the need for high efficiency and accuracy in translation. One codon in a synonymous codon family is often strongly over-used, especially in highly expressed genes, which often leads to a high dN/dS ratio because dS is very small. Many different codon usage indices have been proposed to measure codon usage and codon adaptation. Sense codons can be misread by release factors and stop codons by tRNAs, which also contributes to codon usage in rare cases. This chapter outlines the conceptual framework of codon evolution, illustrates codon-specific and gene-specific codon usage indices, and presents their applications. A new index for codon adaptation that accounts for background mutation bias (Index of Translation Elongation) is presented and contrasted with the codon adaptation index (CAI), which does not consider background mutation bias. They are used to re-analyze data from a recent paper claiming that translation elongation efficiency matters little in protein production. The reanalysis disproves the claim.
Collapse
|
33
|
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes. ENTROPY 2018; 20:e20060393. [PMID: 33265483 PMCID: PMC7512912 DOI: 10.3390/e20060393] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2018] [Revised: 05/16/2018] [Accepted: 05/21/2018] [Indexed: 11/26/2022]
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between, and across DNA sequences, regardless of the characteristics of the sources. In this paper, we directly compare two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions: the NCD measures how similar two strings are (in terms of information content), while the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state-of-the-art DNA sequence compressor that we benchmark against some top compressors in different compression modes. Then, we apply the compressor to DNA sequences of different scales and natures, first on synthetic sequences and then on real DNA sequences. The latter include mitochondrial DNA (mtDNA), messenger RNA (mRNA), and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely the observation and confirmation, across whole genomes, of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.
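The NCD can be illustrated concretely with a few lines of Python, using general-purpose zlib as a stand-in for the specialized DNA compressor the paper uses (so absolute values are only indicative):

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for a DNA-specific compressor."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for similar inputs,
    near 1 for informationally unrelated ones."""
    cx, cy = c(x), c(y)
    return (c(x + y) - min(cx, cy)) / max(cx, cy)

random.seed(0)
repetitive = b"ACGT" * 500                                      # highly redundant
unrelated = bytes(random.choice(b"ACGT") for _ in range(2000))  # pseudo-random
```

Because zlib's window size and header overhead differ from a DNA-specific model, the numbers shift with the compressor, but the ordering `ncd(x, x) < ncd(x, unrelated)` is preserved.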
Collapse
|
34
|
Optimal compressed representation of high throughput sequence data via light assembly. Nat Commun 2018; 9:566. [PMID: 29422526 PMCID: PMC5805770 DOI: 10.1038/s41467-017-02480-6] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2017] [Accepted: 12/05/2017] [Indexed: 12/21/2022] Open
Abstract
The most effective genomic data compression methods either assemble reads into contigs, or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method closely approximates. Our method significantly improves the compression performance of alternatives without compromising speed. The increase in high-throughput sequencing (HTS) data warrants compression methods to facilitate better storage and communication. Here, Ginart et al. introduce Assembltrie, a reference-free compression tool which is guaranteed to achieve optimality for error-free reads.
Collapse
|
35
|
ARSDA: A New Approach for Storing, Transmitting and Analyzing Transcriptomic Data. G3-GENES GENOMES GENETICS 2017; 7:3839-3848. [PMID: 29079682 PMCID: PMC5714481 DOI: 10.1534/g3.117.300271] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Two major stumbling blocks exist in high-throughput sequencing (HTS) data analysis. The first is the sheer file size, typically in gigabytes when uncompressed, causing problems in storage, transmission, and analysis. However, these files do not need to be so large, and can be reduced without loss of information. Each HTS file, either in compressed .SRA or plain-text .fastq format, contains numerous identical reads stored as separate entries. For example, among the 44,603,541 forward reads in the SRR4011234.sra file (from a Bacillus subtilis transcriptomic study) deposited at NCBI's SRA database, one read has 497,027 identical copies. Instead of storing them as separate entries, one can and should store them as a single entry with the SeqID_NumCopy format (which I dub the FASTA+ format). The second is the proper allocation of reads that map equally well to paralogous genes. I illustrate in detail a new method for such allocation. I have developed the ARSDA software that implements these new approaches. A number of HTS files for model species are being processed and deposited at http://coevol.rdc.uottawa.ca to demonstrate that this approach not only saves a huge amount of storage space and transmission bandwidth, but also dramatically reduces time in downstream data analysis. Instead of matching the 497,027 identical reads separately against the B. subtilis genome, one needs to match the read only once. ARSDA includes functions to take advantage of HTS data in the new sequence format for downstream data analysis such as gene expression characterization. I contrasted gene expression results between ARSDA and Cufflinks so readers can better appreciate the strength of ARSDA. ARSDA is freely available for Windows, Linux, and Macintosh computers at http://dambe.bio.uottawa.ca/ARSDA/ARSDA.aspx.
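The FASTA+ idea, storing each distinct read once with its copy number in the header, can be sketched in a few lines. The `ReadN` header naming below is illustrative; the abstract only specifies the SeqID_NumCopy convention:

```python
from collections import Counter

def to_fasta_plus(reads):
    """Collapse identical reads into single FASTA entries whose headers
    carry the copy number (SeqID_NumCopy), most abundant first."""
    counts = Counter(reads)
    lines = []
    for i, (seq, n) in enumerate(sorted(counts.items(), key=lambda kv: -kv[1]), 1):
        lines.append(f">Read{i}_{n}")   # hypothetical SeqID; "_n" is the copy count
        lines.append(seq)
    return "\n".join(lines)

reads = ["ACGT", "ACGT", "ACGT", "TTGA", "ACGT", "TTGA"]
print(to_fasta_plus(reads))
```

A downstream aligner can then map each distinct sequence once and multiply the result by its copy count, which is where the claimed savings in analysis time come from.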
Collapse
|
36
|
Ochoa I, Hernaez M, Goldfeder R, Weissman T, Ashley E. Effect of lossy compression of quality scores on variant calling. Brief Bioinform 2017; 18:183-194. [PMID: 26966283 DOI: 10.1093/bib/bbw011] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2015] [Indexed: 12/30/2022] Open
Abstract
Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms to identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold-standard genomic datasets and simulated data, we analyze how accurate the variant calling output is, both for the original data and for the lossily compressed data. We show that lossy compression can significantly alleviate storage requirements while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
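One widely used lossy strategy of this kind is coarse binning of Phred quality scores. The sketch below uses Illumina-style 8-bin boundaries for illustration; the exact bin edges, and the specific compressors the paper evaluates, differ:

```python
# Each (lo, hi, rep) triple maps a Phred score range to a representative
# value; boundaries are illustrative, not an exact published scheme.
BINS = [(0, 1, 0), (2, 9, 6), (10, 19, 15), (20, 24, 22),
        (25, 29, 27), (30, 34, 33), (35, 39, 37), (40, 93, 40)]

def bin_quality(q: int) -> int:
    """Quantize a Phred score to its bin's representative value."""
    for lo, hi, rep in BINS:
        if lo <= q <= hi:
            return rep
    raise ValueError(f"Phred score out of range: {q}")

def bin_line(qual: str) -> str:
    """Rewrite a FASTQ quality string (Phred+33 encoding) with binned scores."""
    return "".join(chr(bin_quality(ord(ch) - 33) + 33) for ch in qual)
```

Reducing the quality alphabet from ~40 symbols to 8 lowers the entropy of the quality stream, which is why a general entropy coder then compresses it far better.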
Collapse
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Rachel Goldfeder
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA
| | - Euan Ashley
- Department of Medicine, Stanford University, Stanford, CA, USA; Stanford Center for Inherited Cardiovascular Disease, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA
| |
Collapse
|
37
|
Sarkar H, Patro R. Quark enables semi-reference-based compression of RNA-seq data. Bioinformatics 2017; 33:3380-3386. [DOI: 10.1093/bioinformatics/btx428] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2016] [Accepted: 06/29/2017] [Indexed: 12/19/2022] Open
Affiliation(s)
- Hirak Sarkar
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY, USA
| |
Collapse
|
38
|
Pellow D, Filippova D, Kingsford C. Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters. J Comput Biol 2017; 24:547-557. [PMID: 27828710 PMCID: PMC5467106 DOI: 10.1089/cmb.2016.0155] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30× with little or no additional memory and with set containment queries that are only 1.3-1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
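The core trick can be sketched as a simplified one-sided kBF query: a k-mer is accepted only if some overlapping neighbor k-mer is also in the filter, which discards most isolated false positives. The real kBFs additionally handle read-edge k-mers specially, and the hash and sizing choices below are illustrative:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter with n independent hash positions per item."""
    def __init__(self, m_bits: int = 1 << 16, n_hashes: int = 4):
        self.m, self.n = m_bits, n_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        for i in range(self.n):
            h = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

def kbf_contains(bf: Bloom, kmer: str) -> bool:
    """One-sided kBF query: require that some overlapping neighbor k-mer
    is also present, since true k-mers come from contiguous reads."""
    if kmer not in bf:
        return False
    left = any(b + kmer[:-1] in bf for b in "ACGT")
    right = any(kmer[1:] + b in bf for b in "ACGT")
    return left or right

read, k = "ACGTTGCATGGA", 5
bf = Bloom()
for i in range(len(read) - k + 1):
    bf.add(read[i:i + k])
```

A random false positive rarely has a neighbor that is also a (false or true) positive, so the extra check multiplies the FPR by roughly the per-query FPR itself, at the cost of a handful of additional lookups.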
Collapse
Affiliation(s)
- David Pellow
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | | | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| |
Collapse
|
39
|
Huang ZA, Wen Z, Deng Q, Chu Y, Sun Y, Zhu Z. LW-FQZip 2: a parallelized reference-based compression of FASTQ files. BMC Bioinformatics 2017; 18:179. [PMID: 28320326 PMCID: PMC5359991 DOI: 10.1186/s12859-017-1588-x] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/07/2016] [Accepted: 03/09/2017] [Indexed: 01/06/2023] Open
Abstract
BACKGROUND The rapid progress of high-throughput DNA sequencing techniques has dramatically reduced the cost of whole-genome sequencing, leading to revolutionary advances in the gene industry. The explosively increasing volume of raw data outpaces the decreasing disk cost, and the storage of huge sequencing datasets has become a bottleneck of downstream analyses. Data compression is considered a solution to reduce the dependency on storage. Efficient sequencing data compression methods are in high demand. RESULTS In this article, we present a lossless reference-based compression method, namely LW-FQZip 2, targeted at FASTQ files. LW-FQZip 2 improves on LW-FQZip 1 by introducing a more efficient coding scheme and parallelism. In particular, LW-FQZip 2 is equipped with a light-weight mapping model, a bitwise prediction-by-partial-matching model, arithmetic coding, and multi-threaded parallelism. LW-FQZip 2 is evaluated on both short-read and long-read data generated from various sequencing platforms. The experimental results show that LW-FQZip 2 obtains promising compression ratios at reasonable time and memory costs. CONCLUSIONS These capabilities enable LW-FQZip 2 to serve as a candidate tool for archival or space-sensitive applications of high-throughput DNA sequencing data. LW-FQZip 2 is freely available at http://csse.szu.edu.cn/staff/zhuzx/LWFQZip2 and https://github.com/Zhuzxlab/LW-FQZip2.
Collapse
Affiliation(s)
- Zhi-An Huang
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Zhenkun Wen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Qingjin Deng
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Ying Chu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| | - Yiwen Sun
- School of Medicine, Shenzhen University, Shenzhen, 518060 China
| | - Zexuan Zhu
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060 China
| |
Collapse
|
40
|
Milan T, Wilhelm BT. Mining Cancer Transcriptomes: Bioinformatic Tools and the Remaining Challenges. Mol Diagn Ther 2017; 21:249-258. [DOI: 10.1007/s40291-017-0264-1] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
|
41
|
Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. LECTURE NOTES IN COMPUTER SCIENCE 2017. [DOI: 10.1007/978-3-319-56970-3_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
|
42
|
Comparison of high-throughput sequencing data compression tools. Nat Methods 2016; 13:1005-1008. [PMID: 27776113 DOI: 10.1038/nmeth.4037] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022]
Abstract
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
Collapse
|
43
|
|
44
|
Greenfield DL, Stegle O, Rrustemi A. GeneCodeq: quality score compression and improved genotyping using a Bayesian framework. Bioinformatics 2016; 32:3124-3132. [DOI: 10.1093/bioinformatics/btw385] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2015] [Accepted: 06/15/2016] [Indexed: 12/30/2022] Open
|