1. Roy S, Mukhopadhyay A. A randomized optimal k-mer indexing approach for efficient parallel genome sequence compression. Gene 2024;907:148235. [PMID: 38342250] [DOI: 10.1016/j.gene.2024.148235]
Abstract
Next Generation Sequencing (NGS) technology generates massive amounts of genome sequence data, and the volume grows rapidly over time. As a result, there is a growing need for efficient compression algorithms to facilitate the processing, storage, transmission, and analysis of large-scale genome sequences. Over the past 31 years, numerous state-of-the-art compression algorithms have been developed. The performance of a compression algorithm is measured by three main metrics: compression ratio, time, and memory usage. Existing k-mer hash-indexing systems take more time because they choose parameters through a decision process based on compression results. In this paper, we propose a two-phase reference genome compression algorithm using optimal k-mer length (RGCOK). Reference-based compression takes advantage of the inter-similarity between chromosomes of the same species. RGCOK finds the optimal k-mer length for matching using a randomization method and hashing. The performance of RGCOK was evaluated on three different benchmark data sets: severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Homo sapiens, and sequences of other species, using an Amazon AWS virtual cloud machine. Experiments showed that the optimal k-mer finding time of RGCOK is around 45.28 min, whereas the corresponding time for the existing state-of-the-art algorithms HiRGC, SCCG, and HRCM ranges from 58 min to 8.97 h.
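The seed-and-extend idea behind k-mer hash indexing is easy to sketch. The following Python fragment is illustrative only (the function names and parameters are not RGCOK's): it builds a hash table from all reference k-mers and greedily extends seeded matches.

from collections import defaultdict

def build_kmer_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def longest_match(target: str, pos: int, reference: str, index: dict, k: int):
    """Longest reference match starting at target[pos], seeded by a k-mer hit."""
    seed = target[pos:pos + k]
    best = (0, -1)                           # (match length, reference position)
    for r in index.get(seed, []):
        length = k
        while (pos + length < len(target) and r + length < len(reference)
               and target[pos + length] == reference[r + length]):
            length += 1
        best = max(best, (length, r))
    return best

ref = "ACGTACGTGGTACGTTACG"
tgt = "TTACGTGGTACGTA"
idx = build_kmer_index(ref, k=4)
print(longest_match(tgt, 1, ref, idx, k=4))  # (12, 3): tgt[1:13] == ref[3:15]

Choosing a k that balances seed specificity against index size is exactly the tuning problem RGCOK attacks with randomization.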
Affiliation(s)
- Subhankar Roy
- Department of Computer Science and Engineering, Academy of Technology, Adisaptagram, Hooghly 712121, West Bengal, India.
- Anirban Mukhopadhyay
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, Nadia 741235, West Bengal, India.
2. Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. bioRxiv 2024:2023.04.15.536996. [PMID: 37131636] [PMCID: PMC10153118] [DOI: 10.1101/2023.04.15.536996]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate that it can align genes, plasmids, or entire sequencing experiments against all bacteria sequenced up to 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
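The core intuition, that placing evolutionarily close genomes next to each other lets a bounded-window compressor find long matches, can be reproduced with standard tools. A toy sketch, assuming a simple mutation chain as the "phylogeny" and zlib's 32 KB window (the paper's actual pipeline and compressors differ):

import random
import zlib

random.seed(0)

def mutate(seq: str, n: int) -> str:
    """Apply n random point substitutions."""
    s = list(seq)
    for _ in range(n):
        s[random.randrange(len(s))] = random.choice("ACGT")
    return "".join(s)

# A caterpillar "phylogeny": each genome is a lightly mutated copy of the previous.
genomes = ["".join(random.choice("ACGT") for _ in range(20000))]
for _ in range(15):
    genomes.append(mutate(genomes[-1], 50))

def deflate_size(seqs) -> int:
    # zlib's 32 KB window only reaches the immediately preceding genome,
    # so the ordering of the collection matters.
    return len(zlib.compress("".join(seqs).encode(), 9))

shuffled = genomes[:]
random.shuffle(shuffled)
print("tree order:    ", deflate_size(genomes), "bytes")
print("shuffled order:", deflate_size(shuffled), "bytes")

On typical runs the tree order compresses noticeably smaller, because each genome's nearest relative still sits inside the compressor's window.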
3. Liu Y, Shen X, Gong Y, Liu Y, Song B, Zeng X. Sequence Alignment/Map format: a comprehensive review of approaches and applications. Brief Bioinform 2023;24:bbad320. [PMID: 37668049] [DOI: 10.1093/bib/bbad320]
Abstract
The Sequence Alignment/Map (SAM) format is the text format used to record alignment information. Alignment is the core of sequencing analysis, and downstream tasks consume mapping results for further processing. Given the rapid development of the sequencing industry, a comprehensive understanding of the SAM format and related tools is necessary to meet the challenges of data processing and analysis. This paper surveys the broad field of SAM. First, the format is introduced in the context of the overall sequencing-analysis workflow. Then, existing work is systematically classified by generation, compression, and application, and the relevant SAM tools are examined in detail. Lastly, a summary and some thoughts on future directions are provided.
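For reference, the 11 mandatory tab-separated fields of a SAM alignment line can be parsed as in this minimal sketch (real tools such as pysam also handle headers, typed optional tags, and BAM/CRAM):

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ",
              "CIGAR", "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line: str) -> dict:
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, cols[:11]))
    for name in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[name] = int(record[name])
    record["TAGS"] = cols[11:]              # e.g. ['NM:i:0']
    return record

line = "read1\t99\tchr1\t7\t60\t9M\t=\t37\t39\tACGTACGTA\tIIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["RNAME"], rec["POS"], rec["CIGAR"])   # chr1 7 9M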
Affiliation(s)
- Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, China
- Xiangzhen Shen
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, China
- Yongshun Gong
- School of Software, Shandong University, Jinan 250100, China
- Yiping Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, China
- Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, China
- Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410086, China
4. Kryukov K, Jin L, Nakagawa S. Efficient compression of SARS-CoV-2 genome data using Nucleotide Archival Format. Patterns (N Y) 2022;3:100562. [PMID: 35818472] [PMCID: PMC9259476] [DOI: 10.1016/j.patter.2022.100562]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome data are essential for epidemiology, vaccine development, and tracking emerging variants. Millions of SARS-CoV-2 genomes have been sequenced during the pandemic. However, downloading SARS-CoV-2 genomes from databases is slow and unreliable, largely due to suboptimal choice of compression method. We evaluated the available compressors and found that Nucleotide Archival Format (NAF) would provide a drastic improvement compared with current methods. For the Global Initiative on Sharing Avian Flu Data's (GISAID) pre-compressed datasets, NAF would increase efficiency 52.2 times for gzip-compressed data and 3.7 times for xz-compressed data. For the DNA DataBank of Japan (DDBJ), NAF would improve throughput 40 times for gzip-compressed data. For GenBank and the European Nucleotide Archive (ENA), NAF would accelerate data distribution by a factor of 29.3 compared with uncompressed FASTA. This article provides a tutorial for installing and using NAF. Offering a NAF download option in sequence databases would provide significant savings of time, bandwidth, and disk space and accelerate biological and medical research worldwide.
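These efficiency factors combine transfer and decompression costs. A back-of-the-envelope model makes the trade-off concrete; every number below is a made-up placeholder, not a measurement from the paper:

def effective_throughput(raw_gb, compressed_gb, link_gbps, decomp_gb_per_s):
    """Raw data delivered per second, counting download plus decompression time."""
    seconds = compressed_gb / (link_gbps / 8) + raw_gb / decomp_gb_per_s
    return raw_gb / seconds

raw = 100.0                                            # GB of FASTA, hypothetical
gz  = effective_throughput(raw, raw / 4,  1.0, 0.3)    # assumed gzip: 4x ratio
naf = effective_throughput(raw, raw / 40, 1.0, 1.0)    # assumed NAF: 40x ratio
print(f"gzip: {gz:.2f} GB/s, NAF: {naf:.2f} GB/s, speedup: {naf/gz:.1f}x")

A stronger ratio shrinks download time and a faster decompressor shrinks the unpacking time; both terms matter, which is why the gain differs between gzip- and xz-compressed baselines.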
Affiliation(s)
- Kirill Kryukov
- Department of Informatics, National Institute of Genetics, Mishima, Shizuoka 411-8540, Japan
- Lihua Jin
- Genomus Co., Ltd., Sagamihara, Kanagawa 252-0226, Japan
- So Nakagawa
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
5. A hybrid data-differencing and compression algorithm for the automotive industry. Entropy 2022;24:574. [PMID: 35626459] [PMCID: PMC9140898] [DOI: 10.3390/e24050574]
Abstract
We propose an innovative delta-differencing algorithm that combines software-updating methods with LZ77 data compression. The software-updating method relates to server-side software that creates binary delta files and to client-side software that installs the updates. The proposed algorithm creates binary-differencing streams that are already compressed from the initial phase. We present a software-updating method suitable for over-the-air (OTA) software updates, together with basic strategies for achieving better performance in terms of speed, compression ratio, or both. A comparison with publicly available solutions is provided. Our test results show that our method, Keops, can outperform a binary-differencing solution based on LZMA (the Lempel-Ziv-Markov chain algorithm) in compression ratio by more than 3% in two cases, while being two to five times faster in decompression. We also show experimentally that the difference between Keops and other competing delta-creation tools increases when larger history buffers are used. In one case, Keops achieves a delta rate three times better than the competing tools.
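As an illustration of the copy/insert structure common to delta encoders (not Keops's actual format), the following sketch greedily emits COPY(offset, length) operations against the old file and literal INSERT runs, seeded by a 4-byte anchor table:

def make_delta(old: bytes, new: bytes, anchor: int = 4):
    """Greedy delta: list of ('C', offset, length) and ('I', literal_bytes) ops."""
    table = {}
    for i in range(len(old) - anchor + 1):
        table.setdefault(old[i:i + anchor], i)   # first occurrence wins
    ops, i, lit = [], 0, bytearray()
    while i < len(new):
        j = table.get(new[i:i + anchor], -1)
        if j >= 0:
            length = anchor
            while (i + length < len(new) and j + length < len(old)
                   and new[i + length] == old[j + length]):
                length += 1
            if lit:
                ops.append(("I", bytes(lit))); lit = bytearray()
            ops.append(("C", j, length))
            i += length
        else:
            lit.append(new[i]); i += 1
    if lit:
        ops.append(("I", bytes(lit)))
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        out += old[op[1]:op[1] + op[2]] if op[0] == "C" else op[1]
    return bytes(out)

old = b"firmware v1.0: boot, net, ui"
new = b"firmware v1.1: boot, net, ui, ota"
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
print(delta)

The larger the history buffer (here, the whole old image), the farther back COPY operations can reach, which is the effect the paper measures.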
6. Kryukov K, Ueda MT, Nakagawa S, Imanishi T. Sequence Compression Benchmark (SCB) database - a comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. Gigascience 2020;9:giaa072. [PMID: 32627830] [PMCID: PMC7336184] [DOI: 10.1093/gigascience/giaa072]
Abstract
Background Nearly all molecular sequence databases currently use gzip for data compression. Ongoing rapid accumulation of stored data calls for a more efficient compression tool. Although numerous compressors exist, both specialized and general-purpose, choosing one of them was difficult because no comprehensive analysis of their comparative advantages for sequence compression was available. Findings We systematically benchmarked 430 settings of 48 compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on representative FASTA-formatted datasets of DNA, RNA, and protein sequences. Each compressor was evaluated on 17 performance measures, including compression strength, as well as time and memory required for compression and decompression. We used 27 test datasets including individual genomes of various sizes, DNA and RNA datasets, and standard protein datasets. We summarized the results as the Sequence Compression Benchmark database (SCB database, http://kirr.dyndns.org/sequence-compression-benchmark/), which allows custom visualizations to be built for selected subsets of benchmark results. Conclusion We found that modern compressors offer a large improvement in compactness and speed compared to gzip. Our benchmark allows compressors and their settings to be compared using a variety of performance measures, offering the opportunity to select the optimal compressor on the basis of the data type and usage scenario specific to a particular application.
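A miniature version of the measurement protocol, restricted to the compressors in Python's standard library (the actual study benchmarked 430 settings of 48 standalone tools on 17 measures):

import bz2, gzip, lzma, time

def benchmark(name, compress, decompress, data: bytes):
    t0 = time.perf_counter(); blob = compress(data)
    t1 = time.perf_counter(); assert decompress(blob) == data
    t2 = time.perf_counter()
    print(f"{name:5s} ratio={len(data)/len(blob):6.2f} "
          f"comp={t1-t0:6.3f}s decomp={t2-t1:6.3f}s")

# Hypothetical FASTA payload; in the real benchmark this is a genome file.
data = b">seq1\n" + b"ACGTTGCAACGT" * 100000 + b"\n"
benchmark("gzip", gzip.compress, gzip.decompress, data)
benchmark("bz2",  bz2.compress,  bz2.decompress,  data)
benchmark("xz",   lzma.compress, lzma.decompress, data)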
Affiliation(s)
- Kirill Kryukov (corresponding author)
- Department of Genomics and Evolutionary Biology, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Mahoko Takahashi Ueda
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
- Current address: Department of Genomic Function and Diversity, Medical Research Institute, Tokyo Medical and Dental University, Bunkyo, Tokyo 113-8510, Japan
- So Nakagawa
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
- Tadashi Imanishi
- Department of Molecular Life Science, Tokai University School of Medicine, Isehara, Kanagawa 259-1193, Japan
7. Sardaraz M, Tahir M. SCA-NGS: secure compression algorithm for next generation sequencing data using genetic operators and block sorting. Sci Prog 2021;104:368504211023276. [PMID: 34143692] [PMCID: PMC10454964] [DOI: 10.1177/00368504211023276]
Abstract
Recent advancements in sequencing methods have led to a significant increase in sequencing data, which in turn raises research challenges in storage, transfer, and processing. Data compression techniques have been adopted to cope with the storage of these data, with good achievements in compression ratio and execution time. This fast-paced advancement has also raised major concerns about the security of data: confidentiality, integrity, and authenticity need to be ensured. This paper presents a novel lossless, reference-free algorithm that combines data compression with encryption to achieve security in addition to the other parameters. The proposed algorithm preprocesses the data before applying a general-purpose compression library, and a genetic algorithm is used to encrypt the data. The technique is validated with experimental results on benchmark datasets, and a comparative analysis with state-of-the-art techniques is presented. The results show that the proposed method achieves better results than existing methods.
Affiliation(s)
- Muhammad Sardaraz
- Department of Computer Science, Faculty of Information Sciences & Technology, COMSATS University Islamabad, Attock Campus, Attock, Punjab, Pakistan
- Muhammad Tahir
- Department of Computer Science, Faculty of Information Sciences & Technology, COMSATS University Islamabad, Attock Campus, Attock, Punjab, Pakistan
8. Karcioglu AA, Bulut H. Improving hash-q exact string matching algorithm with perfect hashing for DNA sequences. Comput Biol Med 2021;131:104292. [PMID: 33662682] [DOI: 10.1016/j.compbiomed.2021.104292]
Abstract
Exact string matching algorithms find all occurrences of a pattern P in a text T. They have been extensively studied in computer science, primarily because of their applications in fields such as text search and computational biology. The main goal of exact string matching algorithms is to find all pattern matches correctly within the shortest possible time. Although hash-based string matching algorithms run fast, they have shortcomings, such as hash collisions. In this study, a novel hash function is proposed that eliminates hash collisions for DNA sequences: it provides perfect hashing and produces hash values in a time-efficient manner. We propose two exact string matching algorithms based on this hash function. In the first approach, we replace the traditional Hash-q algorithm's hash function with the proposed one. In the second approach, we improve the first by utilizing the shift size indicated at the (m-1)th entry of the good-suffix shift table when an exact match is found. In both approaches, we eliminate the need to compare the last q characters of the pattern and text. We included six algorithms from the literature in our evaluations, using E. coli and Human Chromosome 1 datasets from the literature and a randomly produced synthetic dataset. The results show that the proposed approaches achieve better performance in terms of average runtime, average number of character comparisons, and average number of hash comparisons.
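One natural collision-free hash for DNA q-grams, shown here as an illustration rather than as the authors' exact function, encodes each base in 2 bits, so any q-gram with q <= 31 maps to a unique integer that fits in 64 bits, and the value rolls in O(1) as the window slides:

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def hash_qgram(s: str) -> int:
    """Injective (perfect) hash: the base-4 value of the q-gram."""
    h = 0
    for ch in s:
        h = (h << 2) | CODE[ch]
    return h

def roll(h: int, out_ch: str, in_ch: str, q: int) -> int:
    """Slide the window one base to the right in O(1)."""
    h -= CODE[out_ch] << (2 * (q - 1))       # drop the leftmost base
    return (h << 2) | CODE[in_ch]            # append the new base

text, q = "GATTACAGATT", 4
h = hash_qgram(text[:q])
for i in range(1, len(text) - q + 1):
    h = roll(h, text[i - 1], text[i + q - 1], q)
    assert h == hash_qgram(text[i:i + q])    # injective: no collisions, ever
print("rolling perfect hash verified")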
Affiliation(s)
- Hasan Bulut
- Department of Computer Engineering, Ege University, Izmir, Turkey
9. Liu Y, Wong L, Li J. Allowing mutations in maximal matches boosts genome compression performance. Bioinformatics 2020;36:4675-4681. [PMID: 33118018] [DOI: 10.1093/bioinformatics/btaa572]
Abstract
MOTIVATION: A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs inside a maximal match, it breaks the match into shorter segments. The coding cost of reference-based genome compression using these broken segments is much higher than that of using a maximal match that is allowed to contain mutations. RESULTS: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. memRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme, then extends these matches to cover the mismatches (mutations) and their neighbouring maximal matches, forming long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. memRGC also outperforms the best state-of-the-art methods on all benchmark datasets, sometimes by 50%. Moreover, it uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/memRGC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
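The gain from tolerating mutations inside matches is easy to demonstrate. A simplified sketch (memRGC's coprime double-window sampling is not reproduced): after an exact seed is found, keep extending across isolated mismatches as long as the bases that follow agree again.

def extend_with_mutations(a: str, i: int, b: str, j: int, probe: int = 4):
    """Extend a match at a[i]/b[j]; step over single mismatches when the
    next `probe` bases agree again. Returns (length, n_mismatches)."""
    length = mismatches = 0
    while i + length < len(a) and j + length < len(b):
        if a[i + length] == b[j + length]:
            length += 1
        elif a[i + length + 1:i + length + 1 + probe] == \
             b[j + length + 1:j + length + 1 + probe]:
            mismatches += 1          # a point mutation: record it, keep going
            length += 1
        else:
            break                    # real divergence: stop the match
    return length, mismatches

ref = "ACGTACGTACGTACGTACGT"
tgt = "ACGTACGAACGTACGTACGT"          # one substitution at offset 7
print(extend_with_mutations(ref, 0, tgt, 0))   # (20, 1): one long MCM

Encoding one 20-base match plus one mismatch position is far cheaper than encoding two short matches separately, which is the paper's central observation.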
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Limsoon Wong
- School of Computing, National University of Singapore, Singapore 117417, Singapore
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
10. Liu Y, Yu Z, Dinger ME, Li J. Index suffix-prefix overlaps by (w, k)-minimizer to generate long contigs for reads compression. Bioinformatics 2019;35:2066-2074. [PMID: 30407482] [DOI: 10.1093/bioinformatics/bty936]
Abstract
MOTIVATION: Advanced high-throughput sequencing technologies have produced massive amounts of read data, and algorithms have been specially designed to shrink these datasets for efficient storage and transmission. Reordering reads according to their positions in de novo assembled contigs or in explicit reference sequences has proven to be one of the most effective read-compression approaches. As good prior knowledge about the reference sequence is usually unavailable, current work focuses on the novel construction of de novo assembled contigs. RESULTS: We introduce a new de novo compression algorithm named minicom. This algorithm uses large k-minimizers to index the reads and groups those that share a minimizer. Within each subgroup, a contig is constructed. Pairs of contigs derived from different subgroups are then merged into longer contigs according to a (w, k)-minimizer-indexed suffix-prefix overlap similarity, and this merging is repeated until no pair of contigs can be merged. We compared minicom with two reference-based methods and four de novo methods on 18 datasets (13 RNA-seq and 5 whole-genome sequencing datasets). In the compression of single-end reads, minicom obtained the smallest file size in 22 of 34 cases, with significant improvement. In the compression of paired-end reads, minicom achieved a 20-80% compression gain over the best state-of-the-art algorithm. Our method also reduced compressed file sizes by 10% compared with the best algorithm under the read-order-preserving mode. These performances are mainly attributed to exploiting the redundancy of repetitive substrings in the long contigs. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/minicom. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
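The (w, k)-minimizer at the heart of the indexing step is standard: within every window of w consecutive k-mers, keep the lexicographically smallest. A compact sketch (minicom's bucketing and contig-merging logic are beyond this fragment):

def minimizers(seq: str, w: int, k: int):
    """Set of (position, k-mer) minimizers: the smallest k-mer in each
    window of w consecutive k-mers."""
    picked = set()
    n_kmers = len(seq) - k + 1
    for start in range(n_kmers - w + 1):
        window = [(seq[i:i + k], i) for i in range(start, start + w)]
        kmer, pos = min(window)
        picked.add((pos, kmer))
    return sorted(picked)

print(minimizers("ACGTTGCATGCA", w=4, k=3))
# Reads sharing a minimizer are likely to overlap, so they land in one subgroup.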
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
- Zuguo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Hunan Key Laboratory for Computation and Simulation in Science and Engineering, Xiangtan University, Hunan, China
- School of Electrical Engineering and Computer Science, Queensland University of Technology, Brisbane, Australia
- Marcel E. Dinger
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Sydney, NSW, Australia
- St Vincent's Clinical School, University of New South Wales, Sydney, NSW, Australia
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, Australia
11. Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: a systematic literature review. PLoS One 2020;15:e0232942. [PMID: 32453750] [PMCID: PMC7250429] [DOI: 10.1371/journal.pone.0232942]
Abstract
The recent decrease in the cost and time needed to sequence and assemble complete genomes has created an increased demand for data storage. As a consequence, several strategies for compressing assembled biological data have been created. Vertical compression tools implement strategies that take advantage of the high similarity between multiple assembled genomic sequences to achieve better compression. However, current reviews of vertical compression do not compare the execution flow of each tool, which consists of phases of preprocessing, transformation, and data encoding. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45,726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: presenting a lossless vertical compression tool; using the information contained in other sequences for compression; being able to manipulate genomic sequences in FASTA format; and requiring no prior knowledge. Although we extracted compression-performance results, they could not be compared because the tools did not use a standardized evaluation protocol. We therefore conclude that the field lacks a defined evaluation protocol to be applied by each tool.
Affiliation(s)
- Kelvin V. Kredens
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Juliano V. Martins
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Osmar B. Dordal
- Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil
- Mauri Ferrandin
- Department of Control, Automation and Computing Engineering, Universidade Federal de Santa Catarina (UFSC), Blumenau, Brazil
- Roberto H. Herai
- Graduate Program in Health Sciences, School of Medicine, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Paraná, Brazil
- Edson E. Scalabrin
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Bráulio C. Ávila
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
12. Clinical application of genomic high-throughput data: infrastructural, ethical, legal and psychosocial aspects. Eur Neuropsychopharmacol 2020;31:1-15. [PMID: 31866110] [DOI: 10.1016/j.euroneuro.2019.09.008]
Abstract
Genomic high-throughput technologies (GHTT) such as next-generation sequencing represent a fast and cost-effective tool toward a more comprehensive understanding of the molecular background of complex diseases. However, technological advances contrast with insufficient application in clinical practice. Thus, patients, physicians, and other professionals are faced with tough challenges that hinder efficient and effective implementation. With the increasing application of genetic testing, it is of paramount importance that physicians and other healthcare professionals recognize the restrictions and potential of GHTT, in order to understand and interpret the complex data in the context of health and disease. At the same time, the growing volume and complexity of data continually increase the need for sustainable infrastructure and state-of-the-art tools for efficient data management, including analysis and integration. The large pool of sensitive information remains difficult to interpret, and fundamental questions spanning billing as well as legal, social, and ethical issues have still not been resolved. Here we summarize and discuss these obstacles in an interdisciplinary context and suggest ways to overcome them. Continuous discussion with clinicians, data managers, biostatisticians, systems-medicine experts, ethicists, legal scholars, and patients illuminates the strengths, weaknesses, and current practices in the pipeline from biomaterial to sequencing and data management. This discussion also highlights the new cross-disciplinary collaborations needed to address the wide-ranging challenges in clinical genomics, including the exceptional demands placed on the staff preparing and presenting the data, as well as the question of how to report the data and results to patients.
13.
Abstract
The amount of data produced by modern sequencing instruments that needs to be stored is huge. It is therefore not surprising that much work has been done on specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and even the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching (PPM) and dynamic Markov coder (DMC) algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by state-of-the-art tools. The drawbacks of the proposed method are its large memory and time requirements.
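A minimal flavor of such context modeling (not FQSqueezer's implementation): an adaptive order-k model predicts each base from its k predecessors, and the information content -log2(p) approximates the size an arithmetic coder paired with the model would achieve.

import math
from collections import defaultdict

def estimate_bits(seq: str, k: int) -> float:
    """Adaptive order-k model: coding cost of seq in bits (Laplace smoothing)."""
    counts = defaultdict(lambda: {c: 1 for c in "ACGT"})   # +1 smoothing
    bits = 0.0
    for i in range(len(seq)):
        ctx = seq[max(0, i - k):i]
        freq = counts[ctx]
        bits += -math.log2(freq[seq[i]] / sum(freq.values()))
        freq[seq[i]] += 1                                   # learn as we code
    return bits

seq = "ACGT" * 500                     # highly repetitive toy read stream
for k in (0, 2, 4):
    print(f"order-{k}: {estimate_bits(seq, k)/len(seq):.3f} bits/base")

Longer contexts capture more structure and push the cost well below 2 bits/base, at the price of the large model memory the abstract mentions.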
14. Buchmann J, Geihs M, Hamacher K, Katzenbeisser S, Stammler S. Long-term integrity protection of genomic data. EURASIP J Inf Secur 2019. [DOI: 10.1186/s13635-019-0099-x]
Abstract
Genomic data is crucial in the understanding of many diseases and for the guidance of medical treatments. Pharmacogenomics and cancer genomics are just two areas of precision medicine with rapidly growing utilization. At the same time, whole-genome sequencing costs are plummeting below $1,000, so rapid growth in full-genome data storage requirements is foreseeable. While privacy protection of genomic data is receiving growing attention, integrity protection of this long-lived and highly sensitive data receives much less. We consider a scenario inspired by future pharmacogenomics, in which a patient's genome data is stored over a long time period while random parts of it are periodically accessed by authorized parties such as doctors and clinicians. A protection scheme is described that preserves the integrity of the genomic data in that scenario over a time horizon of 100 years. Because cryptographic schemes will potentially be broken during such a long period, our scheme allows the integrity protection to be updated. Furthermore, the integrity of parts of the genomic data can be verified without compromising the privacy of the remaining data. Finally, a performance evaluation and cost projection shows that privacy-preserving long-term integrity protection of genomic data is resource-demanding but within reach of current and future hardware technology, with negligible storage costs.
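The "verify a part without the rest" property is what hash trees provide; the paper's scheme layers hiding commitments and timestamp renewal on top, but the core mechanism can be sketched with a Merkle tree (illustrative only):

import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def merkle_root(leaves):
    level = [H(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])               # duplicate the odd tail
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, idx):
    """Sibling hashes needed to recompute the root from leaf idx."""
    level, proof = [H(x) for x in leaves], []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sib = idx ^ 1
        proof.append((level[sib], sib < idx))     # (hash, sibling-is-left?)
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        idx //= 2
    return proof

def verify(leaf, proof, root):
    h = H(leaf)
    for sib, sib_is_left in proof:
        h = H(sib + h) if sib_is_left else H(h + sib)
    return h == root

chunks = [b"chr1:ACGT...", b"chr2:TTGA...", b"chr3:GGCA...", b"chr4:ACCA..."]
root = merkle_root(chunks)
proof = merkle_proof(chunks, 2)
print(verify(chunks[2], proof, root))   # True: chunk 2 checked, others unrevealed

Only the requested chunk and a logarithmic number of sibling hashes are disclosed, which is why a clinician can audit one region while the rest of the genome stays private.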
15. Roguski L, Ochoa I, Hernaez M, Deorowicz S. FaStore: a space-saving solution for raw sequencing data. Bioinformatics 2018;34:2748-2756. [PMID: 29617939] [DOI: 10.1093/bioinformatics/bty205]
Abstract
Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on specific needs. Results: In lossless mode, FaStore achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We analyze the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, which is especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains while preserving the essential genomic information and without affecting variant-calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.
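One widely used lossy transform for the quality stream is quality binning, similar in spirit to Illumina's 8-level scheme, though the thresholds below are made up and FaStore's actual lossy modes are not necessarily this one. Mapping each Phred score to a bin representative shrinks the symbol alphabet the entropy coder must handle:

# Hypothetical bins: (upper Phred bound inclusive, representative score)
BINS = [(1, 0), (9, 6), (19, 15), (24, 22), (29, 27), (34, 33), (41, 40)]

def bin_quality(qual: str, offset: int = 33) -> str:
    """Map each Phred+33 quality character to its bin representative."""
    out = []
    for ch in qual:
        q = ord(ch) - offset
        for upper, rep in BINS:
            if q <= upper:
                out.append(chr(rep + offset))
                break
    return "".join(out)

print(bin_quality("IIIIHHG#!"))   # fewer distinct symbols => better compression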
Affiliation(s)
- Lukasz Roguski
- Centro Nacional de Análisis Genómico-Centre for Genomic Regulation, Barcelona Institute of Science and Technology (CNAG-CRG), Barcelona, Spain
- Experimental and Health Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Idoia Ochoa
- Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, USA
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, IL, USA
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
16.
Abstract
Recently, there has been growing interest in genome sequencing, driven by advances in sequencing technology, in terms of both efficiency and affordability. These developments have allowed many to envision whole-genome sequencing as an invaluable tool for both personalized medical care and public health. As a result, increasingly large and ubiquitous genomic data sets are being generated. This poses a significant challenge for the storage and transmission of these data. Already, it is more expensive to store genomic data for a decade than it is to obtain the data in the first place. This situation calls for efficient representations of genomic information. In this review, we emphasize the need for designing specialized compressors tailored to genomic data and describe the main solutions already proposed. We also give general guidelines for storing these data and conclude with our thoughts on the future of genomic formats and compressors.
Affiliation(s)
- Mikel Hernaez
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
- Dmitri Pavlichin
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Tsachy Weissman
- Department of Electrical Engineering, Stanford University, Stanford, California 94305, USA
- Idoia Ochoa
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, USA
17. Danek A, Deorowicz S. GTC: how to maintain huge genotype collections in a compressed form. Bioinformatics 2018;34:1834-1840. [PMID: 29351600] [DOI: 10.1093/bioinformatics/bty023]
Abstract
Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial. Results: We present GTC (GenoType Compressor), a novel compressed data structure for representing huge collections of genetic-variation data. It significantly outperforms existing solutions in terms of compression ratio and the time needed to answer various types of queries. We show that the largest publicly available database, of about 60,000 haplotypes at about 40 million SNPs, can be stored in less than 4 GB, while queries related to variants are answered in a fraction of a second. Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc. Contact: sebastian.deorowicz@polsl.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
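To see why genotype collections compress so well, note that a biallelic variant across N samples is essentially two bit-vectors (one per haplotype), which are mostly zeros for rare variants. A toy sparse representation illustrates the idea; GTC's actual data structure is far more sophisticated:

from bisect import bisect_left

class SparseVariant:
    """Store only the haplotype indices that carry the alternate allele."""
    def __init__(self, alt_haplotypes):
        self.alt = sorted(alt_haplotypes)

    def has(self, hap: int) -> bool:
        i = bisect_left(self.alt, hap)
        return i < len(self.alt) and self.alt[i] == hap

    def genotype(self, sample: int) -> int:
        """Number of ALT alleles (0/1/2) for a diploid sample."""
        return self.has(2 * sample) + self.has(2 * sample + 1)

# 60,000 haplotypes, a rare variant carried by only 5 of them:
v = SparseVariant([12, 13, 4091, 33333, 59999])
print(v.genotype(6), v.genotype(2045), v.genotype(0))   # 2 1 0

Five integers stand in for a 60,000-bit vector, and variant-level queries remain sublinear.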
Affiliation(s)
- Agnieszka Danek
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
18. Deorowicz S, Debudaj-Grabysz A, Gudyś A, Grabowski S. Whisper: read sorting allows robust mapping of DNA sequencing data. Bioinformatics 2019;35:2043-2050. [PMID: 30407485] [DOI: 10.1093/bioinformatics/bty927]
Abstract
MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing-data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time. RESULTS: We present Whisper, an accurate and high-performance mapping tool based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism, as well as storing temporary data on disk, results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. Experiments with real data indicate that our solution needs about 15% of the time required by the well-known BWA-MEM and Bowtie2 tools at comparable accuracy, validated in a variant-calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
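The suffix-array side of this scheme is easy to sketch: build the sorted suffix list once, then locate any read prefix by binary search. Whisper's engineered construction and read-sorting machinery are not reproduced here; this is a didactic sketch (the bisect key= parameter requires Python 3.10+):

from bisect import bisect_left, bisect_right

def suffix_array(text: str):
    """Naive O(n^2 log n) construction; production tools use linear-time builds."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text: str, sa, pattern: str):
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect_left(sa, pattern, key=key)
    hi = bisect_right(sa, pattern, key=key)
    return sorted(sa[lo:hi])

genome = "ACGTACGTGACGT"
sa = suffix_array(genome)
print(find_occurrences(genome, sa, "ACGT"))   # [0, 4, 9]

Sorting the reads first means consecutive queries hit nearby regions of the suffix array, which is the cache-friendliness argument behind Whisper's design.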
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, Poland
- Agnieszka Debudaj-Grabysz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, Poland
- Adam Gudyś
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 16, Gliwice, Poland
- Szymon Grabowski
- Institute of Applied Computer Science, Faculty of Electrical, Electronic, Computer and Control Engineering, Lodz University of Technology, Stefanowskiego 18/22, Łódź, Poland
19. Najam M, Rasool RU, Ahmad HF, Ashraf U, Malik AW. Pattern matching for DNA sequencing data using multiple Bloom filters. Biomed Res Int 2019;2019:7074387. [PMID: 31111064] [PMCID: PMC6487161] [DOI: 10.1155/2019/7074387]
Abstract
Storing and processing large DNA sequences has always been a major challenge, given the increasing volume of DNA sequence data. A number of solutions have been proposed, but they require significant computation and memory; an efficient storage and pattern-matching solution for DNA sequencing data is therefore needed. Bloom filters (BFs) are an efficient data structure widely used in bioinformatics for the classification of DNA sequences. In this paper, we explore uses of BFs beyond classification. The proposed solution, based on Multiple Bloom Filters (MBFs), finds all locations and the number of repetitions of a specified pattern inside a DNA sequence; both factors are extremely important in determining the type and intensity of a disease. This paper serves as a first effort toward optimizing the search for the location and frequency of substrings in DNA sequences using MBFs. We expect that further optimizations of the proposed solution can bring remarkable results, as this paper presents a proof-of-concept implementation for a given dataset. Performance evaluation shows improved accuracy and time efficiency of the proposed approach.
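A single Bloom filter answers "possibly present / definitely absent" for k-mers; the MBF scheme adds one filter per sequence region so that hits also narrow down locations. A minimal single-filter sketch using double hashing (parameter values are arbitrary):

import hashlib

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter(m_bits=10_000, k_hashes=4)
seq, q = "GATTACAGATTACA", 5
for i in range(len(seq) - q + 1):
    bf.add(seq[i:i + q])
print("GATTA" in bf, "CCCCC" in bf)   # True, (almost certainly) False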
Affiliation(s)
- Maleeha Najam
- Fatima Jinnah Women University, Rawalpindi, Pakistan
- Hafiz Farooq Ahmad
- College of Computer Sciences and Information Technology (CCSIT), King Faisal University, Alahsa 31982, Saudi Arabia
- Usman Ashraf
- College of Computer Sciences and Information Technology (CCSIT), King Faisal University, Alahsa 31982, Saudi Arabia
- Asad Waqar Malik
- School of Electrical Engineering and Computer Science, National University of Sciences and Technology (NUST), Islamabad, Pakistan
- Department of Information Systems, Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
20. Guerra A, Lotero J, Aedo JÉ, Isaza S. Tackling the challenges of FASTQ referential compression. Bioinform Biol Insights 2019;13:1177932218821373. [PMID: 30792576] [PMCID: PMC6376532] [DOI: 10.1177/1177932218821373]
Abstract
The exponential growth of genomic data has recently motivated the development of compression algorithms to tackle storage-capacity limitations in bioinformatics centers. Referential compressors could theoretically achieve much higher compression than their non-referential counterparts; however, the latest tools have not yet been able to harness that potential. To reach it, an efficient encoding model for representing the differences between the input and the reference is needed. In this article, we introduce a novel approach for referential compression of FASTQ files. The core of our compression scheme is a referential compressor based on combining local alignments with binary encoding optimized for long reads. We present the algorithms and performance tests developed for our read-compression algorithm, named UdeACompress. Our compressor achieved the best results when compressing long reads, and competitive compression ratios for shorter reads, compared with the best programs in the state of the art. As an added value, it also showed reasonable execution times and memory consumption in comparison with similar tools.
Affiliation(s)
- Aníbal Guerra
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Facultad de Ingeniería, Universidad de Antioquia (UdeA), Medellín, Colombia
- Jaime Lotero
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- José Édinson Aedo
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
- Sebastián Isaza
- Facultad de Ciencias y Tecnología (FaCyT), Universidad de Carabobo (UC), Valencia, Venezuela
21.
Abstract
Background: Biological sequence data have increased at a rapid rate due to advancements in sequencing technologies and the reduced cost of sequencing. This huge increase presents significant challenges to researchers. In addition to meaningful analysis, data storage is a challenge: the growth in data production is outpacing storage capacity. Data compression is used to reduce the size of data, which lowers storage requirements as well as transmission costs over the internet.
Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format.
Method: The proposed algorithm uses bit manipulation and dictionary-based compression for the bases. Headers are compressed with reference-based compression, whereas quality scores are compressed with Huffman coding.
Results: The proposed algorithm is validated with experimental results on real datasets. The results are compared with both general-purpose and specialized compression programs.
Conclusion: The proposed algorithm produces a better compression ratio in comparable time to other algorithms.
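"Bit manipulation" for bases typically means packing each of A/C/G/T into 2 bits, a 4x reduction before any entropy coding. A sketch of that step only (FCompress's dictionary stage is not shown, and this may differ from its exact packing):

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack 4 bases per byte; the caller must remember the original length."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for ch in seq[i:i + 4]:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(seq[i:i + 4]))   # pad the final, partial byte
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASE[(byte >> shift) & 3])
    return "".join(out[:n])

read = "GATTACACATTAG"
assert unpack(pack(read), len(read)) == read
print(len(read), "bases ->", len(pack(read)), "bytes")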
Affiliation(s)
- Muhammad Sardaraz
- Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
- Muhammad Tahir
- Department of Computer Science, COMSATS Institute of Information Technology, Attock, Pakistan
22. Cheng KO, Law NF, Siu WC. Clustering-based compression for population DNA sequences. IEEE/ACM Trans Comput Biol Bioinform 2019;16:208-221. [PMID: 29028207] [DOI: 10.1109/TCBB.2017.2762302]
Abstract
Due to advances in DNA sequencing techniques, the number of sequenced individual genomes has grown exponentially, so effective compression of this kind of sequence is highly desired. In this work, we present a novel compression algorithm called RCC (Reference-based Compression algorithm using the concept of Clustering). The rationale behind RCC is the observation that substructures exist within the population of sequences. To utilize these substructures, k-means clustering is employed to partition sequences into clusters for better compression. A reference sequence is then constructed for each cluster so that sequences in that cluster can be compressed by referring to it. The reference sequence of each cluster is in turn compressed with reference to a sequence derived from all the reference sequences. Experiments show that RCC can reduce compressed sizes by up to 91.0 percent compared with state-of-the-art approaches. There is a compromise between compressed size and processing time: the current Matlab implementation runs thousands of times slower than existing algorithms implemented in C/C++, and further work is required to improve its processing time.
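The clustering step can be illustrated with k-means over k-mer frequency vectors, a common way to embed sequences in a fixed-dimensional space (whether RCC uses exactly this featurization is a detail of the paper; only the k-means idea is taken from the abstract):

from itertools import product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]   # 16-dim profile

def profile(seq: str):
    v = [0.0] * len(KMERS)
    for i in range(len(seq) - 1):
        v[KMERS.index(seq[i:i + 2])] += 1
    total = sum(v) or 1
    return [x / total for x in v]

def kmeans(points, k, iters=20):
    centers = [list(p) for p in points[:k]]               # naive init
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups

seqs = ["ACACACACAC", "ACACACGCAC", "TGTGTGTGTG", "TGTGCGTGTG"]
centers, groups = kmeans([profile(s) for s in seqs], k=2)
print([len(g) for g in groups])   # [2, 2]: the two similar pairs cluster together

Each cluster's members are then coded against a per-cluster reference, so intra-cluster similarity translates directly into compression.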
23. Wasik S, Szostak N, Kudla M, Wachowiak M, Krawiec K, Blazewicz J. Detecting life signatures with RNA sequence similarity measures. J Theor Biol 2019;463:110-120. [PMID: 30562502] [DOI: 10.1016/j.jtbi.2018.12.018]
Abstract
The RNA World is currently the most plausible hypothesis for explaining the origins of life on Earth. The supporting body of evidence is growing and it comes from multiple areas, including astrobiology, chemistry, biology, mathematics, and, in particular, from computer simulations. Such methods frequently assume the existence of a hypothetical species on Earth, around three billion years ago, with a base sequence probably dissimilar from any in known genomes. However, it is often hard to verify whether or not a hypothetical sequence has the characteristics of biological sequences, and is thus likely to be functional. The primary objective of the presented research was to verify the possibility of building a computational 'life probe' for determining whether a given genetic sequence is biological, and assessing the sensitivity of such probes to the signatures of life present in known biological sequences. We have proposed decision algorithms based on the normalized compression distance (NCD) and Levenshtein distance (LD). We have validated the proposed method in the context of the RNA World hypothesis using short genetic sequences shorter than the error threshold value (i.e., 100 nucleotides). We have demonstrated that both measures can be successfully used to construct life probes that are significantly better than a random decision procedure, while varying from each other when it comes to detailed characteristics. We also observed that fragments of sequences related to replication have better discriminatory power than sequences having other molecular functions. In a broader context, this shows that the signatures of life in short RNA samples can be effectively detected using relatively simple means.
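The normalized compression distance is simple to compute with any off-the-shelf compressor: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. A sketch with zlib (the paper's choice of compressor and decision thresholds may differ):

import zlib

def C(s: str) -> int:
    return len(zlib.compress(s.encode(), 9))

def ncd(x: str, y: str) -> float:
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

ribozyme = "GGAUAUGAGGCUUCGGUCCCUAAGGACGA" * 3
similar  = ribozyme.replace("AAG", "AUG")
rand_rna = "GUACCGUAAGCUAGGAUCCGAUUGCACGA" * 3
print(f"similar pair: {ncd(ribozyme, similar):.3f}")
print(f"unrelated:    {ncd(ribozyme, rand_rna):.3f}")   # larger distance

A life probe built on NCD classifies a candidate sequence as "biological" when its distance to known biological sequences falls below a calibrated threshold.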
Affiliation(s)
- Szymon Wasik
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland; European Centre for Bioinformatics and Genomics, Poznan, Poland
- Natalia Szostak
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland; European Centre for Bioinformatics and Genomics, Poznan, Poland
- Mateusz Kudla
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland
- Michal Wachowiak
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland
- Krzysztof Krawiec
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Poznan, Poland; Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland; European Centre for Bioinformatics and Genomics, Poznan, Poland
24. Wandelt S, Sun X, Leser U. Column-wise compression of open relational data. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.04.074]
25. Holley G, Wittler R, Stoye J, Hach F. Dynamic alignment-free and reference-free read compression. J Comput Biol 2018;25:825-836. [PMID: 30011247] [DOI: 10.1089/cmb.2018.0068]
Abstract
The advent of high-throughput sequencing (HTS) technologies raises a major concern about the storage and transmission of the data they produce. In particular, large-scale sequencing projects generate an unprecedented volume of genomic sequences, ranging from tens to several thousands of genomes per species. These collections contain highly similar and redundant sequences, also known as pangenomes. The ideal way to represent and transfer pangenomes is through compression. A number of HTS-specific compression tools have been developed to reduce the storage and communication costs of HTS data, yet none of them is designed to process a pangenome. In this article, we present DARRC, a new dynamic alignment-free and reference-free read compression method. It addresses the problem of pangenome compression by encoding the sequences of a pangenome as a guided de Bruijn graph. The novelty of this method is its ability to incrementally update DARRC archives with new genome sequences without full decompression of the archive. DARRC can compress both single-end and paired-end read sequences of any length using all symbols of the IUPAC nucleotide code. On a large Pseudomonas aeruginosa dataset, our method outperforms all other tested tools. It provides a 30% compression-ratio improvement in single-end mode compared with the best-performing state-of-the-art HTS-specific compression method in our experiments.
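The guided de Bruijn graph extends the plain de Bruijn graph, which is easy to build: nodes are (k-1)-mers and each k-mer contributes one edge. A minimal sketch (DARRC's guided annotations and incremental updates are not shown):

from collections import defaultdict

def de_bruijn(reads, k):
    """Edge map of the de Bruijn graph: (k-1)-mer -> successor (k-1)-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

reads = ["ACGTT", "CGTTG", "GTTGA"]   # overlapping reads from one region
for node, succ in sorted(de_bruijn(reads, k=3).items()):
    print(node, "->", ", ".join(sorted(succ)))
# Shared k-mers collapse into single edges: that is where the compression comes from.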
Affiliation(s)
- Guillaume Holley
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Roland Wittler
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- International Research Training Group 1906 "Computational Methods for the Analysis of the Diversity and Dynamics of Genomes," Bielefeld University, Bielefeld, Germany
- Jens Stoye
- Genome Informatics, Faculty of Technology, Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Faraz Hach
- School of Computing Science, Simon Fraser University, Burnaby, Canada
- Department of Urologic Sciences, University of British Columbia, Vancouver, Canada
- Vancouver Prostate Centre, Vancouver, Canada
26. Deorowicz S, Walczyszyn J, Debudaj-Grabysz A. CoMSA: compression of protein multiple sequence alignment files. Bioinformatics 2019;35:227-234. [DOI: 10.1093/bioinformatics/bty619]
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Joanna Walczyszyn
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Agnieszka Debudaj-Grabysz
- Institute of Informatics, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
27. Lin J, Adjeroh DA, Jiang BH, Jiang Y. K2 and K2*: efficient alignment-free sequence similarity measurement based on Kendall statistics. Bioinformatics 2018;34:1682-1689. [PMID: 29253072] [PMCID: PMC6355110] [DOI: 10.1093/bioinformatics/btx809]
Abstract
Motivation: Alignment-free sequence comparison methods can compute the pairwise similarity between huge numbers of sequences much faster than sequence-alignment-based methods. Results: We propose a new non-parametric alignment-free sequence comparison method, called K2, based on the Kendall statistic. Compared with other state-of-the-art alignment-free comparison methods, K2 demonstrates competitive performance in generating phylogenetic trees, in evaluating functionally related regulatory sequences, and in computing the edit distance (similarity/dissimilarity) between sequences. Furthermore, the K2 approach is much faster than the other methods. An improved method, K2*, is also proposed, which determines the appropriate algorithmic parameter (length) automatically, without first considering different values. Comparative analysis with the state-of-the-art alignment-free sequence similarity methods demonstrates the superiority of the proposed approaches, especially with increasing sequence length or dataset sizes. Availability and implementation: K2 and K2* are implemented in the R language as a package and are freely available for open access (http://community.wvu.edu/daadjeroh/projects/K2/K2_1.0.tar.gz). Contact: yueljiang@163.com. Supplementary information: Supplementary data are available at Bioinformatics online.
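The flavor of a Kendall-statistics comparison can be sketched as follows: represent each sequence by its k-mer counts and score how concordantly the two count vectors rank the k-mers. This is a simplified stand-in, not K2's exact statistic or normalization:

from itertools import combinations, product

KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def counts(seq):
    return [sum(seq[i:i + 2] == k for i in range(len(seq) - 1)) for k in KMERS]

def kendall_tau(u, v):
    """Concordant minus discordant pairs, normalized to [-1, 1]."""
    score = n = 0
    for i, j in combinations(range(len(u)), 2):
        a, b = u[i] - u[j], v[i] - v[j]
        if a * b > 0: score += 1
        elif a * b < 0: score -= 1
        n += 1
    return score / n

x, y, z = "ACACACGTACAC", "ACACACGAACAC", "GGTTGGTTGGTT"
print(f"tau(x, y) = {kendall_tau(counts(x), counts(y)):.2f}")   # near 1: similar
print(f"tau(x, z) = {kendall_tau(counts(x), counts(z)):.2f}")   # lower: dissimilar

Because only rank concordance matters, the measure is robust to differences in sequence length and absolute k-mer abundance.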
Collapse
Affiliation(s)
- Jie Lin
- Department of Software engineering, College of Mathematics and
Informatics, Fujian Normal University, Fuzhou, China
| | - Donald A Adjeroh
- Department of Computer Science & Electrical Engineering, West
Virginia University, Morgantown, WV, USA
| | - Bing-Hua Jiang
- Department of Pathology, Carver College of Medicine, The University of
Iowa, Iowa City, IA, USA
| | - Yue Jiang
- Department of Software engineering, College of Mathematics and
Informatics, Fujian Normal University, Fuzhou, China
| |
Collapse
|
28. Cheng H, Wu M, Xu Y. FMtree: a fast locating algorithm of FM-indexes for genomic data. Bioinformatics 2018;34:416-424. [PMID: 28968761] [DOI: 10.1093/bioinformatics/btx596]
Abstract
Motivation: As a fundamental task in bioinformatics, searching for massive short patterns over a long text has been accelerated by various compressed full-text indexes. These indexes provide searching functionality similar to that of classical indexes, e.g. suffix trees and suffix arrays, while requiring less space. For genomic data, a well-known family of compressed full-text indexes, called FM-indexes, offers unmatched performance in practice. One major drawback of FM-indexes is that their locating operations, which report all occurrence positions of a pattern in a given text, are not efficient, especially for patterns with many occurrences. Results: In this paper, we introduce a novel locating algorithm, FMtree, to quickly retrieve all occurrence positions of any pattern via FM-indexes. When searching for a pattern over a given text, FMtree organizes the search space of the locating operation into a conceptual multiway tree. As a result, multiple occurrence positions of the pattern can be retrieved simultaneously by traversing the multiway tree. Compared with existing locating algorithms, our tree-based algorithm avoids large numbers of redundant operations and exhibits better data locality. Experimental results show that FMtree is usually one order of magnitude faster than the state-of-the-art algorithms, while remaining memory-efficient. Availability and implementation: FMtree is freely available at https://github.com/chhylp123/FMtree. Contact: xuyun@ustc.edu.cn. Supplementary information: Supplementary data are available at Bioinformatics online.
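Counting pattern occurrences with an FM-index takes only a few lines; the expensive part FMtree accelerates is locate, which maps the resulting interval back to text positions. A didactic count() over a plain, uncompressed BWT (real indexes sample the rank structure instead of rescanning):

def bwt(text: str) -> str:
    text += "$"
    return "".join(text[i - 1] for i in sorted(range(len(text)),
                                               key=lambda i: text[i:] + text[:i]))

def fm_count(bwt_str: str, pattern: str) -> int:
    """Backward search: size of the suffix-array interval matching `pattern`."""
    sorted_bwt = sorted(bwt_str)
    # C[c]: number of characters in the text strictly smaller than c.
    C = {c: sorted_bwt.index(c) for c in set(bwt_str)}
    occ = lambda c, i: bwt_str[:i].count(c)     # rank query, done naively here
    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("ACGTACGTGACGT")
print(fm_count(b, "ACGT"))   # 3 occurrences

Turning each of those 3 interval rows into a text position normally requires walking the BWT until a sampled suffix-array entry is hit; FMtree batches these walks through its multiway tree.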
Collapse
Affiliation(s)
- Haoyu Cheng
- School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
- Key Laboratory on High Performance Computing, Anhui Province
- Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073, China
| | - Ming Wu
- School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
- Key Laboratory on High Performance Computing, Anhui Province
| | - Yun Xu
- School of Computer Science, University of Science and Technology of China, Hefei, Anhui 230027, China
- Key Laboratory on High Performance Computing, Anhui Province
- Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073, China
| |
Collapse
|
29
|
Holley G, Wittler R, Stoye J, Hach F. Dynamic Alignment-Free and Reference-Free Read Compression. LECTURE NOTES IN COMPUTER SCIENCE 2017. [DOI: 10.1007/978-3-319-56970-3_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
|
30
|
Comparison of high-throughput sequencing data compression tools. Nat Methods 2016; 13:1005-1008. [PMID: 27776113 DOI: 10.1038/nmeth.4037] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 09/01/2016] [Indexed: 12/27/2022]
Abstract
High-throughput sequencing (HTS) data are commonly stored as raw sequencing reads in FASTQ format or as reads mapped to a reference, in SAM format, both with large memory footprints. Worldwide growth of HTS data has prompted the development of compression methods that aim to significantly reduce HTS data size. Here we report on a benchmarking study of available compression methods on a comprehensive set of HTS data using an automated framework.
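A toy version of such a benchmark takes only a few lines: measure compression ratio and throughput of one general-purpose codec (gzip from the Python standard library) on a FASTQ file. The file name is a placeholder; the published study covers many dedicated HTS tools and also tracks memory use.

# Minimal compression benchmark: ratio and throughput for gzip on one file.
import gzip, os, time

def bench_gzip(path, level=6):
    t0 = time.perf_counter()
    with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb', compresslevel=level) as dst:
        dst.write(src.read())
    elapsed = time.perf_counter() - t0
    raw, comp = os.path.getsize(path), os.path.getsize(path + '.gz')
    return {'ratio': raw / comp, 'MB/s': raw / 1e6 / elapsed}

# print(bench_gzip('reads.fastq'))   # 'reads.fastq' is a placeholder path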
Collapse
|
31
|
Ke R, Mignardi M, Hauling T, Nilsson M. Fourth Generation of Next-Generation Sequencing Technologies: Promise and Consequences. Hum Mutat 2016; 37:1363-1367. [PMID: 27406789 PMCID: PMC5111608 DOI: 10.1002/humu.23051] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2016] [Revised: 06/08/2016] [Accepted: 07/07/2016] [Indexed: 01/02/2023]
Abstract
In this review, we discuss the emergence of the fourth‐generation sequencing technologies that preserve the spatial coordinates of RNA and DNA sequences with up to subcellular resolution, thus enabling back mapping of sequencing reads to the original histological context. This information is used, for example, in two current large‐scale projects that aim to unravel the function of the brain. Also in cancer research, fourth‐generation sequencing has the potential to revolutionize the field. Cancer Research UK has named “Mapping the molecular and cellular tumor microenvironment in order to define new targets for therapy and prognosis” one of the grand challenges in tumor biology. We discuss the advantages of sequencing nucleic acids directly in fixed cells over traditional next‐generation sequencing (NGS) methods, the limitations and challenges that these new methods have to face to become broadly applicable, and the impact that the information generated by the combination of in situ sequencing and NGS methods will have in research and diagnostics.
Collapse
Affiliation(s)
- Rongqin Ke
- School of Biomedical Sciences, Huaqiao University, Quanzhou, Fujian, 362021, China
| | - Marco Mignardi
- Department of Information Technology, Centre for Image Analysis, Science for Life Laboratory, Uppsala University, Uppsala, SE-75105, Sweden; Department of Bioengineering, Stanford University, Stanford, California 94305, USA
| | - Thomas Hauling
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, SE-171 21, Sweden
| | - Mats Nilsson
- Department of Biochemistry and Biophysics, Science for Life Laboratory, Stockholm University, Solna, SE-171 21, Sweden
| |
Collapse
|
32
|
Wu TD. Bitpacking techniques for indexing genomes: I. Hash tables. Algorithms Mol Biol 2016; 11:5. [PMID: 27095998 PMCID: PMC4835851 DOI: 10.1186/s13015-016-0069-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2015] [Accepted: 04/01/2016] [Indexed: 11/20/2022] Open
Abstract
Background Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. Results We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. Conclusions Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access. Electronic supplementary material The online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users.
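The scalar Python sketch below shows the underlying idea of blockwise delta bitpacking for a monotonically nondecreasing offset array: each 64-entry block stores an anchor value plus fixed-width packed deltas, so a random access decodes at most one block. The actual BP64-columnar format lays the delta bits out column-wise so SIMD instructions can decode many values at once; that vectorized layout is not reproduced here.

# Blockwise delta bitpacking with random access (scalar illustration).
BLOCK = 64

def pack(offsets):
    blocks = []
    for s in range(0, len(offsets), BLOCK):
        chunk = offsets[s:s + BLOCK]
        deltas = [chunk[i] - chunk[i - 1] for i in range(1, len(chunk))]
        width = max((d.bit_length() for d in deltas), default=0)
        bits = 0
        for j, d in enumerate(deltas):           # little-endian fixed-width fields
            bits |= d << (j * width)
        blocks.append((chunk[0], width, bits))   # (anchor, width, payload)
    return blocks

def get(blocks, i):
    anchor, width, bits = blocks[i // BLOCK]
    value, mask = anchor, (1 << width) - 1
    for j in range(i % BLOCK):                   # prefix-sum the needed deltas
        value += (bits >> (j * width)) & mask
    return value

offsets = [0, 3, 3, 10, 50, 51, 300, 302]
packed = pack(offsets)
assert all(get(packed, i) == v for i, v in enumerate(offsets))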
Collapse
|
33
|
Kim M, Zhang X, Ligo JG, Farnoud F, Veeravalli VV, Milenkovic O. MetaCRAM: an integrated pipeline for metagenomic taxonomy identification and compression. BMC Bioinformatics 2016; 17:94. [PMID: 26895947 PMCID: PMC4759986 DOI: 10.1186/s12859-016-0932-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2015] [Accepted: 02/02/2016] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging between 1 and 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. RESULTS MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes. CONCLUSIONS We described the first architecture for reference-based, lossless compression of metagenomic data. The proposed compression scheme offers significantly improved compression ratios compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. AVAILABILITY The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that running the code requires a minimum of 16 GB of RAM. In addition, a VirtualBox machine configured with 4 GB of RAM is provided for users to run a simple demonstration.
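One ingredient of such reference-based compression can be sketched independently of the pipeline: sorted read-alignment start positions are delta-encoded and written as LEB128-style varints, so the typically small gaps cost a single byte each. This is a generic illustration of the integer-coding building block, not MetaCRAM's actual codecs.

# Delta + varint coding of sorted alignment positions.
def encode_positions(positions):
    out, prev = bytearray(), 0
    for p in sorted(positions):          # delta against previous position
        gap, prev = p - prev, p
        while True:                      # LEB128: 7 data bits per byte
            byte, gap = gap & 0x7F, gap >> 7
            out.append(byte | 0x80 if gap else byte)
            if not gap:
                break
    return bytes(out)

def decode_positions(data):
    positions, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:              # last byte of this varint
            prev += value
            positions.append(prev)
            value, shift = 0, 0
    return positions

pos = [1042, 7, 7, 1050, 99000]
assert decode_positions(encode_positions(pos)) == sorted(pos)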
Collapse
Affiliation(s)
- Minji Kim
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, USA.
| | - Xiejia Zhang
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, USA.
| | - Jonathan G Ligo
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, USA.
| | - Farzad Farnoud
- Department of Electrical Engineering, California Institute of Technology, Pasadena, 91125, USA.
| | - Venugopal V Veeravalli
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, USA.
| | - Olgica Milenkovic
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, 61801, USA.
| |
Collapse
|
34
|
Sardaraz M, Tahir M, Ikram AA. Advances in high throughput DNA sequence data compression. J Bioinform Comput Biol 2015; 14:1630002. [PMID: 26846812 DOI: 10.1142/s0219720016300021] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Advances in high throughput sequencing technologies and a reduction in the cost of sequencing have led to exponential growth in high throughput DNA sequence data. This growth has posed challenges such as storage, retrieval, and transmission of sequencing data. Data compression is used to cope with these challenges. Various methods have been developed to compress genomic and sequencing data. In this article, we present a comprehensive review of compression methods for genomes and sequencing reads. Algorithms are categorized as referential or reference-free. Experimental results and comparative analysis of various methods for data compression are presented. Finally, key challenges and research directions in DNA sequence data compression are highlighted.
Collapse
Affiliation(s)
- Muhammad Sardaraz
- 1 Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Muhammad Tahir
- 1 Department of Computer Science, University of Wah, Quaid Avenue, Wah Cantt 47040, Pakistan
| | - Ataul Aziz Ikram
- 2 Department of Electrical Engineering, National University, Islamabad 44000, Pakistan
| |
Collapse
|
35
|
Cheng KO, Wu P, Law NF, Siu WC. Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1322-1332. [PMID: 26671804 DOI: 10.1109/tcbb.2015.2403370] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Traditionally, intra-sequence similarity is exploited for compressing a single DNA sequence. Recently, remarkable compression performance on individual DNA sequences from the same population has been achieved by encoding their differences with a nearly identical reference sequence. Nevertheless, there is a lack of general algorithms that also allow less similar reference sequences. In this work, we extend intra-sequence similarity to inter-sequence similarity, in that approximate matches of subsequences are found between the DNA sequence and a set of reference sequences. Hence, a set of nearly identical DNA sequences from the same population, or a set of partially similar DNA sequences like chromosome sequences and DNA sequences of related species, can be compressed together. For practical compressors, the compressed size is usually influenced by the compression order of sequences. Fast search algorithms for the optimal compression order are thus developed for multiple-sequence compression. Experimental results on artificial and real datasets demonstrate that our proposed multiple-sequence compression methods with fast compression-order search are able to achieve good compression performance under different levels of similarity in the multiple DNA sequences.
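A hedged sketch of why compression order matters and how a greedy order search can work: score each candidate by how small it compresses when the previously chosen sequence is supplied as a zlib preset dictionary, and repeatedly pick the best. zlib exploits only the final 32 KB of the dictionary, and the paper's search algorithms are considerably more refined; this is illustration only.

# Greedy compression-order search using zlib preset dictionaries.
import zlib

def csize(seq, ref=None):
    c = zlib.compressobj(9, zdict=ref.encode()) if ref else zlib.compressobj(9)
    return len(c.compress(seq.encode()) + c.flush())

def greedy_order(seqs):
    remaining = set(range(len(seqs)))
    cur = min(remaining, key=lambda i: csize(seqs[i]))   # cheapest standalone
    order = [cur]
    remaining.remove(cur)
    while remaining:
        prev = seqs[order[-1]]                           # last chosen = reference
        cur = min(remaining, key=lambda i: csize(seqs[i], ref=prev))
        order.append(cur)
        remaining.remove(cur)
    return order

print(greedy_order(["GGGG" * 500, "ACGT" * 500, "ACGT" * 499 + "TTTT"]))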
Collapse
|
36
|
Wandelt S, Leser U. Sequence Factorization with Multiple References. PLoS One 2015; 10:e0139000. [PMID: 26422374 PMCID: PMC4589410 DOI: 10.1371/journal.pone.0139000] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2014] [Accepted: 09/07/2015] [Indexed: 11/29/2022] Open
Abstract
The success of high-throughput sequencing has led to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained considerable interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 or more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
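The factorization itself can be sketched as follows: a shared k-mer index over all references proposes seed matches, the longest extension is chosen as a copy factor (reference id, position, length), and unmatched bases become literals. Seed length, tie-breaking and memory layout are exactly the kind of parameters whose influence the paper studies; the choices below (k = 8, greedy longest match) are arbitrary.

# Greedy referential factorization against multiple references.
K = 8                                    # seed length, an arbitrary choice

def build_index(refs):
    index = {}
    for r, ref in enumerate(refs):
        for i in range(len(ref) - K + 1):
            index.setdefault(ref[i:i + K], []).append((r, i))
    return index

def factorize(target, refs, index):
    factors, i = [], 0
    while i < len(target):
        best = None
        for r, p in index.get(target[i:i + K], []):      # seed hits
            l = K                                        # extend the match
            while i + l < len(target) and p + l < len(refs[r]) \
                    and target[i + l] == refs[r][p + l]:
                l += 1
            if best is None or l > best[2]:
                best = (r, p, l)
        if best:
            factors.append(best)                         # copy factor
            i += best[2]
        else:
            factors.append(target[i])                    # literal base
            i += 1
    return factors

refs = ["ACGTACGTTTGACCA" * 4, "TTGACCATTGACGGG" * 4]
print(factorize("ACGTACGTTTGACGGGAA", refs, build_index(refs)))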
Collapse
Affiliation(s)
- Sebastian Wandelt
- Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
| | - Ulf Leser
- Knowledge Management in Bioinformatics, Humboldt-University of Berlin, Rudower Chaussee 25, 12489 Berlin, Germany
| |
Collapse
|
37
|
Loeffelholz M, Fofanov Y. The main challenges that remain in applying high-throughput sequencing to clinical diagnostics. Expert Rev Mol Diagn 2015; 15:1405-8. [DOI: 10.1586/14737159.2015.1088385] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
38
|
Ochoa I, Hernaez M, Weissman T. Aligned genomic data compression via improved modeling. J Bioinform Comput Biol 2015; 12:1442002. [PMID: 25395305 DOI: 10.1142/s0219720014420025] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
With the release of the latest Next-Generation Sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing the whole genome of a human is expected to drop to a mere $1000. This milestone in sequencing history marks the era of affordable sequencing of individuals and opens the doors to personalized medicine. Accordingly, unprecedented volumes of genomic data will require storage for processing. There will be a dire need not only for compressing aligned data, but also for generating compressed files that can be fed directly to downstream applications to facilitate the analysis of and inference on the data. Several approaches to this challenge have been proposed in the literature; however, focus thus far has been on the low-coverage regime and most of the suggested compressors are not based on effective modeling of the data. We demonstrate the benefit of data modeling for compressing aligned reads. Specifically, we show that, by working with data models designed for the aligned data, we can improve considerably over the best compression ratio achieved by previously proposed algorithms. Our results indicate that the Pareto-optimal barrier for compression rate and speed claimed by Bonfield and Mahoney (2013) [Bonfield JK and Mahoney MV, Compression of FASTQ and SAM format sequencing data, PLOS ONE, 8(3):e59190, 2013.] does not apply for high-coverage aligned data. Furthermore, our improved compression ratio is achieved by splitting the data in a manner conducive to operations in the compressed domain by downstream applications.
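The benefit of modeling can be illustrated with a generic information-theoretic calculation: the order-k empirical entropy of a sequence, i.e. the bits-per-symbol bound that an arithmetic coder driven by an order-k context model could approach. This is textbook material, not the paper's read-specific models.

# Order-k empirical entropy: conditioning on context lowers the bound.
from collections import Counter, defaultdict
from math import log2

def order_k_entropy(seq, k):
    ctx_counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        ctx_counts[seq[i - k:i]][seq[i]] += 1     # symbol given k-char context
    total, bits = len(seq) - k, 0.0
    for counts in ctx_counts.values():
        n = sum(counts.values())
        for c in counts.values():
            bits += c * -log2(c / n)              # code length under this context
    return bits / total

seq = "ACGTACGTACGTACGAACGT" * 50
for k in (0, 1, 2):
    print(k, round(order_k_entropy(seq, k), 3))   # bits/base shrinks as k grows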
Collapse
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA, USA
| | | | | |
Collapse
|
39
|
Patro R, Kingsford C. Data-dependent bucketing improves reference-free compression of sequencing reads. Bioinformatics 2015; 31:2770-7. [PMID: 25910696 PMCID: PMC4547610 DOI: 10.1093/bioinformatics/btv248] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2014] [Revised: 04/11/2015] [Accepted: 04/20/2015] [Indexed: 01/24/2023] Open
Abstract
MOTIVATION The storage and transmission of high-throughput sequencing data consumes significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reduce storage and transmission requirements is to compress this sequencing data. RESULTS We present a novel technique to boost the compression of sequencing data that is based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes. AVAILABILITY AND IMPLEMENTATION Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/∼ckingsf/software/mince. CONTACT carlk@cs.cmu.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
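A miniature demonstration of the bucketing effect, on invented data: reads drawn from a synthetic reference are compressed by gzip once in arbitrary order and once grouped by a minimizer-style bucket key, so that overlapping reads land near each other inside the codec's 32 KB window. Mince's real bucket assignment and per-bucket encodings are far more sophisticated, and the gain on this toy input is accordingly modest.

# Why bucketing helps: group similar reads before a generic codec.
import gzip, random

random.seed(1)
ref = ''.join(random.choice('ACGT') for _ in range(100_000))
reads = [ref[p:p + 60] for p in (random.randrange(99_940) for _ in range(5000))]

def bucket_key(read, k=12):
    return min(read[i:i + k] for i in range(len(read) - k + 1))

def gz_size(rs):
    return len(gzip.compress('\n'.join(rs).encode(), 9))

print('arbitrary order:', gz_size(reads))
print('bucketed order :', gz_size(sorted(reads, key=bucket_key)))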
Collapse
Affiliation(s)
- Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794-4400, USA
| | - Carl Kingsford
- Department of Computational Biology, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA
| |
Collapse
|
40
|
Alves F, Cogo V, Wandelt S, Leser U, Bessani A. On-Demand Indexing for Referential Compression of DNA Sequences. PLoS One 2015; 10:e0132460. [PMID: 26146838 PMCID: PMC4493149 DOI: 10.1371/journal.pone.0132460] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2015] [Accepted: 06/15/2015] [Indexed: 12/15/2022] Open
Abstract
The decreasing cost of genome sequencing is creating a demand for scalable storage and processing tools and techniques to deal with the large amounts of generated data. Referential compression is one of these techniques, in which the similarity between the DNA of organisms of the same or an evolutionarily close species is exploited to reduce the storage demands of genome sequences by up to 700 times. The general idea is to store in the compressed file only the differences between the to-be-compressed sequence and a well-known reference sequence. In this paper, we propose a method for improving the performance of referential compression by removing the most costly phase of the process, the complete reference indexing. Our approach, called On-Demand Indexing (ODI), compresses human chromosomes five to ten times faster than other state-of-the-art tools (on average), while achieving similar compression ratios.
Collapse
Affiliation(s)
| | | | | | - Ulf Leser
- WBI, Humboldt-Universität zu Berlin, Berlin, Germany
| | | |
Collapse
|
41
|
Deorowicz S, Danek A, Niemiec M. GDC 2: Compression of large collections of genomes. Sci Rep 2015; 5:11565. [PMID: 26108279 PMCID: PMC4479802 DOI: 10.1038/srep11565] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2015] [Accepted: 05/28/2015] [Indexed: 01/18/2023] Open
Abstract
The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing is also becoming an important aid in personalized medicine. One significant side effect of this change is the necessity of storing and transferring huge amounts of genomic data. In this paper we deal with the problem of compression of large collections of complete genomic sequences. We propose an algorithm that is able to compress the collection of 1092 human diploid genomes about 9,500 times. This result is about 4 times better than that offered by the other existing compressors. Moreover, our algorithm is very fast, as it processes the data at a speed of 200 MB/s on a modern workstation. As a consequence, the proposed algorithm allows storing complete genomic collections at low cost; e.g., the examined collection of 1092 human genomes needs only about 700 MB when compressed, which compares to about 6.7 TB of uncompressed FASTA files. The source code is available at http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=gdc&subpage=about.
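The core referential idea can be tried with the standard library alone: compress every genome after the first against a reference supplied as a zlib preset dictionary, so near-identical genomes almost vanish. Note that zlib sees only the last 32 KB of the dictionary, and this sketch is nowhere near GDC 2's two-level factorization in either ratio or speed.

# Collection compression via a reference as zlib preset dictionary.
import zlib

def compress_collection(genomes):
    ref = genomes[0].encode()
    blobs = [zlib.compress(ref, 9)]                  # reference stored normally
    for g in genomes[1:]:
        c = zlib.compressobj(9, zdict=ref)           # encode relative to reference
        blobs.append(c.compress(g.encode()) + c.flush())
    return blobs

def decompress_collection(blobs):
    ref = zlib.decompress(blobs[0])
    genomes = [ref.decode()]
    for b in blobs[1:]:
        d = zlib.decompressobj(zdict=ref)
        genomes.append((d.decompress(b) + d.flush()).decode())
    return genomes

coll = ["ACGT" * 5000, "ACGT" * 4999 + "TTTT", "ACGT" * 5000]
blobs = compress_collection(coll)
assert decompress_collection(blobs) == coll
print([len(b) for b in blobs])                       # later blobs are tiny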
Collapse
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | - Agnieszka Danek
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland
| | | |
Collapse
|
42
|
Pratas D, Silva RM, Pinho AJ, Ferreira PJ. An alignment-free method to find and visualise rearrangements between pairs of DNA sequences. Sci Rep 2015; 5:10203. [PMID: 25984837 PMCID: PMC4434998 DOI: 10.1038/srep10203] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2014] [Accepted: 04/07/2015] [Indexed: 12/19/2022] Open
Abstract
Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.
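A loose, assumption-laden illustration of alignment-free rearrangement mapping: slice both sequences into fixed windows, describe each window by its k-mer set, and map every target window to the reference window with the highest Jaccard similarity; jumps in the resulting mapping hint at rearranged blocks. The published method relies on compression-based data models rather than plain Jaccard scores.

# Window-to-window mapping via k-mer Jaccard similarity.
def kmer_set(s, k=11):
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def window_map(target, reference, w=1000, k=11):
    ref_wins = [kmer_set(reference[i:i + w], k) for i in range(0, len(reference), w)]
    mapping = []
    for i in range(0, len(target), w):
        tw = kmer_set(target[i:i + w], k)
        scores = [len(tw & rw) / max(1, len(tw | rw)) for rw in ref_wins]
        mapping.append(scores.index(max(scores)))     # best reference window
    return mapping    # e.g. [0, 1, 2, 5, 3, 4] suggests a relocated block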
Collapse
|
43
|
Matos LMO, Neves AJR, Pratas D, Pinho AJ. MAFCO: a compression tool for MAF files. PLoS One 2015; 10:e0116082. [PMID: 25816229 PMCID: PMC4376647 DOI: 10.1371/journal.pone.0116082] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2014] [Accepted: 12/05/2014] [Indexed: 01/03/2023] Open
Abstract
In the last decade, the cost of genomic sequencing has been decreasing so much that researchers all over the world accumulate huge amounts of data for present and future use. These genomic data need to be efficiently stored, because storage cost is not decreasing as fast as the cost of sequencing. In order to overcome this problem, the most popular general-purpose compression tool, gzip, is usually used. However, such general-purpose tools were not specifically designed to compress this kind of data, and often fall short when the intention is to reduce the data size as much as possible. There are several compression algorithms available, even for genomic data, but very few have been designed to deal with whole-genome alignments, which contain alignments between entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain of 34% to 57%, depending on the data set. When compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.
Collapse
Affiliation(s)
- Luís M. O. Matos
- Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810–193 Aveiro, Portugal
| | - António J. R. Neves
- Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810–193 Aveiro, Portugal
| | - Diogo Pratas
- Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810–193 Aveiro, Portugal
| | - Armando J. Pinho
- Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810–193 Aveiro, Portugal
| |
Collapse
|
44
|
Grabowski S, Deorowicz S, Roguski Ł. Disk-based compression of data from genome sequencing. Bioinformatics 2015; 31:1389-95. [PMID: 25536966 DOI: 10.1093/bioinformatics/btu844] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2014] [Accepted: 12/17/2014] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-coverage sequencing data have significant, yet hard-to-exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions to this problem are disk-based; the better of these, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. RESULTS We propose ORCOM (overlapping reads compression with minimizers), a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers to obtain 0.317 bits per base as the compression ratio, allowing the 134.0 Gbp dataset to fit into only 5.31 GB of space. AVAILABILITY AND IMPLEMENTATION http://sun.aei.polsl.pl/orcom under a free license. CONTACT sebastian.deorowicz@polsl.pl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
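The minimizer trick the method builds on can be shown in isolation: the canonical minimizer of a read is the lexicographically smallest k-mer over the read and its reverse complement, and overlapping reads usually share it, so binning by minimizer brings redundant reads together for cheap, disk-friendly encoding. The per-bin encoders, which do the real compression work, are omitted here.

# Canonical minimizers and read binning.
from collections import defaultdict

COMP = str.maketrans('ACGT', 'TGCA')

def minimizer(read, k=12):
    rc = read.translate(COMP)[::-1]                  # reverse complement
    return min(min(s[i:i + k] for i in range(len(s) - k + 1)) for s in (read, rc))

def bin_reads(reads, k=12):
    bins = defaultdict(list)
    for r in reads:
        bins[minimizer(r, k)].append(r)              # one bin per minimizer
    return bins                                      # each bin -> own encoder/file

print(len(bin_reads(["ACGTACGTTTGACCAG", "CGTACGTTTGACCAGG", "GGGGCCCCAAAATTTT"])))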
Collapse
Affiliation(s)
- Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| | - Sebastian Deorowicz
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| | - Łukasz Roguski
- Institute of Applied Computer Science, Lodz University of Technology, Al. Politechniki 11, 90-924 Łódź, Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warszawa, Poland and Centro Nacional de Análisis Genómico (CNAG), 08-028 Barcelona, Spain
| |
Collapse
|
45
|
Ochoa I, Hernaez M, Weissman T. iDoComp: a compression scheme for assembled genomes. Bioinformatics 2015; 31:626-634. [DOI: 10.1093/bioinformatics/btu698]
Abstract
MOTIVATION With the release of the latest next-generation sequencing (NGS) machine, the HiSeq X by Illumina, the cost of sequencing a human has dropped to a mere $4000. Thus we are approaching a milestone in sequencing history, known as the $1000 genome era, where the sequencing of individuals is affordable, opening the doors to effective personalized medicine. Massive generation of genomic data, including assembled genomes, is expected in the following years. There is a crucial need for genome compression methods guaranteed to perform well simultaneously on different species, from simple bacteria to humans, which will ease their transmission, dissemination and analysis. Further, most of the new genomes to be compressed will correspond to individuals of a species from which a reference already exists in the database. Thus, it is natural to propose compression schemes that assume and exploit the availability of such references. RESULTS We propose iDoComp, a compressor of assembled genomes presented in FASTA format that compresses an individual genome using a reference genome for both the compression and the decompression. In terms of compression efficiency, iDoComp outperforms previously proposed algorithms in most of the studied cases, with comparable or better running time. For example, we observe compression gains of up to 60% in several cases, including H. sapiens data, when comparing with the best compression performance among the previously proposed algorithms. AVAILABILITY iDoComp is written in C and can be downloaded from: http://www.stanford.edu/~iochoa/iDoComp.html (we also provide a full explanation of how to run the program and an example with all the necessary files to run it).
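Reference-based decompression, the inverse step any such scheme needs, reduces to replaying instructions against the reference. The instruction format below (copy tuples and literal strings) is invented for illustration; iDoComp's actual triples and their entropy coding differ.

# Rebuilding a target genome from copy/literal instructions.
def apply_instructions(reference, instructions):
    parts = []
    for ins in instructions:
        if isinstance(ins, tuple):          # (position, length) copy from reference
            pos, length = ins
            parts.append(reference[pos:pos + length])
        else:                               # literal string (e.g. a SNP base)
            parts.append(ins)
    return ''.join(parts)

ref = "ACGTACGTTTGACCAGGAT"
print(apply_instructions(ref, [(0, 8), "C", (9, 10)]))  # one substitution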
Collapse
Affiliation(s)
- Idoia Ochoa
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA
| | - Mikel Hernaez
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA
| | - Tsachy Weissman
- Department of Electrical Engineering, Stanford University, 350 Serra Mall, Stanford, CA
| |
Collapse
|
46
|
Danek A, Deorowicz S, Grabowski S. Indexes of large genome collections on a PC. PLoS One 2014; 9:e109384. [PMID: 25289699 PMCID: PMC4188820 DOI: 10.1371/journal.pone.0109384] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2014] [Accepted: 09/07/2014] [Indexed: 02/02/2023] Open
Abstract
The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries (of average length 150 bp) are handled in an average time of 39 µs, and with up to 3 mismatches in 373 µs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.
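The memory/speed dial such an index exposes can be sketched with a sampled k-mer index: store positions for every S-th k-mer only, then re-anchor the query at up to S offsets at lookup time. Larger S shrinks the index and slows queries, mirroring MuGI's customisable index size; searching the compressed differences of a whole genome collection, MuGI's distinctive part, is not attempted here.

# Sampled k-mer index: index size traded against query time.
from collections import defaultdict

K, S = 8, 4          # k-mer length and sampling step (the size dial)

def build_sampled_index(text):
    index = defaultdict(list)
    for i in range(0, len(text) - K + 1, S):      # keep every S-th k-mer only
        index[text[i:i + K]].append(i)
    return index

def find(pattern, text, index):
    hits = set()
    # complete as long as len(pattern) >= K + S - 1: some sampled k-mer
    # must start within the first S positions of any true occurrence
    for off in range(min(S, len(pattern) - K + 1)):
        for p in index.get(pattern[off:off + K], ()):
            start = p - off
            if start >= 0 and text[start:start + len(pattern)] == pattern:
                hits.add(start)
    return sorted(hits)

text = "ACGTACGTTTGACCAGGATACGTTTGACCA" * 10
print(find("TTGACCAGGAT", text, build_sampled_index(text)))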
Collapse
Affiliation(s)
- Agnieszka Danek
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Szymon Grabowski
- Institute of Applied Computer Science, Lodz University of Technology, Łódź, Poland
| |
Collapse
|
47
|
Abstract
Reference-free SNP detection, that is, identifying SNPs between samples directly by comparing primary sequencing data with other primary sequencing data rather than with a pre-assembled reference genome, is an emerging and potentially disruptive technology that is beginning to open up new vistas in variant identification, revealing new applications in non-model organisms and metagenomics. The modern, efficient data structures these tools use enable researchers with a reference sequence to sample many more individuals with lower computing storage and processing overhead. In this article we discuss the technologies and tools implementing reference-free SNP detection and the potential impact on studies of genetic variation in model and non-model organisms, metagenomics, and personal genomics and medicine.
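One concrete flavor of the idea, heavily simplified: a k-mer present in sample A but absent from sample B, whose single-base variant at the central position does occur in B, is a SNP candidate obtained with no assembly or reference at all. Production tools detect such events as bubbles in a de Bruijn graph; the brute-force version below only conveys the principle.

# Reference-free SNP candidates from k-mer set differences.
def kmer_set(reads, k=9):
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

def candidate_snps(reads_a, reads_b, k=9):
    ka, kb = kmer_set(reads_a, k), kmer_set(reads_b, k)
    mid = k // 2
    snps = []
    for kmer in ka - kb:                              # k-mers unique to sample A
        for alt in 'ACGT':
            if alt == kmer[mid]:
                continue
            variant = kmer[:mid] + alt + kmer[mid + 1:]
            if variant in kb:                         # one-base variant seen in B
                snps.append((kmer, variant))
    return snps

a = ["ACGTACGTTTGACCAGG", "CGTACGTTTGACC"]
b = ["ACGTACGATTGACCAGG", "CGTACGATTGACC"]            # T -> A at one site
print(candidate_snps(a, b))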
Collapse
|
48
|
|