1
Silva JM, Almeida JR. Enhancing metagenomic classification with compression-based features. Artif Intell Med 2024; 156:102948. PMID: 39173422; DOI: 10.1016/j.artmed.2024.102948.
Abstract
Metagenomics is a rapidly expanding field that uses next-generation sequencing technology to analyze the genetic makeup of environmental samples. However, accurately identifying the organisms in a metagenomic sample can be complex, and traditional reference-based methods may not be effective in some instances. In this study, we present a novel approach for metagenomic identification that uses data compressors as features for taxonomic classification. By evaluating a comprehensive set of compressors, both general-purpose and genomic-specific, we demonstrate the effectiveness of this method in accurately identifying organisms in metagenomic samples. The results indicate that combining features from multiple compressors helps identify taxonomy: the method achieved an overall accuracy of 95% on an imbalanced dataset containing classes with few samples. The study also showed that the correlation between compression ratio and classification performance is insignificant, highlighting the need for a multi-faceted approach to metagenomic identification. This approach offers a significant advancement in the field of metagenomics, providing a reference-less method for taxonomic identification that is both effective and efficient while revealing insights into the statistical and algorithmic nature of genomic data. The code to validate this study is publicly available at https://github.com/ieeta-pt/xgTaxonomy.
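The feature idea behind this entry can be illustrated with general-purpose compressors from the Python standard library: compress a sequence with several compressors and use the per-compressor bit rates as a feature vector for a downstream classifier. This is a minimal sketch, not the paper's pipeline; the function name and the choice of three compressors are illustrative.

```python
import bz2
import lzma
import zlib

def compression_features(seq: bytes) -> list[float]:
    """Bits per base achieved by several general-purpose compressors.

    Each compressor captures different statistical regularities, so the
    vector of per-compressor rates can serve as input features for a
    taxonomic classifier.
    """
    compressors = (zlib.compress, bz2.compress, lzma.compress)
    return [len(c(seq)) * 8.0 / len(seq) for c in compressors]

# A highly repetitive toy "read" compresses to well under 2 bits/base.
features = compression_features(b"ACGT" * 1000)
```

Genomic-specific compressors (as evaluated in the paper) would be invoked the same way, each contributing one more coordinate to the feature vector.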
2
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2024:2023.04.15.536996. PMID: 37131636; PMCID: PMC10153118; DOI: 10.1101/2023.04.15.536996.
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate that it can align genes, plasmids, or entire sequencing experiments against all bacteria sequenced through 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
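The core idea can be caricatured in a few lines: order genomes by a traversal of the phylogeny so that evolutionarily close (hence similar) sequences sit next to each other, then hand the concatenation to an off-the-shelf compressor, which exploits their mutual redundancy. The nested-tuple tree and toy genomes below are invented for illustration; the actual tool chain is far more elaborate.

```python
import lzma

def leaf_order(tree):
    """Left-to-right leaf order of a nested-tuple phylogeny."""
    if isinstance(tree, str):
        return [tree]
    order = []
    for child in tree:
        order.extend(leaf_order(child))
    return order

# Toy genomes: two closely related pairs, (a, b) and (c, d).
genomes = {
    "a": b"ACGTACGTAC" * 20,
    "b": b"ACGTACGTAC" * 19 + b"ACGTACGTTC",
    "c": b"TTGCAATGCA" * 20,
    "d": b"TTGCAATGCA" * 19 + b"TTGCAATGGA",
}
tree = (("a", "b"), ("c", "d"))

# Concatenate in phylogenetic order, then compress losslessly.
ordered = b"".join(genomes[name] for name in leaf_order(tree))
compressed = lzma.compress(ordered)
```

On real collections the window of a generic compressor cannot span millions of genomes, which is why grouping similar genomes adjacently (rather than in arbitrary accession order) matters.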
3
Ferraro Petrillo U, Palini F, Cattaneo G, Giancarlo R. FASTA/Q data compressors for MapReduce-Hadoop genomics: space and time savings made easy. BMC Bioinformatics 2021; 22:144. PMID: 33752596; PMCID: PMC7986029; DOI: 10.1186/s12859-021-04063-1.
Abstract
BACKGROUND Storage of genomic data is a major cost for the life sciences, effectively addressed via specialized data compression methods. For the same reasons of abundance in data production, Big Data technologies, with MapReduce-Hadoop as the leader, are seen as the future for genomic data storage and processing. Somewhat surprisingly, none of the specialized FASTA/Q compressors is available within Hadoop, and their deployment there is not immediate. This state of the art is problematic. RESULTS We provide major advances in two directions. Methodologically, we propose two general methods, with the corresponding software, that make it very easy to deploy a specialized FASTA/Q compressor within MapReduce-Hadoop for processing files stored on the distributed Hadoop File System, with very little knowledge of Hadoop. Practically, we provide evidence that deploying those specialized compressors within Hadoop, not available so far, results in better space savings, and even in better execution times over compressed data, than using the generic compressors available in Hadoop, in particular for FASTQ files. Finally, we observe that these results also hold for the Apache Spark framework when used to process FASTA/Q files stored on the Hadoop File System. CONCLUSIONS Our methods and the corresponding software substantially contribute to space and time savings for the storage and processing of FASTA/Q files in Hadoop and Spark. Since our approach is general, it can very likely also be applied to FASTA/Q compression methods that appear in the future. AVAILABILITY The software and the datasets are available at https://github.com/fpalini/fastdoopc.
Affiliation(s)
- Francesco Palini
- Dipartimento di Scienze Statistiche, Università di Roma - La Sapienza, Rome, Italy
- Giuseppe Cattaneo
- Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, Italy
4
Silva M, Pratas D, Pinho AJ. Efficient DNA sequence compression with neural networks. Gigascience 2020; 9:giaa119. PMID: 33179040; PMCID: PMC7657843; DOI: 10.1093/gigascience/giaa119.
Abstract
BACKGROUND The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS We benchmark GeCo3 as a reference-free DNA compressor on 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 on 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
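The mixing step described above, taking per-symbol probabilities from several models and blending them adaptively, can be caricatured with a single softmax-weighted linear mixture updated online by gradient descent on the coding cost. The two fixed "models" below stand in for context models, and the single-layer update is a deliberate simplification of GeCo3's network.

```python
import math

def softmax(w):
    z = [math.exp(x) for x in w]
    s = sum(z)
    return [x / s for x in z]

def mix_and_adapt(model_probs, symbols, lr=0.5):
    """Online mixing of per-symbol probability vectors from several models.

    After each observed symbol, the mixing weights take a gradient step
    on the negative log-likelihood of the linear mixture, moving toward
    the models that assigned the symbol higher probability.
    """
    w = [0.0] * len(model_probs)
    cost_bits = 0.0
    for sym in symbols:
        g = softmax(w)
        mixed = sum(g[m] * model_probs[m][sym] for m in range(len(w)))
        cost_bits += -math.log2(mixed)  # ideal code length for this symbol
        for m in range(len(w)):
            # d(-ln mixed)/dw_m = -g_m * (q_m - mixed) / mixed
            w[m] += lr * g[m] * (model_probs[m][sym] - mixed) / mixed
    return softmax(w), cost_bits

# Model 0 matches the data (mostly symbol 0); model 1 is uniform.
models = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
weights, bits = mix_and_adapt(models, [0] * 50 + [1, 2] + [0] * 50)
```

Because the mixer consumes only the models' probabilities, exactly the portability property claimed in the abstract, it can sit on top of any set of predictive models.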
Affiliation(s)
- Milton Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Armando J Pinho
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
5
Chen J, Cai Y, Xu R, Pan J, Zhou J, Mei J. Identification of four hub genes as promising biomarkers to evaluate the prognosis of ovarian cancer in silico. Cancer Cell Int 2020; 20:270. PMID: 32595417; PMCID: PMC7315561; DOI: 10.1186/s12935-020-01361-1.
Abstract
BACKGROUND Ovarian cancer (OvCa) is one of the most fatal cancers among females in the world. With growing numbers of diagnosed individuals dying of OvCa, it is urgent to further explore the potential mechanisms of OvCa oncogenesis and development and related biomarkers. METHODS The gene expression profiles of GSE49997 were downloaded from the Gene Expression Omnibus (GEO) database. Weighted gene co-expression network analysis (WGCNA) was applied to identify the gene modules most strongly associated with the overall survival (OS) and progression-free survival (PFS) events of OvCa patients, and the prognostic values of these genes were assessed and validated on data from the training and validation sets. Next, protein-protein interaction (PPI) networks were built with GeneMANIA, and enrichment analysis was conducted using the DAVID website. RESULTS The WGCNA identified a total of eight modules, and four hub genes (MM > 0.90) in the blue module were retained for further analysis. Kaplan-Meier analysis showed that these four hub genes were significantly associated with worse OS and PFS in the patient cohort from GSE49997. Moreover, we validated their short-term (4-year) and long-term prognostic values on the GSE9891 data. Finally, PPI network analysis and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses revealed several potential mechanisms by which the four hub genes and their co-operators participate in OvCa progression. CONCLUSION Four hub genes (COL6A3, CRISPLD2, FBN1 and SERPINF1) were identified as associated with prognosis in OvCa and might serve as monitoring biomarkers to evaluate the survival time of OvCa patients.
Affiliation(s)
- Jingxuan Chen
- School of Basic Medical Sciences, Nanjing Medical University, Nanjing, 211166 China
- Cytoskeleton Research Group & First Clinical Medicine College, Nanjing Medical University, No. 101 Longmian Road, Nanjing, 211166 China
- Yun Cai
- Department of Bioinformatics, Nanjing Medical University, Nanjing, 211166 China
- Rui Xu
- School of Basic Medical Sciences, Nanjing Medical University, Nanjing, 211166 China
- Cytoskeleton Research Group & First Clinical Medicine College, Nanjing Medical University, No. 101 Longmian Road, Nanjing, 211166 China
- Jiadong Pan
- First Clinical Medicine College, Nanjing Medical University, Nanjing, 211166 China
- Jie Zhou
- Department of Gynecology and Obstetrics, Affiliated Wuxi Maternal and Child Health Hospital of Nanjing Medical University, No.48, Huaishu Road, Wuxi, 214023 China
- Jie Mei
- Cytoskeleton Research Group & First Clinical Medicine College, Nanjing Medical University, No. 101 Longmian Road, Nanjing, 211166 China
- First Clinical Medicine College, Nanjing Medical University, Nanjing, 211166 China
6
Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One 2020; 15:e0232942. PMID: 32453750; PMCID: PMC7250429; DOI: 10.1371/journal.pone.0232942.
Abstract
The recent decrease in the cost and time to sequence and assemble complete genomes has created an increased demand for data storage. As a consequence, several strategies for compressing assembled biological data were created. Vertical compression tools implement strategies that take advantage of the high level of similarity between multiple assembled genomic sequences to achieve better compression. However, current reviews on vertical compression do not compare the execution flow of each tool, which consists of phases of preprocessing, transformation, and data encoding. We performed a systematic literature review to identify and compare existing tools for vertical compression of assembled genomic sequences. The review was centered on PubMed and Scopus, in which 45726 distinct papers were considered. Next, 32 papers were selected according to the following criteria: to present a lossless vertical compression tool; to use the information contained in other sequences for the compression; to be able to manipulate genomic sequences in FASTA format; and to require no prior knowledge. Although we extracted compression performance results, they were not compared, because the tools did not use a standardized evaluation protocol. We therefore conclude that the field lacks a defined evaluation protocol to be applied by every tool.
Affiliation(s)
- Kelvin V. Kredens
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Juliano V. Martins
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Osmar B. Dordal
- Polytechnic School, Centro Universitário UniDomBosco, Curitiba, Paraná, Brazil
- Mauri Ferrandin
- Department of Control, Automation and Computing Engineering, Universidade Federal de Santa Catarina (UFSC), Blumenau, Brazil
- Roberto H. Herai
- Graduate Program in Health Sciences, School of Medicine, Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba, Paraná, Brazil
- Edson E. Scalabrin
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
- Bráulio C. Ávila
- Graduate Program in Informatics (PPGia), Pontifícia Universidade Católica do Paraná, Curitiba, Paraná, Brazil
7
Ferraro Petrillo U, Roscigno G, Cattaneo G, Giancarlo R. Informational and linguistic analysis of large genomic sequence collections via efficient Hadoop cluster algorithms. Bioinformatics 2019; 34:1826-1833. PMID: 29342232; DOI: 10.1093/bioinformatics/bty018.
Abstract
Motivation Information-theoretic and compositional/linguistic analysis of genomes has a central role in bioinformatics, even more so since the associated methodologies are becoming very valuable also for epigenomic and metagenomic studies. The kernel of those methods is the collection of k-mer statistics, i.e. how many times each k-mer in {A,C,G,T}^k occurs in a DNA sequence. Although this problem is computationally simple and efficiently solvable on a conventional computer, the sheer amount of data now available in applications demands parallel and distributed computing. Indeed, such algorithms have been developed to collect k-mer statistics in the realm of genome assembly. However, they are so specialized to this domain that they do not extend easily to the computation of informational and linguistic indices, concurrently on sets of genomes. Results Following the well-established approach in many disciplines, with growing success also in bioinformatics, of resorting to MapReduce and Hadoop to deal with 'Big Data' problems, we present KCH, the first set of MapReduce algorithms able to perform concurrent informational and linguistic analysis of large collections of genomic sequences on a Hadoop cluster. The benchmarking of KCH that we provide indicates that it is quite effective and versatile. It is also competitive with respect to the parallel and distributed algorithms highly specialized to k-mer statistics collection for genome assembly problems. In conclusion, KCH is a much-needed addition to the growing number of algorithms and tools that use MapReduce for bioinformatics core applications. Availability and implementation The software, including instructions for running it over Amazon AWS, as well as the datasets, are available at http://www.di-srv.unisa.it/KCH. Contact umberto.ferraro@uniroma1.it. Supplementary information Supplementary data are available at Bioinformatics online.
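The kernel of such analyses, k-mer statistics, fits MapReduce naturally: mappers emit per-sequence k-mer counts, reducers merge them, and informational indices such as the empirical k-mer entropy fall out of the merged counts. A minimal single-process sketch of this pattern (not KCH itself):

```python
import math
from collections import Counter
from functools import reduce

def count_kmers(seq: str, k: int) -> Counter:
    """Map phase: k-mer counts for one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def merge_counts(partials) -> Counter:
    """Reduce phase: merge per-sequence counts into a global table."""
    return reduce(lambda a, b: a + b, partials, Counter())

def empirical_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of the empirical k-mer distribution,
    one of the informational indices computable from the counts."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

counts = merge_counts([count_kmers("ACGTACGT", 2), count_kmers("ACGT", 2)])
h = empirical_entropy(count_kmers("ACGTACGT", 2))
```

On a Hadoop cluster the two phases run on different nodes with a shuffle in between; the point of KCH is engineering that pipeline so that many indices are computed concurrently over many genomes.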
Affiliation(s)
- Gianluca Roscigno
- Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano, SA 84084, Italy
- Raffaele Giancarlo
- Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo 90133, Italy
8
Pathak S, Rajasekaran S. RETRACTED: LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 2019; 35:e1-e7. PMID: 31051040; PMCID: PMC7651991; DOI: 10.1093/bioinformatics/btu701.
Abstract
MOTIVATION Next-generation sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole-genome sequencing. One of the biggest challenges posed by modern sequencing technology is the economic storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques. RESULTS We introduce a new lossless, non-reference-based FASTQ compression algorithm named Lossless FastQ Compressor. We have compared our algorithm with other state-of-the-art big data compression algorithms, namely gzip, bzip2, fastqz, fqzcomp, G-SQZ, SCALCE, Quip, DSRC, and DSRC-LZ. This comparison reveals that our algorithm achieves better compression ratios. The improvement obtained is up to 225%. For example, on one of the datasets (SRR065390_1), the average improvement (over all the algorithms compared) is 74.62%. AVAILABILITY AND IMPLEMENTATION The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/∼rajasek/FastqPrograms.zip.
Affiliation(s)
- Sudipta Pathak
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA
- Sanguthevar Rajasekaran
- Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA
9
Ferraro Petrillo U, Sorella M, Cattaneo G, Giancarlo R, Rombo SE. Analyzing big datasets of genomic sequences: fast and scalable collection of k-mer statistics. BMC Bioinformatics 2019; 20:138. PMID: 30999863; PMCID: PMC6471689; DOI: 10.1186/s12859-019-2694-8.
Abstract
Background Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial to achieving good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. Results One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability. Conclusions We provide evidence that the use of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
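The load-balancing concern can be illustrated with the baseline FastKmer improves on: plain hash partitioning, where a stable hash assigns each k-mer to an aggregation node, so that all copies of a k-mer meet at the same reducer. This balances well only when no single k-mer bin dominates. A hypothetical sketch, not FastKmer's actual partitioner:

```python
from collections import defaultdict
from zlib import crc32

def partition(kmer: str, n_parts: int) -> int:
    """Stable hash-based assignment of a k-mer to an aggregation node.

    crc32 is used instead of Python's built-in hash(), which is salted
    per process and would send the same k-mer to different nodes on
    different runs.
    """
    return crc32(kmer.encode()) % n_parts

def shuffle(kmers, n_parts: int):
    """Group k-mers by destination partition (the 'shuffle' step)."""
    parts = defaultdict(list)
    for kmer in kmers:
        parts[partition(kmer, n_parts)].append(kmer)
    return parts

parts = shuffle(["ACGT", "CGTA", "ACGT", "TTTT"], 4)
```

Skewed inputs (a few extremely frequent k-mers) overload single partitions under this scheme, which is the problem FastKmer's balancing module addresses.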
Affiliation(s)
- Mara Sorella
- Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Università di Roma - La Sapienza, Rome, 00185, Italy
- Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano (SA), 84084, Italy
- Raffaele Giancarlo
- Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
- Simona E Rombo
- Dipartimento di Matematica ed Informatica, Università di Palermo, Palermo, 90133, Italy
10

11
Beal R, Afrin T, Farheen A, Adjeroh D. A new algorithm for "the LCS problem" with application in compressing genome resequencing data. BMC Genomics 2016; 17 Suppl 4:544. PMID: 27556803; PMCID: PMC5001248; DOI: 10.1186/s12864-016-2793-0.
Abstract
Background The longest common subsequence (LCS) problem is a classical problem in computer science, and forms the basis of the current best-performing reference-based compression schemes for genome resequencing data. Methods First, we present a new algorithm for the LCS problem. Using the generalized suffix tree, we identify the common substrings shared between the two input sequences. Using the maximal common substrings, we construct a directed acyclic graph (DAG), based on which we determine the LCS as the longest path in the DAG. Then, we introduce an LCS-motivated reference-based compression scheme using the components of the LCS, rather than the LCS itself. Results Our basic scheme compressed the Homo sapiens genome (with an original size of 3,080,436,051 bytes) to 15,460,478 bytes. An improvement on the basic method further reduced this to 8,556,708 bytes, or an overall compression ratio of 360. This can be compared to the previous state-of-the-art compression ratios of 157 (Wang and Zhang, 2011) and 171 (Pinho, Pratas, and Garcia, 2011). Conclusion We propose a new algorithm to address the longest common subsequence problem. Motivated by our LCS algorithm, we introduce a new reference-based compression scheme for genome resequencing data. Comparative results against state-of-the-art reference-based compression algorithms demonstrate the performance of the proposed method.
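For orientation, the textbook quadratic dynamic program below computes an LCS. The paper's contribution is a different construction, maximal common substrings from a generalized suffix tree followed by a longest path in a DAG, which scales to genome-length inputs where an O(nm) table does not; the sketch here only fixes the problem being solved.

```python
def lcs(a: str, b: str) -> str:
    """Longest common subsequence via the classic O(len(a)*len(b)) table."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one LCS.
    out, i, j = [], n, m
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

A reference-based compressor in this spirit stores the target as the components shared with the reference plus the small differences, which is why LCS quality translates directly into compression ratio.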
Affiliation(s)
- Richard Beal
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
- Tazin Afrin
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
- Aliya Farheen
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
- Donald Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University, Morgantown, WV, USA
12
Eric PV, Gopalakrishnan G, Karunakaran M. An Optimal Seed Based Compression Algorithm for DNA Sequences. Adv Bioinformatics 2016; 2016:3528406. PMID: 27555868; PMCID: PMC4983397; DOI: 10.1155/2016/3528406.
Abstract
This paper proposes a seed-based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repetition structures inherent in DNA sequences by creating an offline dictionary that contains all such repeats along with the details of mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio on par with or better than existing lossless DNA sequence compression algorithms.
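The repeat dictionary can be sketched with exact k-mer seeds: record the first position of every seed and report each later re-occurrence as an (earlier position, later position, length) candidate that a substitution scheme could encode as a back-reference. The real algorithm additionally extends seeds and tolerates "promising" mismatches; this sketch handles only exact seeds.

```python
def seed_repeats(seq: str, k: int = 4):
    """Exact-repeat candidates seeded by k-mers (first occurrence wins)."""
    first = {}      # k-mer -> position of its first occurrence
    repeats = []    # (earlier_pos, later_pos, length) candidates
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in first:
            repeats.append((first[kmer], i, k))
        else:
            first[kmer] = i
    return repeats

hits = seed_repeats("ACGTTTTTACGT", k=4)
```

Each hit is a substitution opportunity: the later occurrence can be replaced by a reference to the earlier one, which is the essence of the Lempel-Ziv family the paper builds on.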
Affiliation(s)
- Pamela Vinitha Eric
- Department of Information Science and Engineering, Rajiv Gandhi Institute of Technology, Bangalore 560032, India
- Gopakumar Gopalakrishnan
- Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India
- Muralikrishnan Karunakaran
- Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India
13
Wu TD. Bitpacking techniques for indexing genomes: I. Hash tables. Algorithms Mol Biol 2016; 11:5. PMID: 27095998; PMCID: PMC4835851; DOI: 10.1186/s13015-016-0069-5.
Abstract
Background Hash tables constitute a widely used data structure for indexing genomes that provides a list of genomic positions for each possible oligomer of a given size. The offset array in a hash table grows exponentially with the oligomer size and precludes the use of larger oligomers that could facilitate rapid alignment of sequences to a genome. Results We propose to compress the offset array using vectorized bitpacking. We introduce an algorithm and data structure called BP64-columnar that achieves fast random access in arrays of monotonically nondecreasing integers. Experimental results based on hash tables for the fly, chicken, and human genomes show that BP64-columnar is 3 to 4 times faster than publicly available implementations of universal coding schemes, such as Elias gamma, Elias delta, and Fibonacci compression. Furthermore, among vectorized bitpacking schemes, our BP64-columnar format yields retrieval times that are faster than the fastest known bitpacking format by a factor of 3 for retrieving a single value, and a factor of 2 for retrieving two adjacent values. Conclusions Our BP64-columnar scheme enables compression of genomic hash tables with fast retrieval. It also has potential applications to other domains requiring differential coding with random access. Electronic supplementary material The online version of this article (doi:10.1186/s13015-016-0069-5) contains supplementary material, which is available to authorized users.
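Fixed-width bitpacking with random access, the primitive underlying such offset-array compression, can be sketched in pure Python. The paper's BP64-columnar layout and its vectorized decoding are considerably more involved; this shows only the basic pack/random-read contract.

```python
def pack(values, width):
    """Pack nonnegative integers into a big-endian bitstream of fixed width."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        acc = (acc << width) | v
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # left-align the tail bits
    return bytes(out)

def get(packed, width, i):
    """Random access: decode the i-th width-bit value without unpacking all."""
    start, end = i * width, (i + 1) * width
    b0, b1 = start // 8, (end + 7) // 8   # bytes covering bits [start, end)
    chunk = int.from_bytes(packed[b0:b1], "big")
    return (chunk >> (b1 * 8 - end)) & ((1 << width) - 1)

values = [3, 17, 0, 255, 42]
packed = pack(values, 9)  # 9 bits each -> 45 bits -> 6 bytes
```

Because every value occupies exactly `width` bits, the position of the i-th value is a pure arithmetic function of i, which is what makes fast random access (and SIMD batch decoding in the paper) possible.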
14
Utro F, Di Benedetto V, Corona DF, Giancarlo R. The intrinsic combinatorial organization and information theoretic content of a sequence are correlated to the DNA encoded nucleosome organization of eukaryotic genomes. Bioinformatics 2015; 32:835-42. DOI: 10.1093/bioinformatics/btv679.
Abstract
Motivation: Thanks to research spanning nearly 30 years, two major models have emerged that account for nucleosome organization in chromatin: statistical and sequence specific. The first is based on elegant, easy to compute, closed-form mathematical formulas that make no assumptions of the physical and chemical properties of the underlying DNA sequence. Moreover, they need no training on the data for their computation. The latter is based on some sequence regularities but, as opposed to the statistical model, it lacks the same type of closed-form formulas that, in this case, should be based on the DNA sequence only.
Results: We contribute to close this important methodological gap between the two models by providing three very simple formulas for the sequence specific one. They are all based on well-known formulas in Computer Science and Bioinformatics, and they give different quantifications of how complex a sequence is. In view of how remarkably well they perform, it is very surprising that measures of sequence complexity have not even been considered as candidates to close the mentioned gap. We provide experimental evidence that the intrinsic level of combinatorial organization and information-theoretic content of subsequences within a genome are strongly correlated to the level of DNA encoded nucleosome organization discovered by Kaplan et al. Our results establish an important connection between the intrinsic complexity of subsequences in a genome and the intrinsic, i.e. DNA encoded, nucleosome organization of eukaryotic genomes. It is a first step towards a mathematical characterization of this latter ‘encoding’.
Supplementary information: Supplementary data are available at Bioinformatics online.
Contact: futro@us.ibm.com.
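The "measures of sequence complexity" invoked in the Results can be illustrated with two simple stand-ins (toy examples for intuition only, not the paper's three actual formulas): a combinatorial one based on distinct k-mer richness, and an information-theoretic one based on compressed size.

```python
import random
import zlib

def kmer_richness(seq, k=4):
    """Combinatorial complexity: fraction of distinct k-mers observed
    out of the maximum number possible for this window."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(kmers) / min(len(seq) - k + 1, 4 ** k)

def compression_ratio(seq):
    """Information-theoretic proxy: compressed size over raw size
    (lower means more redundant, i.e. less complex)."""
    raw = seq.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

random.seed(0)
repetitive = "AT" * 100                                        # low complexity
shuffled = "".join(random.choice("ACGT") for _ in range(200))  # high complexity
```

Both measures rank the repetitive window well below the shuffled one, mirroring the paper's point that quite different notions of complexity can agree on the same sequences.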
Affiliation(s)
- Filippo Utro, Computational Genomics Group, IBM T.J. Watson Research Center, Yorktown Heights, NY, USA
- Davide F.V. Corona, Dipartimento STEBICEF, Dulbecco Telethon Institute c/o Università di Palermo, Palermo, Italy
15
Matthews SJ. Heterogeneous Compression of Large Collections of Evolutionary Trees. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:807-814. [PMID: 26357320 DOI: 10.1109/tcbb.2014.2366756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Compressing heterogeneous collections of trees is an open problem in computational phylogenetics. In a heterogeneous tree collection, each tree can contain a unique set of taxa. An ideal compression method would allow for the efficient archival of large tree collections and enable scientists to identify common evolutionary relationships across disparate analyses. In this paper, we extend TreeZip to compress heterogeneous collections of trees. TreeZip is the most efficient algorithm for compressing homogeneous tree collections. To the best of our knowledge, no other domain-based compression algorithm exists for large heterogeneous tree collections, nor one that enables their rapid analysis. Our experimental results indicate that TreeZip averages 89.03 percent (72.69 percent) space savings on unweighted (weighted) collections of trees when the level of heterogeneity in a collection is moderate. The organization of the TRZ file allows for efficient computations over heterogeneous data; for example, consensus trees can be computed in mere seconds. Lastly, combining the TreeZip compressed (TRZ) file with general-purpose compression yields average space savings of 97.34 percent (81.43 percent) on unweighted (weighted) collections of trees. These results lead us to believe that TreeZip will prove invaluable for the efficient archival of tree collections and will enable scientists to develop novel methods for relating heterogeneous collections of trees.
16
Nicolae M, Pathak S, Rajasekaran S. LFQC: a lossless compression algorithm for FASTQ files. Bioinformatics 2015; 31:3276-81. [PMID: 26093148 DOI: 10.1093/bioinformatics/btv384] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/24/2015] [Accepted: 06/06/2015] [Indexed: 12/30/2022] Open
Abstract
Motivation: Next Generation Sequencing (NGS) technologies have revolutionized genomic research by reducing the cost of whole genome sequencing. One of the biggest challenges posed by modern sequencing technology is economic storage of NGS data. Storing raw data is infeasible because of its enormous size and high redundancy. In this article, we address the problem of storage and transmission of large FASTQ files using innovative compression techniques.
Results: We introduce a new lossless non-reference based FASTQ compression algorithm named Lossless FASTQ Compressor. We have compared our algorithm with other state of the art big data compression algorithms, namely gzip, bzip2, fastqz (Bonfield and Mahoney, 2013), fqzcomp (Bonfield and Mahoney, 2013), Quip (Jones et al., 2012) and DSRC2 (Roguski and Deorowicz, 2014). This comparison reveals that our algorithm achieves better compression ratios on LS454 and SOLiD datasets.
Availability and implementation: The implementations are freely available for non-commercial purposes. They can be downloaded from http://engr.uconn.edu/rajasek/lfqc-v1.1.zip.
Contact: rajasek@engr.uconn.edu.
Affiliation(s)
- Marius Nicolae, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA
- Sudipta Pathak, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA
- Sanguthevar Rajasekaran, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269-4155, USA
17
Giancarlo R, Rombo SE, Utro F. Epigenomic k-mer dictionaries: shedding light on how sequence composition influences in vivo nucleosome positioning. Bioinformatics 2015; 31:2939-46. [DOI: 10.1093/bioinformatics/btv295] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2014] [Accepted: 05/04/2015] [Indexed: 12/28/2022] Open
18
Weitschek E, Santoni D, Fiscon G, De Cola MC, Bertolazzi P, Felici G. Next generation sequencing reads comparison with an alignment-free distance. BMC Res Notes 2014; 7:869. [PMID: 25465386 PMCID: PMC4265526 DOI: 10.1186/1756-0500-7-869] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2014] [Accepted: 11/20/2014] [Indexed: 11/15/2022] Open
Abstract
Background: Next Generation Sequencing (NGS) machines extract from a biological sample a large number of short DNA fragments (reads). These reads are then used for several applications, e.g., sequence reconstruction, DNA assembly, gene expression profiling and mutation analysis.
Methods: We propose a method to evaluate the similarity between reads that does not rely on alignment; it is based on the distance between the frequencies of their substrings of fixed length (k-mers). We compare this alignment-free distance with the similarity measures derived from two alignment methods: Needleman-Wunsch and Blast. The comparison is based on a simple assumption: the most correct distance is obtained by knowing in advance the reference sequence. Therefore, we first align the reads on the original DNA sequence, compute the overlap between the aligned reads, and use this overlap as an ideal distance. We then verify how well the alignment-free and the alignment-based distances reproduce this ideal distance. The ability to correctly reproduce the ideal distance is evaluated over samples of read pairs from Saccharomyces cerevisiae, Escherichia coli, and Homo sapiens. The comparison is based on the correctness of threshold predictors cross-validated over different samples.
Results: We exhibit experimental evidence that the proposed alignment-free distance is a potentially useful read-to-read distance measure and performs better than the more time-consuming distances based on alignment.
Conclusions: Alignment-free distances may be used effectively for read comparison and may provide a significant speed-up in several processes based on NGS sequencing (e.g., DNA assembly, read classification).
Electronic supplementary material: The online version of this article (doi:10.1186/1756-0500-7-869) contains supplementary material, which is available to authorized users.
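The kind of k-mer frequency distance described above can be sketched as follows (a generic Euclidean variant with assumed parameters, not necessarily the authors' exact formula):

```python
from collections import Counter
from math import sqrt

def kmer_freqs(read, k):
    """Normalized k-mer frequency vector of a read."""
    counts = Counter(read[i:i + k] for i in range(len(read) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def kmer_distance(a, b, k=3):
    """Euclidean distance between the two frequency vectors;
    no alignment is ever computed."""
    fa, fb = kmer_freqs(a, k), kmer_freqs(b, k)
    return sqrt(sum((fa.get(x, 0.0) - fb.get(x, 0.0)) ** 2
                    for x in set(fa) | set(fb)))

read = "ACGTACGTGGCCATTACGT"
near = "ACGTACGTGGCCATTACGA"  # one substitution away
far = "TTTTGGGGAAAACCCCTTTT"  # unrelated composition
```

A read pair differing by a single substitution stays close in this metric, while a compositionally unrelated pair lands far away, which is the property a threshold predictor can exploit.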
Affiliation(s)
- Emanuel Weitschek, Department of Engineering, Roma Tre University, Via della Vasca Navale 79, 00146 Rome, Italy
19
Cánovas R, Moffat A, Turpin A. Lossy compression of quality scores in genomic data. Bioinformatics 2014; 30:2130-6. [DOI: 10.1093/bioinformatics/btu183] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
20
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
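The block-entropy estimation mentioned above reduces, in its simplest form, to the Shannon entropy of the empirical k-block distribution (a minimal sketch, with overlapping blocks assumed):

```python
from collections import Counter
from math import log2

def block_entropy(seq, k=1):
    """Shannon entropy, in bits, of the empirical distribution of
    overlapping k-blocks in seq."""
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in Counter(blocks).values())
```

A constant sequence has entropy 0, while an evenly balanced 4-letter sequence approaches the 2-bits-per-base maximum; increasing k captures progressively longer-range structure.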
Affiliation(s)
- Susana Vinga, IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097
21
Giancarlo R, Rombo SE, Utro F. Compressive biological sequence analysis and archival in the era of high-throughput sequencing technologies. Brief Bioinform 2013; 15:390-406. [DOI: 10.1093/bib/bbt088] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
22
Abstract
The exponential growth of high-throughput DNA sequence data has posed great challenges to genomic data storage, retrieval and transmission. Compression is a critical tool to address these challenges, where many methods have been developed to reduce the storage size of the genomes and sequencing data (reads, quality scores and metadata). However, genomic data are being generated faster than they could be meaningfully analyzed, leaving a large scope for developing novel compression algorithms that could directly facilitate data analysis beyond data transfer and storage. In this article, we categorize and provide a comprehensive review of the existing compression methods specialized for genomic data and present experimental results on compression ratio, memory usage, time for compression and decompression. We further present the remaining challenges and potential directions for future research.
23
Deorowicz S, Grabowski S. Data compression for sequencing data. Algorithms Mol Biol 2013; 8:25. [PMID: 24252160 PMCID: PMC3868316 DOI: 10.1186/1748-7188-8-25] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2013] [Accepted: 09/25/2013] [Indexed: 12/12/2022] Open
Abstract
Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question “why compression” in a quantitative manner. Then we also answer the questions “what” and “how”, by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question “why compression” and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.
24
Schwende I, Pham TD. Pattern recognition and probabilistic measures in alignment-free sequence analysis. Brief Bioinform 2013; 15:354-68. [PMID: 24096012 DOI: 10.1093/bib/bbt070] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/30/2022] Open
Abstract
With the massive production of genomic and proteomic data, the number of available biological sequences in databases has reached a level that is not feasible anymore for exact alignments even when just a fraction of all sequences is used. To overcome this inevitable time complexity, ultrafast alignment-free methods are studied. Within the past two decades, a broad variety of nonalignment methods have been proposed including dissimilarity measures on classical representations of sequences like k-words or Markov models. Furthermore, articles were published that describe distance measures on alternative representations such as compression complexity, spectral time series or chaos game representation. However, alignments are still the standard method for real world applications in biological sequence analysis, and the time efficient alignment-free approaches are usually applied in cases when the accustomed algorithms turn out to fail or be too inconvenient.
Affiliation(s)
- Isabel Schwende, Aizu Research Cluster for Medical Informatics and Engineering (ARC-Medical), Research Center for Advanced Information Science and Technology (CAIST), The University of Aizu, Aizuwakamatsu, Fukushima 965-8580, Japan
25
Ury AG. Storing and interpreting genomic information in widely deployed electronic health record systems. Genet Med 2013; 15:779-85. [DOI: 10.1038/gim.2013.111] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2013] [Accepted: 06/24/2013] [Indexed: 01/19/2023] Open
26
Compressing resequencing data with GReEn. Methods Mol Biol 2013. [PMID: 23872967 DOI: 10.1007/978-1-62703-514-9_2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
Genome sequencing centers are flooding the scientific community with data. A single sequencing machine can nowadays generate more data in one day than any existing machine could have produced throughout the entire year of 2005. Therefore, the pressure for efficient sequencing data compression algorithms is very high and is being felt worldwide. Here, we describe GReEn (Genome Resequencing Encoding), a compression tool recently proposed for compressing genome resequencing data using a reference genome sequence.
27
Bonfield JK, Mahoney MV. Compression of FASTQ and SAM format sequencing data. PLoS One 2013; 8:e59190. [PMID: 23533605 PMCID: PMC3606433 DOI: 10.1371/journal.pone.0059190] [Citation(s) in RCA: 149] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2012] [Accepted: 02/12/2013] [Indexed: 12/17/2022] Open
Abstract
Storage and transmission of the data produced by modern DNA sequencing instruments has become a major concern, which prompted the Pistoia Alliance to pose the SequenceSqueeze contest for compression of FASTQ files. We present several compression entries from the competition, Fastqz and Samcomp/Fqzcomp, including the winning entry. These are compared against existing algorithms for both reference based compression (CRAM, Goby) and non-reference based compression (DSRC, BAM) and other recently published competition entries (Quip, SCALCE). The tools are shown to be the new Pareto frontier for FASTQ compression, offering state of the art ratios at affordable CPU costs. All programs are freely available on SourceForge. Fastqz: https://sourceforge.net/projects/fastqz/, fqzcomp: https://sourceforge.net/projects/fqzcomp/, and samcomp: https://sourceforge.net/projects/samcomp/.
28
Comin M, Verzotto D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol 2012; 7:34. [PMID: 23216990 PMCID: PMC3549825 DOI: 10.1186/1748-7188-7-34] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2012] [Accepted: 11/29/2012] [Indexed: 11/24/2022] Open
Abstract
Background: With the progress of modern sequencing technologies, a large number of complete genomes are now available. Traditionally, the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or, more specifically, due to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods.
Methods: In this paper we propose a distance function based on subword composition called the Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of the genomes contributes only once, thus avoiding counting shared subwords multiple times. In a nutshell, this filter discards subwords occurring in regions covered by other, more significant subwords.
Results: The Underlying Approach builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of the input, without overlaps, and can be constructed efficiently. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state-of-the-art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information.
Availability: http://www.dei.unipd.it/∼ciompin/main/underlying.html
29
A novel statistical measure for sequence comparison on the basis of k-word counts. J Theor Biol 2012; 318:91-100. [PMID: 23147229 DOI: 10.1016/j.jtbi.2012.10.035] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2011] [Revised: 10/10/2012] [Accepted: 10/31/2012] [Indexed: 11/24/2022]
Abstract
Numerous efficient methods based on word counts have been proposed to characterize DNA sequences for comparison, retrieval from databases and reconstruction of evolutionary relationships. However, most of them seem unrelated to any intrinsic characteristic of DNA. In this paper, we propose a novel statistical measure for sequence comparison on the basis of k-word counts. The new measure removes the influence of sequence length and uncovers a bulk property of DNA sequences. The proposed measure was tested by similarity search and phylogenetic analysis; the experimental assessment demonstrated that it is efficient.
30
Popitsch N, von Haeseler A. NGC: lossless and lossy compression of aligned high-throughput sequencing data. Nucleic Acids Res 2012; 41:e27. [PMID: 23066097 PMCID: PMC3592443 DOI: 10.1093/nar/gks939] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023] Open
Abstract
A major challenge of current high-throughput sequencing experiments is not only the generation of the sequencing data itself but also their processing, storage and transmission. The enormous size of these data motivates the development of data compression algorithms usable for the implementation of the various storage policies that are applied to the produced intermediate and final result files. In this article, we present NGC, a tool for the compression of mapped short read data stored in the wide-spread SAM format. NGC enables lossless and lossy compression and introduces the following two novel ideas: first, we present a way to reduce the number of required code words by exploiting common features of reads mapped to the same genomic positions; second, we present a highly configurable way for the quantization of per-base quality values, which takes their influence on downstream analyses into account. NGC, evaluated with several real-world data sets, saves 33-66% of disc space using lossless and up to 98% disc space using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we could show that the lossy compression modes preserve >99% of all called variants while outperforming comparable methods in some configurations.
Affiliation(s)
- Niko Popitsch, Center for Integrative Bioinformatics Vienna, Max F Perutz Laboratories, University of Vienna, Medical University of Vienna, Dr Bohr Gasse 9, Vienna A-1030, Austria
31
Mohammed MH, Dutta A, Bose T, Chadaram S, Mande SS. DELIMINATE--a fast and efficient method for loss-less compression of genomic sequences: sequence analysis. Bioinformatics 2012; 28:2527-9. [PMID: 22833526 DOI: 10.1093/bioinformatics/bts467] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
Summary: An unprecedented quantity of genome sequence data is currently being generated using next-generation sequencing platforms. This has necessitated the development of novel bioinformatics approaches and algorithms that not only facilitate a meaningful analysis of these data but also aid in efficient compression, storage, retrieval and transmission of huge volumes of the generated data. We present a novel compression algorithm (DELIMINATE) that can rapidly compress genomic sequence data in a loss-less fashion. Validation results indicate relatively higher compression efficiency of DELIMINATE when compared with popular general purpose compression algorithms, namely, gzip, bzip2 and lzma.
Availability and implementation: Linux, Windows and Mac implementations (both 32 and 64-bit) of DELIMINATE are freely available for download at: http://metagenomics.atc.tcs.com/compression/DELIMINATE.
Contact: sharmila@atc.tcs.com
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Monzoorul Haque Mohammed, Bio-sciences R&D Division, TCS Innovation Labs, Tata Consultancy Services Limited, 1 Software Units Layout, Madhapur, Hyderabad 500081, Andhra Pradesh, India
32
Cox AJ, Bauer MJ, Jakobi T, Rosone G. Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics 2012; 28:1415-9. [PMID: 22556365 DOI: 10.1093/bioinformatics/bts173] [Citation(s) in RCA: 100] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022] Open
Abstract
Motivation: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being widely applied to the large sets of sequences often encountered as the outcome of DNA sequencing experiments. In previous work, we presented a novel algorithm that allows the BWT of human genome scale data to be computed on very moderate hardware, thus enabling us to investigate the BWT as a tool for the compression of such datasets.
Results: We first used simulated reads to explore the relationship between the level of compression and the error rate, the length of the reads and the level of sampling of the underlying genome and compare choices of second-stage compression algorithm. We demonstrate that compression may be greatly improved by a particular reordering of the sequences in the collection and give a novel 'implicit sorting' strategy that enables these benefits to be realized without the overhead of sorting the reads. With these techniques, a 45× coverage of real human genome sequence data compresses losslessly to under 0.5 bits per base, allowing the 135.3 Gb of sequence to fit into only 8.2 GB of space (trimming a small proportion of low-quality bases from the reads improves the compression still further). This is >4 times smaller than the size achieved by a standard BWT-based compressor (bzip2) on the untrimmed reads, but an important further advantage of our approach is that it facilitates the building of compressed full-text indexes such as the FM-index on large-scale DNA sequence collections.
Availability: Code to construct the BWT and SAP-array on large genomic datasets is part of the BEETL library, available as a github repository at https://github.com/BEETL/BEETL.
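The transform itself is easy to state (a naive rotation-sort sketch for intuition only; BEETL's external-memory construction for genome-scale collections is far more involved):

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted cyclic rotations.
    The sentinel is assumed absent from the input and to sort lowest."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)
```

On repetitive inputs the output clusters identical characters into long runs, which is exactly what a second-stage compressor such as bzip2's move-to-front/Huffman stage exploits.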
Affiliation(s)
- Anthony J Cox, Computational Biology Group, Illumina Cambridge Ltd., Chesterford Research Park, Little Chesterford, Essex, UK
33
Pinho AJ, Pratas D, Garcia SP. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res 2012; 40:e27. [PMID: 22139935 PMCID: PMC3287168 DOI: 10.1093/nar/gkr1124] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2011] [Revised: 10/17/2011] [Accepted: 11/08/2011] [Indexed: 12/22/2022] Open
Abstract
Research in the genomic sciences is confronted with the volume of sequencing and resequencing data increasing at a higher pace than that of data storage and communication resources, shifting a significant part of research budgets from the sequencing component of a project to the computational one. Hence, being able to efficiently store sequencing and resequencing data is a problem of paramount importance. In this article, we describe GReEn (Genome Resequencing Encoding), a tool for compressing genome resequencing data using a reference genome sequence. It overcomes some drawbacks of the recently proposed tool GRS, namely, the possibility of compressing sequences that cannot be handled by GRS, faster running times and compression gains of over 100-fold for some sequences. This tool is freely available for non-commercial use at ftp://ftp.ieeta.pt/~ap/codecs/GReEn1.tar.gz.
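The core reference-based idea can be caricatured as storing only the differences from the reference (a toy substitution-only sketch; GReEn's actual method is a probabilistic copy model driving an arithmetic coder, and it also handles sequences of different lengths):

```python
def encode_diffs(target, reference):
    """Record (position, base) only where the target differs from the
    reference; assumes equal-length sequences (substitutions only)."""
    return [(i, t) for i, (t, r) in enumerate(zip(target, reference)) if t != r]

def decode_diffs(diffs, reference):
    """Rebuild the target exactly from the reference plus the diffs."""
    seq = list(reference)
    for i, base in diffs:
        seq[i] = base
    return "".join(seq)

reference = "ACGTACGTACGTACGT"
target = "ACGTACCTACGTACGA"  # two substitutions relative to the reference
```

For resequencing data that is nearly identical to the reference, the diff list is tiny compared with the raw sequence, which is why reference-based tools report compression gains of orders of magnitude.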
Affiliation(s)
- Armando J Pinho, Signal Processing Lab, IEETA/DETI, University of Aveiro, 3810-193 Aveiro, Portugal
34
Wan R, Anh VN, Asai K. Transformations for the compression of FASTQ quality scores of next-generation sequencing data. Bioinformatics 2012; 28:628-35. [PMID: 22171329 DOI: 10.1093/bioinformatics/btr689] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023]
Abstract
Motivation: The growth of next-generation sequencing means that more effective and efficient archiving methods are needed to store the generated data for public dissemination and in anticipation of more mature analytical methods later. This article examines methods for compressing the quality score component of the data to partly address this problem.
Results: We compare several compression policies for quality scores, in terms of both compression effectiveness and overall efficiency. The policies employ lossy and lossless transformations with one of several coding schemes. Experiments show that both lossy and lossless transformations are useful, and that simple coding methods, which consume less computing resources, are highly competitive, especially when random access to reads is needed.
Availability and implementation: Our C++ implementation, released under the Lesser General Public License, is available for download at http://www.cb.k.u-tokyo.ac.jp/asailab/members/rwan.
Supplementary information: Supplementary data are available at Bioinformatics online.
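A typical lossy transformation is quality-score binning: mapping the roughly 40 distinct Phred values onto a few representatives so downstream entropy coding sees fewer symbols (the bin edges below are hypothetical, chosen only for illustration, not taken from this paper):

```python
# Hypothetical bin edges over the Phred 0-41 range; real pipelines
# choose these to bound the impact on downstream variant calling.
BINS = [(0, 1), (2, 9), (10, 19), (20, 24), (25, 29), (30, 34), (35, 41)]

def bin_quality(q):
    """Replace a Phred score with its bin's midpoint (lossy)."""
    for lo, hi in BINS:
        if lo <= q <= hi:
            return (lo + hi) // 2
    raise ValueError(f"Phred score out of range: {q}")

quals = [2, 37, 38, 36, 40, 12, 25, 30]
binned = [bin_quality(q) for q in quals]
```

Shrinking the alphabet from ~40 symbols to 7 sharply reduces the entropy of the quality stream, trading a bounded per-score error for better compression.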
Affiliation(s)
- Raymond Wan, Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Chiba-ken 277-8561, Tokyo 135-0064, Japan
35
Pinho AJ, Ferreira PJSG, Neves AJR, Bastos CAC. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011; 6:e21588. [PMID: 21738720 PMCID: PMC3128062 DOI: 10.1371/journal.pone.0021588] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2010] [Accepted: 06/06/2011] [Indexed: 11/19/2022] Open
Abstract
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders (ii) careful programming techniques that allow orders as large as sixteen (iii) adequate inverted repeat handling (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well or even better than state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), contrasting with the statistical models underlying other methods, where the extensive data repetitions in DNA sequences is explored, and therefore have a non-local character.
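The per-position measure described above, the negative log of the model's probability estimate, can be sketched for a single adaptive finite-context model (one order-k model with add-one smoothing; the paper combines multiple competing orders, handles inverted repeats, and uses more careful probability estimates):

```python
from collections import defaultdict
from math import log2
import random

def bits_per_base(seq, k=2, alphabet="ACGT"):
    """Average number of bits per symbol under an adaptive order-k
    finite-context (Markov) model with add-one smoothing."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i in range(k, len(seq)):
        ctx, sym = seq[i - k:i], seq[i]
        seen = sum(counts[ctx].values())
        p = (counts[ctx][sym] + 1) / (seen + len(alphabet))
        total_bits -= log2(p)   # information content of this symbol
        counts[ctx][sym] += 1   # update the model after coding
    return total_bits / (len(seq) - k)

random.seed(1)
repetitive = "ACGT" * 100
shuffled = "".join(random.choice("ACGT") for _ in range(400))
```

A perfectly periodic sequence is quickly learned and costs far less than the 2-bits-per-base ceiling, while a shuffled sequence stays near it; the gap is exactly the kind of information profile the paper studies genome-wide.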
Affiliation(s)
- Armando J Pinho, Signal Processing Lab, IEETA/DETI, University of Aveiro, Aveiro, Portugal
36
Miller CA, Settle SH, Sulman EP, Aldape KD, Milosavljevic A. Discovering functional modules by identifying recurrent and mutually exclusive mutational patterns in tumors. BMC Med Genomics 2011; 4:34. [PMID: 21489305 PMCID: PMC3102606 DOI: 10.1186/1755-8794-4-34] [Citation(s) in RCA: 87] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2010] [Accepted: 04/14/2011] [Indexed: 11/10/2022] Open
Abstract
Background
Assays of multiple tumor samples frequently reveal recurrent genomic aberrations, including point mutations and copy-number alterations, that affect individual genes. Analyses that extend beyond single genes are often restricted to examining pathways, interactions and functional modules that are already known.
Methods
We present a method that identifies functional modules without any information other than patterns of recurrent and mutually exclusive aberrations (RME patterns) that arise due to positive selection for key cancer phenotypes. Our algorithm efficiently constructs and searches networks of potential interactions and identifies significant modules (RME modules) by using the algorithmic significance test.
Results
We apply the method to the TCGA collection of 145 glioblastoma samples, resulting in the extension of known pathways and the discovery of new functional modules. The method predicts a previously unknown role for EP300 in glioblastoma. We demonstrate the clinical relevance of these results by validating that expression of EP300 is prognostic, predicting survival independently of age at diagnosis and tumor grade.
Conclusions
We have developed a sensitive, simple, and fast method for automatically detecting functional modules in tumors based solely on patterns of recurrent genomic aberration. Due to its ability to analyze very large amounts of diverse data, we expect it to be increasingly useful when applied to the many tumor panels scheduled to be assayed in the near future.
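The RME pattern itself is simple to state: a module's genes should together cover many samples while rarely being altered in the same sample. A toy sketch of that coverage/exclusivity idea follows; it is not the paper's algorithmic significance test, and the gene names and data are purely illustrative.

```python
def coverage_and_exclusivity(mutations, module):
    """mutations: dict mapping sample id -> set of altered genes.
    Returns (coverage, exclusivity) for a candidate module:
    coverage    = fraction of samples hit by >= 1 module gene,
    exclusivity = fraction of hit samples with exactly one hit."""
    hits = [len(set(module) & genes) for genes in mutations.values()]
    covered = sum(1 for h in hits if h >= 1)
    exclusive = sum(1 for h in hits if h == 1)
    return covered / len(mutations), exclusive / max(covered, 1)

# Hypothetical tumor panel: each sample carries alterations in
# at most one module gene, the signature of mutual exclusivity.
tumors = {
    "s1": {"EGFR"}, "s2": {"PDGFRA"}, "s3": {"EGFR"},
    "s4": {"NF1"},  "s5": {"EGFR", "PDGFRA"}, "s6": set(),
}
cov, excl = coverage_and_exclusivity(tumors, ["EGFR", "PDGFRA", "NF1"])
```

A real search would score many candidate gene sets this way and assess each one statistically rather than by raw fractions.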
Affiliation(s)
- Christopher A Miller
- Graduate Program in Structural and Computational Biology and Molecular Biophysics, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas, USA
37
Abstract
In this Perspective, we propose that communication theory--a field of mathematics concerned with the problems of signal transmission, reception and processing--provides a new quantitative lens for investigating multicellular biology, ancient and modern. What underpins the cohesive organisation and collective behaviour of multicellular ecosystems such as microbial colonies and communities (microbiomes) and multicellular organisms such as plants and animals, whether built of simple tissue layers (sponges) or of complex differentiated cells arranged in tissues and organs (members of the 35 or so phyla of the subkingdom Metazoa)? How do mammalian tissues and organs develop, maintain their architecture, become subverted in disease, and decline with age? How did single-celled organisms coalesce to produce many-celled forms that evolved and diversified into the varied multicellular organisms in existence today? Some answers can be found in the blueprints or recipes encoded in (epi)genomes, yet others lie in the generic physical properties of biological matter such as the ability of cell aggregates to attain a certain complexity in size, shape, and pattern. We suggest that Lasswell's maxim "Who says what to whom in what channel with what effect" provides a foundation for understanding not only the emergence and evolution of multicellularity, but also the assembly and sculpting of multicellular ecosystems and many-celled structures, whether of natural or human-engineered origin. We explore how the abstraction of communication theory as an organising principle for multicellular biology could be realised. We highlight the inherent ability of communication theory to be blind to molecular and/or genetic mechanisms. We describe selected applications that analyse the physics of communication and use energy efficiency as a central tenet. Whilst communication theory has contributed, and could further contribute, to understanding a myriad of problems in biology, investigations of multicellular biology could, in turn, lead to advances in communication theory, especially in the still immature field of network information theory.
Affiliation(s)
- I S Mian
- Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA.
38
Abstract
MOTIVATION Modern sequencing instruments are able to generate hundreds of millions of short reads of genomic data. Such huge volumes of data require effective means to store them, provide quick access to any record, and enable fast decompression. RESULTS We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org). AVAILABILITY DSRC is freely available at http://sun.aei.polsl.pl/dsrc.
Affiliation(s)
- Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland.
39
Abstract
The quantitative underpinning of the information content of biosequences represents an elusive goal and yet also an obvious prerequisite to the quantitative modeling and study of biological function and evolution. Several past studies have addressed the question of what distinguishes biosequences from random strings, the latter being clearly unpalatable to the living cell. Such studies typically analyze the organization of biosequences in terms of their constituent characters or substrings and have, in particular, consistently exposed a tenacious lack of compressibility on the part of biosequences. This article attempts, perhaps for the first time, an assessment of the structure and randomness of polypeptides in terms of newly introduced parameters that relate to the vocabulary of their (suitably constrained) subsequences rather than their substrings. It is shown that such parameters capture structural/functional information and are related to each other under a specific set of rules that span biochemically diverse polypeptides. Measures on subsequences separate only a few amino acid strings from their random permutations, but show that the random permutations of most polypeptides amass along specific linear loci.
Affiliation(s)
- Alberto Apostolico
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30318, USA.
40
Data Compression Concepts and Algorithms and their Applications to Bioinformatics. ENTROPY 2009; 12:34. [PMID: 20157640 DOI: 10.3390/e12010034] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Data compression at its base is concerned with how information is organized in data. Understanding this organization can lead to efficient ways of representing the information and hence data compression. In this paper we review the ways in which ideas and approaches fundamental to the theory and practice of data compression have been used in the area of bioinformatics. We look at how basic theoretical ideas from data compression, such as the notions of entropy, mutual information, and complexity have been used for analyzing biological sequences in order to discover hidden patterns, infer phylogenetic relationships between organisms and study viral populations. Finally, we look at how inferred grammars for biological sequences have been used to uncover structure in biological sequences.
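Of the quantities mentioned, entropy is the simplest to illustrate: the empirical Shannon entropy of a symbol sequence bounds how compactly its symbols can be coded. A minimal sketch (illustrative only, with made-up example sequences):

```python
from collections import Counter
from math import log2

def shannon_entropy(seq):
    """Empirical Shannon entropy of a sequence, in bits per symbol."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A maximally mixed DNA sequence needs the full 2 bits/base...
h_uniform = shannon_entropy("ACGT" * 25)
# ...while a heavily biased one is far more compressible.
h_skewed = shannon_entropy("A" * 97 + "CGT")
```

Mutual information and the complexity measures discussed in the review build on the same counting machinery, applied to pairs of sequences or to growing substring vocabularies.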