Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol 2006;1:18. [PMID: 17049089 PMCID: PMC1630425 DOI: 10.1186/1748-7188-1-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2006] [Accepted: 10/18/2006] [Indexed: 11/10/2022] Open

For:	Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol 2006;1:18. [PMID: 17049089 PMCID: PMC1630425 DOI: 10.1186/1748-7188-1-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2006] [Accepted: 10/18/2006] [Indexed: 11/10/2022] Open

Number

Cited by Other Article(s)

Bohnsack KS, Kaden M, Abel J, Villmann T. Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023;20:119-135. [PMID: 34990369 DOI: 10.1109/tcbb.2022.3140873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]

Karamichalis R, Kari L, Konstantinidis S, Kopecki S, Solis-Reyes S. Additive methods for genomic signatures. BMC Bioinformatics 2016;17:313. [PMID: 27549194 PMCID: PMC4994249 DOI: 10.1186/s12859-016-1157-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2016] [Accepted: 07/19/2016] [Indexed: 01/09/2023] Open

Abstract

Background

Studies exploring the potential of Chaos Game Representations (CGR) of genomic sequences to act as “genomic signatures” (to be species- and genome-specific) showed that CGR patterns of nuclear and organellar DNA sequences of the same organism can be very different. While the hypothesis that CGRs of mitochondrial DNA sequences can act as genomic signatures was validated for a snapshot of all sequenced mitochondrial genomes available in the NCBI GenBank sequence database, to our knowledge no such extensive analysis of CGRs of nuclear DNA sequences exists to date.

Results

We analyzed an extensive dataset, totalling 1.45 gigabase pairs, of nuclear/nucleoid genomic sequences (nDNA) from 42 different organisms, spanning all major kingdoms of life. Our computational experiments indicate that CGR signatures of nDNA of two different origins cannot always be differentiated, especially if they originate from closely-related species such as H. sapiens and P. troglodytes or E. coli and E. fergusonii. To address this issue, we propose the general concept of additive DNA signature of a set (collection) of DNA sequences. One particular instance, the composite DNA signature, combines information from nDNA fragments and organellar (mitochondrial, chloroplast, or plasmid) genomes. We demonstrate that, in this dataset, composite DNA signatures originating from two different organisms can be differentiated in all cases, including those where the use of CGR signatures of nDNA failed or was inconclusive. Another instance, the assembled DNA signature, combines information from many short DNA subfragments (e.g., 100 basepairs) of a given DNA fragment, to produce its signature. We show that an assembled DNA signature has the same distinguishing power as a conventionally computed CGR signature, while using shorter contiguous sequences and potentially less sequence information.

Conclusions

Our results suggest that, while CGR signatures of nDNA cannot always play the role of genomic signatures, composite and assembled DNA signatures (separately or in combination) could potentially be used instead. Such additive signatures could be used, e.g., with raw unassembled next-generation sequencing (NGS) read data, when high-quality sequencing data is not available, or to complement information obtained by other methods of species identification or classification.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-016-1157-8) contains supplementary material, which is available to authorized users.

Collapse

Karamichalis R, Kari L, Konstantinidis S, Kopecki S. An investigation into inter- and intragenomic variations of graphic genomic signatures. BMC Bioinformatics 2015;16:246. [PMID: 26249837 PMCID: PMC4527362 DOI: 10.1186/s12859-015-0655-4] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2014] [Accepted: 06/30/2015] [Indexed: 11/30/2022] Open

Abstract

Background

Motivated by the general need to identify and classify species based on molecular evidence, genome comparisons have been proposed that are based on measuring mostly Euclidean distances between Chaos Game Representation (CGR) patterns of genomic DNA sequences.

Results

We provide, on an extensive dataset and using several different distances, confirmation of the hypothesis that CGR patterns are preserved along a genomic DNA sequence, and are different for DNA sequences originating from genomes of different species. This finding lends support to the theory that CGRs of genomic sequences can act as graphic genomic signatures. In particular, we compare the CGR patterns of over five hundred different 150,000 bp genomic sequences spanning one complete chromosome from each of six organisms, representing all kingdoms of life: H. sapiens (Animalia; chromosome 21), S. cerevisiae (Fungi; chromosome 4), A. thaliana (Plantae; chromosome 1), P. falciparum (Protista; chromosome 14), E. coli (Bacteria - full genome), and P. furiosus (Archaea - full genome). To maximize the diversity within each species, we also analyze the interrelationships within a set of over five hundred 150,000 bp genomic sequences sampled from the entire aforementioned genomes. Lastly, we provide some preliminary evidence of this method’s ability to classify genomic DNA sequences at lower taxonomic levels by comparing sequences sampled from the entire genome of H. sapiens (class Mammalia, order Primates) and of M. musculus (class Mammalia, order Rodentia), for a total length of approximately 174 million basepairs analyzed. We compute pairwise distances between CGRs of these genomic sequences using six different distances, and construct Molecular Distance Maps, which visualize all sequences as points in a two-dimensional or three-dimensional space, to simultaneously display their interrelationships.

Conclusion

Our analysis confirms, for this dataset, that CGR patterns of DNA sequences from the same genome are in general quantitatively similar, while being different for DNA sequences from genomes of different species. Our assessment of the performance of the six distances analyzed uses three different quality measures and suggests that several distances outperform the Euclidean distance, which has so far been almost exclusively used for such studies.

Collapse

Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014;15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open

Almeida JS. Sequence analysis by iterated maps, a review. Brief Bioinform 2013;15:369-75. [PMID: 24162172 DOI: 10.1093/bib/bbt072] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Almeida JS, Grüneberg A, Maass W, Vinga S. Fractal MapReduce decomposition of sequence alignment. Algorithms Mol Biol 2012;7:12. [PMID: 22551205 PMCID: PMC3394223 DOI: 10.1186/1748-7188-7-12] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2011] [Accepted: 05/02/2012] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.

RESULTS

In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming.

CONCLUSIONS

The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.

AVAILABILITY

Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".

Collapse

Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis. Algorithms Mol Biol 2012;7:10. [PMID: 22551152 PMCID: PMC3402988 DOI: 10.1186/1748-7188-7-10] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2012] [Accepted: 05/02/2012] [Indexed: 01/06/2023] Open

Abstract

Background

Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2^-Ldistance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.

Results

The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.

Conclusions

The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.

Collapse

Feng J, Hu Y, Wan P, Zhang A, Zhao W. New method for comparing DNA primary sequences based on a discrimination measure. J Theor Biol 2010;266:703-7. [PMID: 20688082 PMCID: PMC7094107 DOI: 10.1016/j.jtbi.2010.07.040] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2010] [Revised: 06/23/2010] [Accepted: 07/28/2010] [Indexed: 10/26/2022]

A web server for interactive and zoomable Chaos Game Representation images. SOURCE CODE FOR BIOLOGY AND MEDICINE 2009;4:6. [PMID: 19761591 PMCID: PMC2753581 DOI: 10.1186/1751-0473-4-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/05/2009] [Accepted: 09/17/2009] [Indexed: 11/10/2022]

Almeida JS, Vinga S. Biological sequences as pictures: a generic two dimensional solution for iterated maps. BMC Bioinformatics 2009;10:100. [PMID: 19335894 PMCID: PMC2678093 DOI: 10.1186/1471-2105-10-100] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2008] [Accepted: 03/31/2009] [Indexed: 11/24/2022] Open

Deschavanne P, Tufféry P. Exploring an alignment free approach for protein classification and structural class prediction. Biochimie 2007;90:615-25. [PMID: 18067866 DOI: 10.1016/j.biochi.2007.11.004] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2007] [Accepted: 11/09/2007] [Indexed: 11/25/2022]

Vinga S, Almeida JS. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 2007;8:393. [PMID: 17939871 PMCID: PMC2238722 DOI: 10.1186/1471-2105-8-393] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2007] [Accepted: 10/16/2007] [Indexed: 11/18/2022] Open