Reference Citation Analysis

For:	Allison L, Stern L, Edgoose T, Dix T. Sequence complexity for biological sequence analysis. Comput Chem 2000;24:43-55. [DOI: 10.1016/s0097-8485(00)80006-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.3] [Indexed: 10/25/2022]

Cited by Other Article(s)

Silva JM, Qi W, Pinho AJ, Pratas D. AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data. Gigascience 2023;12:giad101. [PMID: 38091509 PMCID: PMC10716826 DOI: 10.1093/gigascience/giad101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 09/29/2023] [Accepted: 11/07/2023] [Indexed: 12/18/2023]

Abstract

BACKGROUND

Low-complexity data analysis is the area that addresses the search for and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantially higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and can occur at different spatial distances, namely local, medium, or distant associations.

FINDINGS

This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar.

CONCLUSIONS

The AlcoR method enables fast sequence characterization through data complexity analysis and is best suited to scenarios involving new or unknown sequences. AlcoR is implemented in C, uses multithreading to increase computational speed, is flexible enough for multiple applications, and has no external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.

Eric PV, Gopalakrishnan G, Karunakaran M. An Optimal Seed Based Compression Algorithm for DNA Sequences. Adv Bioinformatics 2016;2016:3528406. [PMID: 27555868 PMCID: PMC4983397 DOI: 10.1155/2016/3528406] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/28/2015] [Revised: 05/09/2016] [Accepted: 06/19/2016] [Indexed: 11/26/2022]

Pratas D, Pinho AJ, Rodrigues JMOS. XS: a FASTQ read simulator. BMC Res Notes 2014;7:40. [PMID: 24433564 PMCID: PMC3927261 DOI: 10.1186/1756-0500-7-40] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 09/11/2013] [Accepted: 12/18/2013] [Indexed: 12/31/2022]

Pinho AJ, Garcia SP, Pratas D, Ferreira PJSG. DNA sequences at a glance. PLoS One 2013;8:e79922. [PMID: 24278218 PMCID: PMC3836782 DOI: 10.1371/journal.pone.0079922] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 11/27/2012] [Accepted: 09/30/2013] [Indexed: 11/20/2022]

Abstract

Data summarization and triage is one of the current top challenges in visual analytics. The goal is to let users visually inspect large data sets and examine or request data with particular characteristics. The need for summarization and visual analytics is also felt when dealing with digital representations of DNA sequences. Genomic data sets are growing rapidly, making their analysis increasingly more difficult, and raising the need for new, scalable tools. For example, being able to look at very large DNA sequences while immediately identifying potentially interesting regions would provide the biologist with a flexible exploratory and analytical tool. In this paper we present a new concept, the "information profile", which provides a quantitative measure of the local complexity of a DNA sequence, independently of the direction of processing. The computation of the information profiles is computationally tractable: we show that it can be done in time proportional to the length of the sequence. We also describe a tool to compute the information profiles of a given DNA sequence, and use the genome of the fission yeast Schizosaccharomyces pombe strain 972 h(-) and five human chromosomes 22 for illustration. We show that information profiles are useful for detecting large-scale genomic regularities by visual inspection. Several discovery strategies are possible, including the standalone analysis of single sequences, the comparative analysis of sequences from individuals from the same species, and the comparative analysis of sequences from different organisms. The comparison scale can be varied, allowing the users to zoom-in on specific details, or obtain a broad overview of a long segment. Software applications have been made available for non-commercial use at http://bioinformatics.ua.pt/software/dna-at-glance.

Scientific Élan Vital: Entropy Deficit or Inhomogeneity as a Unified Concept of Driving Forces of Life in Hierarchical Biosphere Driven by Photosynthesis. Entropy 2012. [DOI: 10.3390/e14020233] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 11/16/2022]

Dix TI, Powell DR, Allison L, Bernal J, Jaeger S, Stern L. Comparative analysis of long DNA sequences by per element information content using different contexts. BMC Bioinformatics 2007;8 Suppl 2:S10. [PMID: 17493248 PMCID: PMC1892068 DOI: 10.1186/1471-2105-8-s2-s10] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022]

Compression-based data mining of sequential data. Data Min Knowl Discov 2007. [DOI: 10.1007/s10618-006-0049-3] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.9] [Indexed: 10/23/2022]

Abel DL, Trevors JT. Three subsets of sequence complexity and their relevance to biopolymeric information. Theor Biol Med Model 2005;2:29. [PMID: 16095527 PMCID: PMC1208958 DOI: 10.1186/1742-4682-2-29] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 05/23/2005] [Accepted: 08/11/2005] [Indexed: 11/24/2022]

Ané C, Sanderson M. Missing the Forest for the Trees: Phylogenetic Compression and Its Implications for Inferring Complex Evolutionary Histories. Syst Biol 2005;54:146-57. [PMID: 15805016 DOI: 10.1080/10635150590905984] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 10/25/2022]

Xin C, Sam K, Ming L. A compression algorithm for DNA sequences. IEEE Eng Med Biol Mag 2001;20:61-6. [PMID: 11494771 DOI: 10.1109/51.940049] [Citation(s) in RCA: 60] [Impact Index Per Article: 2.6] [Indexed: 11/10/2022]

Current Awareness on Comparative and Functional Genomics. Yeast 2000. [DOI: 10.1002/1097-0061(20000630)17:2<159::aid-yea8>3.0.co;2-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]