Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Nalbantog̃lu OU, Russell DJ, Sayood K. Data Compression Concepts and Algorithms and their Applications to Bioinformatics. Entropy (Basel) 2010;12:34. [PMID: 20157640 DOI: 10.3390/e12010034] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]

For:	Nalbantog̃lu OU, Russell DJ, Sayood K. Data Compression Concepts and Algorithms and their Applications to Bioinformatics. Entropy (Basel) 2010;12:34. [PMID: 20157640 DOI: 10.3390/e12010034] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]

Number

Cited by Other Article(s)

Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. ENTROPY (BASEL, SWITZERLAND) 2021;23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models. ENTROPY 2021;23:e23050530. [PMID: 33925812 PMCID: PMC8146440 DOI: 10.3390/e23050530] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/01/2021] [Revised: 04/19/2021] [Accepted: 04/22/2021] [Indexed: 12/28/2022]

Sy P, Nagaraj N. Causal discovery using compression-complexity measures. J Biomed Inform 2021;117:103724. [PMID: 33722730 DOI: 10.1016/j.jbi.2021.103724] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Revised: 12/23/2020] [Accepted: 02/22/2021] [Indexed: 12/30/2022]

Kredens KV, Martins JV, Dordal OB, Ferrandin M, Herai RH, Scalabrin EE, Ávila BC. Vertical lossless genomic data compression tools for assembled genomes: A systematic literature review. PLoS One 2020;15:e0232942. [PMID: 32453750 PMCID: PMC7250429 DOI: 10.1371/journal.pone.0232942] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2019] [Accepted: 04/25/2020] [Indexed: 11/19/2022] Open

Storage Space Allocation Strategy for Digital Data with Message Importance. ENTROPY 2020;22:e22050591. [PMID: 33286363 PMCID: PMC7517127 DOI: 10.3390/e22050591] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Accepted: 05/19/2020] [Indexed: 11/30/2022]

Hosseini M, Pratas D, Pinho AJ. AC: A Compression Tool for Amino Acid Sequences. Interdiscip Sci 2019;11:68-76. [PMID: 30721401 DOI: 10.1007/s12539-019-00322-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2018] [Revised: 01/23/2019] [Accepted: 01/28/2019] [Indexed: 10/27/2022]

Luo Q, Guo C, Zhang YJ, Cai Y, Liu G. Algorithms designed for compressed-gene-data transformation among gene banks with different references. BMC Bioinformatics 2018;19:230. [PMID: 29914357 PMCID: PMC6006589 DOI: 10.1186/s12859-018-2230-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2017] [Accepted: 06/04/2018] [Indexed: 11/12/2022] Open

Wandelt S, Leser U. Sequence Factorization with Multiple References. PLoS One 2015;10:e0139000. [PMID: 26422374 PMCID: PMC4589410 DOI: 10.1371/journal.pone.0139000] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2014] [Accepted: 09/07/2015] [Indexed: 11/29/2022] Open

Abstract

The success of high-throughput sequencing has lead to an increasing number of projects which sequence large populations of a species. Storage and analysis of sequence data is a key challenge in these projects, because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between input sequence and a reference sequence, gained lots of interest in this field. Highly-similar sequences, e.g., Human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that the compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for the referential compression against multiple references: The factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speed (0.01 MB/s to more than 600 MB/s), and main memory usage (few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.

Collapse

Handelman SK, Seweryn M, Smith RM, Hartmann K, Wang D, Pietrzak M, Johnson AD, Kloczkowski A, Sadee W. Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs). BMC Genomics 2015;16 Suppl 8:S8. [PMID: 26111110 PMCID: PMC4480832 DOI: 10.1186/1471-2164-16-s8-s8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open

Abstract

BACKGROUND

Over the past 50,000 years, shifts in human-environmental or human-human interactions shaped genetic differences within and among human populations, including variants under positive selection. Shaped by environmental factors, such variants influence the genetics of modern health, disease, and treatment outcome. Because evolutionary processes tend to act on gene regulation, we test whether regulatory variants are under positive selection. We introduce a new approach to enhance detection of genetic markers undergoing positive selection, using conditional entropy to capture recent local selection signals.

RESULTS

We use conditional logistic regression to compare our Adjusted Haplotype Conditional Entropy (H|H) measure of positive selection to existing positive selection measures. H|H and existing measures were applied to published regulatory variants acting in cis (cis-eQTLs), with conditional logistic regression testing whether regulatory variants undergo stronger positive selection than the surrounding gene. These cis-eQTLs were drawn from six independent studies of genotype and RNA expression. The conditional logistic regression shows that, overall, H|H is substantially more powerful than existing positive-selection methods in identifying cis-eQTLs against other Single Nucleotide Polymorphisms (SNPs) in the same genes. When broken down by Gene Ontology, H|H predictions are particularly strong in some biological process categories, where regulatory variants are under strong positive selection compared to the bulk of the gene, distinct from those GO categories under overall positive selection. . However, cis-eQTLs in a second group of genes lack positive selection signatures detectable by H|H, consistent with ancient short haplotypes compared to the surrounding gene (for example, in innate immunity GO:0042742); under such other modes of selection, H|H would not be expected to be a strong predictor.. These conditional logistic regression models are adjusted for Minor allele frequency(MAF); otherwise, ascertainment bias is a huge factor in all eQTL data sets. Relationships between Gene Ontology categories, positive selection and eQTL specificity were replicated with H|H in a single larger data set. Our measure, Adjusted Haplotype Conditional Entropy (H|H), was essential in generating all of the results above because it: 1) is a stronger overall predictor for eQTLs than comparable existing approaches, and 2) shows low sequential auto-correlation, overcoming problems with convergence of these conditional regression statistical models.

CONCLUSIONS

Our new method, H|H, provides a consistently more robust signal associated with cis-eQTLs compared to existing methods. We interpret this to indicate that some cis-eQTLs are under positive selection compared to their surrounding genes. Conditional entropy indicative of a selective sweep is an especially strong predictor of eQTLs for genes in several biological processes of medical interest. Where conditional entropy is a weak or negative predictor of eQTLs, such as innate immune genes, this would be consistent with balancing selection acting on such eQTLs over long time periods. Different measures of selection may be needed for variant prioritization under other modes of evolutionary selection.

Collapse

Affiliation(s)

Samuel K Handelman Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA
Michal Seweryn Mathematical Biosciences Institute, Jennings Hall 3rd Floor, 1735 Neil Ave., Columbus, OH 43210, USA Faculty of Mathematics and Computer Science, Łódź University, Narutowicza 65, 90-131 Łódź, Poland Division of Biostatistics, The Ohio State University College of Public Health, Cunz Hall, 1841 Neil Avenue, Columbus, OH 43210-1240, USA
Ryan M Smith Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA
Katherine Hartmann Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA
Danxin Wang Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA
Maciej Pietrzak Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA Division of Biostatistics, The Ohio State University College of Public Health, Cunz Hall, 1841 Neil Avenue, Columbus, OH 43210-1240, USA
Andrew D Johnson Division of Intramural Research, National Heart, Lung and Blood Institute, Cardiovascular Epidemiology and Human Genomics Branch, The Framingham Heart Study, 73 Mt. Wayte Ave., Suite #2, Framingham, MA, USA
Andrzej Kloczkowski Battelle Center for Mathematical Medicine, Nationwide Children's Hospital, 700 Children's Drive, Columbus OH 43205, USA Department of Pediatrics, The Ohio State University College of Medicine, 700 Children's Drive, Columbus OH 43205, USA Kavli Institute for Theoretical Physics China, Chinese Academy of Sciences, Beijing 100190, China
Wolfgang Sadee Center for Pharmacogenomics, The Ohio State University College of Medicine, Graves Hall, 330 W. 10th Ave., Columbus, OH 43210, USA

Collapse

Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014;15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open

Shockley KR. Using weighted entropy to rank chemicals in quantitative high-throughput screening experiments. ACTA ACUST UNITED AC 2013;19:344-53. [PMID: 24056003 DOI: 10.1177/1087057113505325] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]

Wandelt S, Leser U. FRESCO: Referential compression of highly similar sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013;10:1275-1288. [PMID: 24524158 DOI: 10.1109/tcbb.2013.122] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]

Abstract

In many applications, sets of similar texts or sequences are of high importance. Prominent examples are revision histories of documents or genomic sequences. Modern high-throughput sequencing technologies are able to generate DNA sequences at an ever-increasing rate. In parallel to the decreasing experimental time and cost necessary to produce DNA sequences, computational requirements for analysis and storage of the sequences are steeply increasing. Compression is a key technology to deal with this challenge. Recently, referential compression schemes, storing only the differences between a to-be-compressed input and a known reference sequence, gained a lot of interest in this field. In this paper, we propose a general open-source framework to compress large amounts of biological sequence data called Framework for REferential Sequence COmpression (FRESCO). Our basic compression algorithm is shown to be one to two orders of magnitudes faster than comparable related work, while achieving similar compression ratios. We also propose several techniques to further increase compression ratios, while still retaining the advantage in speed: 1) selecting a good reference sequence; and 2) rewriting a reference sequence to allow for better compression. In addition,we propose a new way of further boosting the compression ratios by applying referential compression to already referentially compressed files (second-order compression). This technique allows for compression ratios way beyond state of the art, for instance,4,000:1 and higher for human genomes. We evaluate our algorithms on a large data set from three different species (more than 1,000 genomes, more than 3 TB) and on a collection of versions of Wikipedia pages. Our results show that real-time compression of highly similar sequences at high compression ratios is possible on modern hardware.

Collapse

HydroZIP: How Hydrological Knowledge can Be Used to Improve Compression of Hydrological Data. ENTROPY 2013. [DOI: 10.3390/e15041289] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]

Wandelt S, Leser U. Adaptive efficient compression of genomes. Algorithms Mol Biol 2012;7:30. [PMID: 23146997 PMCID: PMC3541066 DOI: 10.1186/1748-7188-7-30] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2012] [Accepted: 10/26/2012] [Indexed: 12/02/2022] Open

Mian IS, Rose C. Communication theory and multicellular biology. Integr Biol (Camb) 2011;3:350-67. [PMID: 21424025 DOI: 10.1039/c0ib00117a] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022]

Abstract

In this Perspective, we propose that communication theory--a field of mathematics concerned with the problems of signal transmission, reception and processing--provides a new quantitative lens for investigating multicellular biology, ancient and modern. What underpins the cohesive organisation and collective behaviour of multicellular ecosystems such as microbial colonies and communities (microbiomes) and multicellular organisms such as plants and animals, whether built of simple tissue layers (sponges) or of complex differentiated cells arranged in tissues and organs (members of the 35 or so phyla of the subkingdom Metazoa)? How do mammalian tissues and organs develop, maintain their architecture, become subverted in disease, and decline with age? How did single-celled organisms coalesce to produce many-celled forms that evolved and diversified into the varied multicellular organisms in existence today? Some answers can be found in the blueprints or recipes encoded in (epi)genomes, yet others lie in the generic physical properties of biological matter such as the ability of cell aggregates to attain a certain complexity in size, shape, and pattern. We suggest that Lasswell's maxim "Who says what to whom in what channel with what effect" provides a foundation for understanding not only the emergence and evolution of multicellularity, but also the assembly and sculpting of multicellular ecosystems and many-celled structures, whether of natural or human-engineered origin. We explore how the abstraction of communication theory as an organising principle for multicellular biology could be realised. We highlight the inherent ability of communication theory to be blind to molecular and/or genetic mechanisms. We describe selected applications that analyse the physics of communication and use energy efficiency as a central tenet. Whilst communication theory has and could contribute to understanding a myriad of problems in biology, investigations of multicellular biology could, in turn, lead to advances in communication theory, especially in the still immature field of network information theory.

Collapse