1
Salgado-García R. Time-irreversibility test for random-length time series: The matching-time approach applied to DNA. Chaos (Woodbury, N.Y.) 2021; 31:123126. [PMID: 34972331 DOI: 10.1063/5.0062805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2021] [Accepted: 12/01/2021] [Indexed: 06/14/2023]
Abstract
In this work, we implement the so-called matching-time estimators for estimating the entropy rate as well as the entropy production rate for symbolic sequences. These estimators are based on recurrence properties of the system, which have been shown to be appropriate for testing irreversibility, especially when the sequences have large correlations or memory. Based on limit theorems for matching times, we derive a maximum likelihood estimator for the entropy rate by assuming that we have a set of moderately short symbolic time series of finite random duration. We show that the proposed estimator has several properties that make it adequate for estimating the entropy rate and entropy production rate (or for testing the irreversibility) when the sample sequences have different lengths, such as the coding sequences of DNA. We test our approach with controlled examples of Markov chains, non-linear chaotic maps, and linear and non-linear autoregressive processes. We also implement our estimators for genomic sequences to show that the degree of irreversibility of coding sequences in human DNA is significantly larger than that for the corresponding non-coding sequences.
Affiliation(s)
- R Salgado-García
- Centro de Investigación en Ciencias-IICBA, Physics Department, Universidad Autónoma del Estado de Morelos, Avenida Universidad 1001, colonia Chamilpa, CP 62209, Cuernavaca Morelos, Mexico
2
Kang DY, DeYoung PN, Tantiongloc J, Coleman TP, Owens RL. Statistical uncertainty quantification to augment clinical decision support: a first implementation in sleep medicine. NPJ Digit Med 2021; 4:142. [PMID: 34593972 PMCID: PMC8484290 DOI: 10.1038/s41746-021-00515-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2021] [Accepted: 09/13/2021] [Indexed: 11/09/2022] Open
Abstract
Machine learning has the potential to change the practice of medicine, particularly in areas that require pattern recognition (e.g. radiology). Although automated classification is unlikely to be perfect, few modern machine learning tools have the ability to assess their own classification confidence to recognize uncertainty that might need human review. Using automated single-channel sleep staging as a first implementation, we demonstrated that uncertainty information (as quantified using Shannon entropy) can be utilized in a "human in the loop" methodology to promote targeted review of uncertain sleep stage classifications on an epoch-by-epoch basis. Across 20 sleep studies, this feedback methodology proved capable of improving scoring agreement with the gold standard over automated scoring alone (average improvement in Cohen's Kappa of 0.28), in a fraction of the scoring time compared to full manual review (60% reduction). In summary, our uncertainty-based clinician-in-the-loop framework promotes the improvement of medical classification accuracy/confidence in a cost-effective and economically resourceful manner.
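The entropy-based review step described in this abstract can be illustrated with a minimal sketch. It assumes a classifier that emits a probability vector over sleep stages for each epoch; the function names, the example probabilities, and the 1-bit threshold are hypothetical and are not taken from the paper.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain_epochs(epoch_probs, threshold_bits):
    """Return indices of epochs whose class-probability entropy exceeds the threshold."""
    return [i for i, probs in enumerate(epoch_probs)
            if shannon_entropy(probs) > threshold_bits]

# Hypothetical classifier outputs over five sleep stages for three epochs.
epochs = [
    [0.96, 0.01, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.40, 0.35, 0.15, 0.05, 0.05],  # ambiguous between two stages
    [0.20, 0.20, 0.20, 0.20, 0.20],  # maximally uncertain
]
flagged = flag_uncertain_epochs(epochs, threshold_bits=1.0)
print(flagged)
```

Only the flagged epochs would be passed to a human scorer; the threshold trades review time against agreement and would need to be tuned on real data.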
Affiliation(s)
- Dae Y Kang
- Department of Medicine, Division of Pulmonary, Critical Care, & Sleep Medicine, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
- Pamela N DeYoung
- Department of Medicine, Division of Pulmonary, Critical Care, & Sleep Medicine, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
- Justin Tantiongloc
- Department of Computer Science & Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
- Todd P Coleman
- Department of Bioengineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
- Robert L Owens
- Department of Medicine, Division of Pulmonary, Critical Care, & Sleep Medicine, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA, 92093, USA
3
Entropy Estimation Using a Linguistic Zipf-Mandelbrot-Li Model for Natural Sequences. Entropy 2021; 23:e23091100. [PMID: 34573725 PMCID: PMC8468050 DOI: 10.3390/e23091100] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/15/2021] [Revised: 08/14/2021] [Accepted: 08/19/2021] [Indexed: 11/17/2022]
Abstract
Entropy estimation faces numerous challenges when applied to various real-world problems. Our interest is in divergence and entropy estimation algorithms which are capable of rapid estimation for natural sequence data such as human and synthetic languages. This typically requires a large amount of data; however, we propose a new approach which is based on a new rank-based analytic Zipf–Mandelbrot–Li probabilistic model. Unlike previous approaches, which do not consider the nature of the probability distribution in relation to language, we introduce a novel analytic Zipfian model which includes linguistic constraints. This provides more accurate distributions for natural sequences such as natural or synthetic emergent languages. Results are given which indicate the performance of the proposed ZML model. We derive an entropy estimation method which incorporates the linguistic constraint-based Zipf–Mandelbrot–Li model into a new non-equiprobable coincidence counting algorithm which is shown to be effective for tasks such as entropy rate estimation with limited data.
4
Silva M, Pratas D, Pinho AJ. Efficient DNA sequence compression with neural networks. Gigascience 2020; 9:giaa119. [PMID: 33179040 PMCID: PMC7657843 DOI: 10.1093/gigascience/giaa119] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 05/26/2020] [Revised: 08/19/2020] [Accepted: 10/02/2020] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression by 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.
Affiliation(s)
- Milton Silva
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Diogo Pratas
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Virology, University of Helsinki, Haartmaninkatu 3, 00014 Helsinki, Finland
- Armando J Pinho
- Institute of Electronics and Informatics Engineering of Aveiro, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
- Department of Electronics Telecommunications and Informatics, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
5
Al-Okaily A, Almarri B, Al Yami S, Huang CH. Toward a Better Compression for DNA Sequences Using Huffman Encoding. J Comput Biol 2017; 24:280-288. [PMID: 27960065 PMCID: PMC5372760 DOI: 10.1089/cmb.2016.0151] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022] Open
Abstract
Due to the significant amount of DNA data being generated by next-generation sequencing machines for genomes of lengths ranging from megabases to gigabases, there is an increasing need to compress such data so that they occupy less space and can be transmitted faster. Different implementations of Huffman encoding that incorporate the characteristics of DNA sequences prove to compress DNA data better. These implementations center on the concepts of selecting frequent repeats so as to force a skewed Huffman tree, as well as the construction of multiple Huffman trees when encoding. The implementations demonstrate improvements in the compression ratios for five genomes with lengths ranging from 5 to 50 Mbp, compared with the standard Huffman tree algorithm. The research hence suggests an improvement on all such DNA sequence compression algorithms that use the conventional Huffman encoding. Accompanying software is publicly available (Al-Okaily, 2016).
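For orientation, the conventional Huffman baseline that this abstract compares against can be sketched as follows. This is the textbook algorithm only, not the authors' repeat-selection or multi-tree variants; the function names and the toy sequence are illustrative assumptions.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code (symbol -> bit string) with the standard Huffman algorithm.

    Assumes a non-empty input string.
    """
    freq = Counter(text)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encoded_bits(text, code):
    """Total number of bits needed to encode text with the given code."""
    return sum(len(code[s]) for s in text)

seq = "AAAAAAAACCCCGGTT"  # toy sequence with a skewed base composition
code = huffman_code(seq)
bits = encoded_bits(seq, code)
print(code, bits, "bits versus", 2 * len(seq), "bits at a fixed 2 bits per base")
```

On a sequence with near-uniform base frequencies the code degenerates to 2 bits per base, which is one reason plain Huffman coding gains little on DNA unless longer repeats are treated as symbols.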
Affiliation(s)
- Anas Al-Okaily
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
- Badar Almarri
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
- Sultan Al Yami
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
- Chun-Hsi Huang
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut
6
Eric PV, Gopalakrishnan G, Karunakaran M. An Optimal Seed Based Compression Algorithm for DNA Sequences. Adv Bioinformatics 2016; 2016:3528406. [PMID: 27555868 PMCID: PMC4983397 DOI: 10.1155/2016/3528406] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 11/28/2015] [Revised: 05/09/2016] [Accepted: 06/19/2016] [Indexed: 11/26/2022] Open
Abstract
This paper proposes a seed-based lossless compression algorithm for DNA sequences that uses a substitution method similar to the Lempel-Ziv compression scheme. The proposed method exploits the repetition structures that are inherent in DNA sequences by creating an offline dictionary which contains all such repeats along with the details of mismatches. By ensuring that only promising mismatches are allowed, the method achieves a compression ratio that is on par with or better than that of the existing lossless DNA sequence compression algorithms.
Affiliation(s)
- Pamela Vinitha Eric
- Department of Information Science and Engineering, Rajiv Gandhi Institute of Technology, Bangalore 560032, India
- Gopakumar Gopalakrishnan
- Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India
- Muralikrishnan Karunakaran
- Department of Computer Science and Engineering, National Institute of Technology Calicut, Kerala 673601, India
7
Paci G, Cristadoro G, Monti B, Lenci M, Degli Esposti M, Castellani GC, Remondini D. Characterization of DNA methylation as a function of biological complexity via dinucleotide inter-distances. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2016; 374:rsta.2015.0227. [PMID: 26857665 DOI: 10.1098/rsta.2015.0227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Accepted: 11/23/2015] [Indexed: 06/05/2023]
Abstract
We perform a statistical study of the distances between successive occurrences of a given dinucleotide in the DNA sequence for a number of organisms of different complexity. Our analysis highlights peculiar features of the CG dinucleotide distribution in mammalian DNA, pointing towards a connection with the role of such dinucleotide in DNA methylation. While the CG distributions of mammals exhibit exponential tails with comparable parameters, the picture for the other organisms studied (e.g. fish, insects, bacteria and viruses) is more heterogeneous, possibly because in these organisms DNA methylation has different functional roles. Our analysis suggests that the distribution of the distances between CG dinucleotides provides useful insights into characterizing and classifying organisms in terms of methylation functionalities.
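The basic quantity in this abstract, the distance between successive occurrences of a given dinucleotide, can be computed with a short sketch. The function name and the toy sequence are hypothetical, and the sketch stops at collecting distances; fitting the tail of their distribution, as the paper does, is not attempted here.

```python
from collections import Counter

def dinucleotide_distances(seq, dinuc="CG"):
    """Distances between start positions of successive occurrences of dinuc in seq."""
    positions = []
    start = seq.find(dinuc)
    while start != -1:
        positions.append(start)
        # Search again from the next position so overlapping matches are kept.
        start = seq.find(dinuc, start + 1)
    return [b - a for a, b in zip(positions, positions[1:])]

seq = "ACGTTCGCGAACG"  # toy sequence; CG starts at positions 1, 5, 7 and 11
distances = dinucleotide_distances(seq)
histogram = Counter(distances)  # empirical distribution of inter-CG distances
print(distances, dict(histogram))
```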
Affiliation(s)
- Giulia Paci
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Giampaolo Cristadoro
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Barbara Monti
- Department of Pharmacy and Biotechnology, University of Bologna, Via S. Donato 15, Bologna 40127, Italy
- Marco Lenci
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
- Mirko Degli Esposti
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
- Gastone C Castellani
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
- Daniel Remondini
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
- Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
8
Prediction of multi-target networks of neuroprotective compounds with entropy indices and synthesis, assay, and theoretical study of new asymmetric 1,2-rasagiline carbamates. Int J Mol Sci 2014; 15:17035-64. [PMID: 25255029 PMCID: PMC4200850 DOI: 10.3390/ijms150917035] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 03/11/2014] [Revised: 08/19/2014] [Accepted: 08/21/2014] [Indexed: 11/25/2022] Open
Abstract
In a multi-target complex network, the links (Lij) represent the interactions between the drug (di) and the target (tj), characterized by different experimental measures (Ki, Km, IC50, etc.) obtained in pharmacological assays under diverse boundary conditions (cj). In this work, we use Shannon entropy measures to develop a model encompassing a multi-target network of neuroprotective/neurotoxic compounds reported in the CHEMBL database. The model correctly predicts >8300 experimental outcomes with Accuracy, Specificity, and Sensitivity above 80%–90% on training and external validation series. Indeed, the model can calculate different outcomes for >30 experimental measures in >400 different experimental protocols in relation to >150 molecular and cellular targets on 11 different organisms (including human). We also report for the first time the synthesis, characterization, and experimental assays of a new series of chiral 1,2-rasagiline carbamate derivatives not reported in previous works. The experimental tests included: (1) assay in the absence of neurotoxic agents; (2) in the presence of glutamate; and (3) in the presence of H2O2. Lastly, we used the new Assessing Links with Moving Averages (ALMA)-entropy model to predict possible outcomes for the new compounds in a high number of pharmacological tests not carried out experimentally.
9
Sardaraz M, Tahir M, Ikram AA, Bajwa H. SeqCompress: an algorithm for biological sequence compression. Genomics 2014; 104:225-8. [PMID: 25173568 DOI: 10.1016/j.ygeno.2014.08.007] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 03/24/2014] [Revised: 08/03/2014] [Accepted: 08/19/2014] [Indexed: 11/17/2022]
Abstract
The growth of Next Generation Sequencing technologies presents significant research challenges, specifically the design of bioinformatics tools that handle massive amounts of data efficiently. The storage cost of biological sequence data has become a noticeable proportion of the total cost of generation and analysis. In particular, the increase in the DNA sequencing rate is significantly outstripping the rate of increase in disk storage capacity, so the data volume may exceed the available storage capacity. It is essential to develop algorithms that handle large data sets via better memory management. This article presents a DNA sequence compression algorithm, SeqCompress, that copes with the space complexity of biological sequences. The algorithm is based on lossless data compression and uses a statistical model as well as arithmetic coding to compress DNA sequences. The proposed algorithm is compared with recent specialized compression tools for biological sequences. Experimental results show that the proposed algorithm has a better compression gain than other existing algorithms.
Affiliation(s)
- Muhammad Sardaraz
- Department of Computing and Technology, Iqra University, Islamabad, Pakistan
- Muhammad Tahir
- Department of Computer Science, University of Wah, Wah Cantt, Pakistan
- Ataul Aziz Ikram
- Department of Electrical Engineering, National University, Islamabad, Pakistan
- Hassan Bajwa
- Department of Electrical Engineering, University of Bridgeport, Bridgeport, USA
10
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097;
11
González-Díaz H, Herrera-Ibatá DM, Duardo-Sánchez A, Munteanu CR, Orbegozo-Medina RA, Pazos A. ANN Multiscale Model of Anti-HIV Drugs Activity vs AIDS Prevalence in the US at County Level Based on Information Indices of Molecular Graphs and Social Networks. J Chem Inf Model 2014; 54:744-55. [DOI: 10.1021/ci400716y] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Indexed: 02/06/2023]
Affiliation(s)
- Humberto González-Díaz
- Department of Organic Chemistry II, Faculty of Science and Technology, University of the Basque Country UPV/EHU, 48940, Leioa, Vizcaya, Spain
- IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Vizcaya, Spain
- Diana María Herrera-Ibatá
- Department of Information and Communication Technologies, University of A Coruña UDC, 15071, A Coruña, A Coruña, Spain
- Aliuska Duardo-Sánchez
- Department of Information and Communication Technologies, University of A Coruña UDC, 15071, A Coruña, A Coruña, Spain
- Cristian R. Munteanu
- Department of Information and Communication Technologies, University of A Coruña UDC, 15071, A Coruña, A Coruña, Spain
- Ricardo Alfredo Orbegozo-Medina
- Department of Microbiology and Parasitology, University of Santiago de Compostela (USC), 15782, Santiago de Compostela, A Coruña, Spain
- Alejandro Pazos
- Department of Information and Communication Technologies, University of A Coruña UDC, 15071, A Coruña, A Coruña, Spain
12
13
Pinho AJ, Ferreira PJSG, Neves AJR, Bastos CAC. On the representability of complete genomes by multiple competing finite-context (Markov) models. PLoS One 2011; 6:e21588. [PMID: 21738720 PMCID: PMC3128062 DOI: 10.1371/journal.pone.0021588] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 09/10/2010] [Accepted: 06/06/2011] [Indexed: 11/19/2022] Open
Abstract
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study about the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as or even better than state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
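The performance measure described in this abstract, the average of the negative base-2 logarithm of the probability estimate at each position, can be sketched for a single order-k model. This is a minimal illustration under stated assumptions: one adaptive model with additive smoothing (the parameter `alpha` is a hypothetical choice), with no competition between models of different orders and no inverted-repeat handling, both of which the paper relies on.

```python
import math
from collections import defaultdict

def bits_per_base(seq, k, alpha=1.0, alphabet="ACGT"):
    """Average -log2 P(symbol | previous k symbols) under an adaptive order-k model.

    Assumes a non-empty sequence over the given alphabet.
    """
    # counts[context][symbol] = times symbol has followed context so far.
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 0))
    total_bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - k):i]  # shorter contexts at the very start
        c = counts[ctx]
        # Additive-smoothing estimate, so unseen symbols keep nonzero probability.
        p = (c[sym] + alpha) / (sum(c.values()) + alpha * len(alphabet))
        total_bits += -math.log2(p)
        c[sym] += 1  # update only after coding, as a decoder could do
    return total_bits / len(seq)

periodic = "ACGT" * 50
order0 = bits_per_base(periodic, k=0)  # sees only base frequencies
order2 = bits_per_base(periodic, k=2)  # learns the deterministic continuation
print(round(order0, 3), round(order2, 3))
```

Recording the per-position term instead of the average gives the information profile mentioned in the abstract.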
Affiliation(s)
- Armando J Pinho
- Signal Processing Lab, IEETA/DETI, University of Aveiro, Aveiro, Portugal.
14
Bose R, Chouhan S. Alternate measure of information useful for DNA sequences. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics 2011; 83:051918. [PMID: 21728582 DOI: 10.1103/physreve.83.051918] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 08/27/2010] [Revised: 03/07/2011] [Indexed: 05/31/2023]
Abstract
We propose an alternate measure of information, called superinformation, which has been found to be very effective for analyzing the coding and noncoding regions of the DNA. This superinformation is actually a measure of the "randomness of randomness." It has been found to be highly accurate in classifying coding and noncoding regions of human DNA. In the proposed method, no prior training is required. This technique exhibits higher accuracy than previously reported techniques in distinguishing between the coding and the noncoding portions of the DNA. Superinformation can also be used to analyze the untranslated regions in various genes.
Affiliation(s)
- Ranjan Bose
- Department of Electrical Engineering, IIT Delhi, Hauz Khas, New Delhi, India
15
16
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. Entropy 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 01/03/2023]
17
Giancarlo R, Scaturro D, Utro F. Textual data compression in computational biology: a synopsis. Bioinformatics 2009; 25:1575-86. [DOI: 10.1093/bioinformatics/btp117] [Citation(s) in RCA: 63] [Impact Index Per Article: 4.2] [Indexed: 11/15/2022] Open
18
Estimating the Entropy of Binary Time Series: Methodology, Some Theory and a Simulation Study. Entropy 2008. [DOI: 10.3390/entropy-e10020071] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Indexed: 11/16/2022]
19
Cruz-Monteagudo M, González-Díaz H, Borges F, Dominguez ER, Cordeiro MNDS. 3D-MEDNEs: an alternative "in silico" technique for chemical research in toxicology. 2. quantitative proteome-toxicity relationships (QPTR) based on mass spectrum spiral entropy. Chem Res Toxicol 2008; 21:619-32. [PMID: 18257557 DOI: 10.1021/tx700296t] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 11/29/2022]
Abstract
Low range mass spectra (MS) characterization of serum proteome offers the best chance of discovering proteome-(early drug-induced cardiac toxicity) relationships, called here Pro-EDICToRs. However, due to the thousands of proteins involved, finding the single disease-related protein could be a hard task. The search for a model based on general MS patterns becomes a more realistic choice. In our previous work (González-Díaz, H., et al. Chem. Res. Toxicol. 2003, 16, 1318-1327), we introduced the molecular structure information indices called 3D-Markovian electronic delocalization entropies (3D-MEDNEs). In this previous work, quantitative structure-toxicity relationship (QSTR) techniques allowed us to link 3D-MEDNEs with blood toxicological properties of drugs. In this second part, we extend 3D-MEDNEs to numerically encode biologically relevant information present in MS of the serum proteome for the first time. Using the same idea behind QSTR techniques, we can seek now by analogy a quantitative proteome-toxicity relationship (QPTR). The new QPTR models link MS 3D-MEDNEs with drug-induced toxicological properties from blood proteome information. We first generalized Randic's spiral graph and lattice networks of protein sequences to represent the MS of 62 serum proteome samples with more than 370 100 intensity (I_i) signals with m/z bandwidth above 700-12000 each. Next, we calculated the 3D-MEDNEs for each MS using the software MARCH-INSIDE. After that, we developed several QPTR models using different machine learning and MS representation algorithms to classify samples as control or positive Pro-EDICToRs samples. The best QPTR proposed showed accuracy values ranging from 83.8% to 87.1% and leave-one-out (LOO) predictive ability of 77.4-85.5%. This work demonstrated that the idea behind classic drug QSTR models may be extended to construct QPTRs with proteome MS data.
Affiliation(s)
- Maykel Cruz-Monteagudo
- Physico-Chemical Molecular Research Unit, Department of Organic Chemistry, Faculty of Pharmacy, University of Porto, 4150-047 Porto, Portugal
20
Dix TI, Powell DR, Allison L, Bernal J, Jaeger S, Stern L. Comparative analysis of long DNA sequences by per element information content using different contexts. BMC Bioinformatics 2007; 8 Suppl 2:S10. [PMID: 17493248 PMCID: PMC1892068 DOI: 10.1186/1471-2105-8-s2-s10] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Features of a DNA sequence can be found by compressing the sequence under a suitable model; good compression implies low information content. Good DNA compression models consider repetition, differences between repeats, and base distributions. From a linear DNA sequence, a compression model can produce a linear information sequence. Linear space complexity is important when exploring long DNA sequences of the order of millions of bases. Compressing a sequence in isolation will include information on self-repetition, whereas compressing a sequence Y in the context of another X can find what new information X gives about Y. This paper presents a methodology for performing comparative analysis to find features exposed by such models. RESULTS We apply such a model to find features across chromosomes of Cyanidioschyzon merolae. We present a tool that provides useful linear transformations to investigate and save new sequences. Various examples illustrate the methodology, finding features for sequences alone and in different contexts. We also show how to highlight all sets of self-repetition features, in this case within Plasmodium falciparum chromosome 2. CONCLUSION The methodology finds features that are significant and that biologists confirm. The exploration of long information sequences in linear time and space is fast and the saved results are self-documenting.
Affiliation(s)
- Trevor I Dix
- Faculty of Information Technology, Monash University, Clayton, 3800, Australia
- Victorian Bioinformatics Consortium, Monash University, Clayton, 3800, Australia
- David R Powell
- Faculty of Information Technology, Monash University, Clayton, 3800, Australia
- Victorian Bioinformatics Consortium, Monash University, Clayton, 3800, Australia
- Lloyd Allison
- Faculty of Information Technology, Monash University, Clayton, 3800, Australia
- Julie Bernal
- Faculty of Information Technology, Monash University, Clayton, 3800, Australia
- Samira Jaeger
- Faculty of Information Technology, Monash University, Clayton, 3800, Australia
- Linda Stern
- Computer Science and Software Engineering, University of Melbourne, Melbourne, 3010, Australia
|
21
|
|
22
|
Sadovsky MG. Information capacity of nucleotide sequences and its applications. Bull Math Biol 2006; 68:785-806. [PMID: 16802083 DOI: 10.1007/s11538-005-9017-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 04/21/2004] [Accepted: 03/10/2005] [Indexed: 10/24/2022]
Abstract
The information capacity of nucleotide sequences is defined through the specific entropy of the frequency dictionary of a sequence, determined with respect to another dictionary containing the most probable continuations of shorter strings. This measure distinguishes a sequence both from a random one and from an ordered entity. A comparison of sequences based on their information capacity is studied. An order within the genetic entities is found at length scales ranging from 3 to 8. Some other applications of the developed methodology to genetics, bioinformatics, and molecular biology are discussed.
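One way to read this definition is sketched below in Python. The abstract does not give the formula, so the following is an assumption: the reconstructed frequency of a word of length q is taken as f(w[:-1]) * f(w[1:]) / f(w[1:-1]), and the information capacity as the sum of f(w) * ln(f(w) / reconstructed f(w)) over all observed words. The sequence is treated as circular so that the marginal frequencies stay consistent. The function names are hypothetical.

```python
from collections import Counter
import math


def frequency_dictionary(seq: str, q: int) -> dict:
    """Frequencies of all length-q words, reading seq as a ring (assumption)."""
    n = len(seq)
    ring = seq + seq[: q - 1]
    counts = Counter(ring[i : i + q] for i in range(n))
    return {word: count / n for word, count in counts.items()}


def information_capacity(seq: str, q: int) -> float:
    """Relative entropy of the real length-q dictionary against the one
    reconstructed from length-(q-1) words; requires q >= 2.

    Assumed reconstruction: f~(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1]).
    """
    f_q = frequency_dictionary(seq, q)
    f_q1 = frequency_dictionary(seq, q - 1)
    # For q == 2 the overlap w[1:-1] is the empty word, with frequency 1.
    f_q2 = frequency_dictionary(seq, q - 2) if q > 2 else {"": 1.0}
    total = 0.0
    for word, freq in f_q.items():
        expected = f_q1[word[:-1]] * f_q1[word[1:]] / f_q2[word[1:-1]]
        total += freq * math.log(freq / expected)
    return total


periodic = "ACGT" * 50
print(information_capacity(periodic, 2))
print(information_capacity(periodic, 3))
```

For the periodic toy sequence, pairs are not predictable from single-base frequencies, so the value at q = 2 is positive, while triplets are fully determined by pairs, so the value at q = 3 is zero.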
Affiliation(s)
- M G Sadovsky
- Institute of Biophysics of Siberian Division of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036, Russia.
|
23
|
Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol 2004; 231:377-88. [PMID: 15501469 DOI: 10.1016/j.jtbi.2004.06.030] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Received: 05/21/2004] [Accepted: 06/30/2004] [Indexed: 11/20/2022]
Abstract
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique to evaluate global sequence randomness without some of the former method's drawbacks. The asymptotic behaviour of this new measure was analytically deduced, and the entropies of several synthetic and experimental biological sequences were calculated. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences showed a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provides additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
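The combination described here can be sketched in a few lines of Python (the paper's own code is in MATLAB and is not reproduced). The sketch assumes a chaos game representation (CGR) on the unit square with an arbitrary corner assignment, an isotropic Gaussian Parzen kernel of width `sigma`, and integration over the whole plane, in which case the integral of the squared density estimate reduces to a double sum of Gaussians with variance 2 * sigma^2. The function names and the choice `sigma = 0.05` are illustrative, not taken from the paper.

```python
import math
import random

# Assumed corner assignment for the CGR unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq: str) -> list:
    """Map a DNA string to CGR points: move halfway to the corner of each base."""
    x, y = 0.5, 0.5
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = 0.5 * (x + cx), 0.5 * (y + cy)
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points: list, sigma: float) -> float:
    """H2 = -ln(integral of f^2), with f a 2-D Gaussian Parzen estimate.

    The integral equals the mean over all point pairs of a Gaussian
    with variance 2 * sigma**2 evaluated at the pair's difference.
    """
    n = len(points)
    norm = 1.0 / (4.0 * math.pi * sigma**2)
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (4.0 * sigma**2))
    return -math.log(total / n**2)


rng = random.Random(1)
random_seq = "".join(rng.choice("ACGT") for _ in range(400))
repeat_seq = "ACGT" * 100

h_random = renyi_quadratic_entropy(cgr_points(random_seq), sigma=0.05)
h_repeat = renyi_quadratic_entropy(cgr_points(repeat_seq), sigma=0.05)
print(h_random, h_repeat)
```

A strictly periodic sequence collapses onto a few CGR points and yields a lower entropy than a random sequence of the same length, which spreads over the square. The double loop is quadratic in sequence length, so this form is only practical for short sequences.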
Affiliation(s)
- Susana Vinga
- Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, R. Qta. Grande 6, 2780-156 Oeiras, Portugal.
|
24
|
Sadovsky MG. Comparison of Real Frequencies of Strings vs. the Expected Ones Reveals the Information Capacity of Macromoleculae. J Biol Phys 2003; 29:23-38. [PMID: 23345817 PMCID: PMC3456843 DOI: 10.1023/a:1022554613105] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Abstract
The information capacity of nucleotide sequences is defined through the calculation of the specific entropy of their frequency dictionary. The specific entropy of the frequency dictionary is calculated against the reconstructed dictionary; the latter bears the most probable continuations of the shorter strings. The resulting measure distinguishes sequences both from random ones and from those with a high level of (rather simple) order. Some implications of the developed methodology in the fields of genetics, bioinformatics, and molecular biology are discussed.
Affiliation(s)
- Michael G. Sadovsky
- Institute of Biophysics of Siberian Division of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, 660036
|
25
|
Abstract
A circular code has been identified in the protein (coding) genes of both eukaryotes and prokaryotes by using a statistical method called the trinucleotide frequency (TF) method [Arquès & Michel (1996). J. theor. Biol. 182, 45-58]. Recently, a probabilistic model based on nucleotide frequencies, with the hypothesis of no correlation between successive bases on a DNA strand, was proposed by Koch & Lehmann [(1997). J. theor. Biol. 189, 171-174] for constructing some particular circular codes. Their interesting method, which we call here the nucleotide frequency (NF) method, reveals several limits for constructing the circular code observed with protein genes.
Affiliation(s)
- J Lacan
- Laboratoire d'Informatique de Franche-Comté, Université de Franche-Comté, IUT de Belfort-Montbéliard, Montbéliard, France.
|
26
|
Wan H, Wootton JC. A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput Chem 2000. [DOI: 10.1016/s0097-8485(00)80008-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.3] [Indexed: 11/15/2022]
|