1
Zeng Q, Li R, Wang J. Improvement of Error Correction in Nonequilibrium Information Dynamics. Entropy (Basel) 2023; 25:881. [PMID: 37372225 DOI: 10.3390/e25060881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2023] [Revised: 05/22/2023] [Accepted: 05/22/2023] [Indexed: 06/29/2023]
Abstract
Errors are inevitable in information processing and transfer. While error correction is widely studied in engineering, the underlying physics is not fully understood. Due to the complexity and energy exchange involved, information transmission should be considered as a nonequilibrium process. In this study, we investigate the effects of nonequilibrium dynamics on error correction using a memoryless channel model. Our findings suggest that error correction improves as nonequilibrium increases, and the thermodynamic cost can be utilized to improve the correction quality. Our results inspire new approaches to error correction that incorporate nonequilibrium dynamics and thermodynamics, and highlight the importance of the nonequilibrium effects in error correction design, particularly in biological systems.
Affiliation(s)
- Qian Zeng
- State Key Laboratory of Electroanalytical Chemistry, Changchun Institute of Applied Chemistry, Changchun 130022, China
- Ran Li
- Center for Theoretical Interdisciplinary Sciences, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou 325001, China
- Jin Wang
- Department of Chemistry and Physics, State University of New York, Stony Brook, NY 11794, USA
2
Nandy A. Mapping Biomolecular Sequences: Graphical Representations - their Origins, Applications and Future Prospects. Comb Chem High Throughput Screen 2021; 25:354-364. [PMID: 33970841 DOI: 10.2174/1386207324666210510164743] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2020] [Revised: 01/25/2021] [Accepted: 02/11/2021] [Indexed: 11/22/2022]
Abstract
The exponential growth in the depositories of biological sequence data has generated an urgent need to store, retrieve and analyse the data efficiently and effectively, for which the standard practice of using alignment procedures is not adequate due to its high demand on computing resources and time. Graphical representation of sequences has become one of the most popular alignment-free strategies to analyse biological sequences, where each basic unit of the sequences - the bases adenine, cytosine, guanine and thymine for DNA/RNA, and the 20 amino acids for proteins - is plotted on a multi-dimensional grid. The resulting curve in 2D and 3D space and the implied graph in higher dimensions provide a perception of the underlying information of the sequences through visual inspection; numerical analyses, in geometrical or matrix terms, of the plots provide a measure of comparison between sequences and thus enable study of sequence hierarchies. The new approach has also enabled studies of comparisons of DNA sequences over many thousands of bases and provided new insights into the structure of the base compositions of DNA sequences. In this article we review in brief the origins and applications of graphical representations and highlight the future perspectives in this field.
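The plotting idea described in this abstract can be illustrated with a minimal sketch that turns a DNA string into a 2D walk. The axis assignment (A: -x, G: +x, C: +y, T: -y), the names `STEPS`, `dna_walk` and `centroid`, and the choice of the centroid as a numerical descriptor are assumptions made for this example, not details taken from the abstract.

```python
# Assumed axis assignment, for illustration only.
STEPS = {"A": (-1, 0), "G": (1, 0), "C": (0, 1), "T": (0, -1)}

def dna_walk(seq):
    """Return the list of (x, y) points visited, starting at the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in seq.upper():
        dx, dy = STEPS[base]  # raises KeyError on symbols other than A, C, G, T
        x, y = x + dx, y + dy
        points.append((x, y))
    return points

def centroid(points):
    """Mean x and mean y of the walk: one simple numerical descriptor of the plot."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
```

Two sequences could then be compared through the distance between their centroids, which is one possible reading of the "numerical analyses, in geometrical terms" mentioned above.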
Affiliation(s)
- Ashesh Nandy
- Centre for Interdisciplinary Research and Education, Kolkata 700068, India
3
Al Bataineh M, Al-qudah Z. A novel gene identification algorithm with Bayesian classification. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2016.07.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 11/26/2022]
4
Brunet TDP. Aims and methods of biosteganography. J Biotechnol 2016; 226:56-64. [PMID: 27021958 DOI: 10.1016/j.jbiotec.2016.03.044] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 02/24/2016] [Revised: 03/21/2016] [Accepted: 03/23/2016] [Indexed: 12/31/2022]
Abstract
Applications of biotechnology to information security are now possible and have potentially far-reaching political and technological implications. This change in information security practices, initiated by advancements in molecular biology and biotechnology, warrants reasonable and widespread consideration by biologists, biotechnologists and philosophers. I offer an explication of the landmark contributions, developments and current possibilities of biosteganography, the process of transmitting secure messages via biological mediums. I address (i) how information can be stored and encoded in biological mediums, (ii) how biological mediums (e.g. DNA, RNA, protein) and storage systems (e.g. cells, biofilms, organisms) influence the nature of information security, and (iii) what constitutes a viable application of such biotechnologies.
Affiliation(s)
- Tyler D P Brunet
- Computational Biology and Bioinformatics, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5.
5
Transmission of intra-cellular genetic information: a system proposal. J Theor Biol 2014; 358:208-31. [PMID: 24928152 DOI: 10.1016/j.jtbi.2014.05.040] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 12/20/2013] [Revised: 05/05/2014] [Accepted: 05/27/2014] [Indexed: 11/21/2022]
Abstract
One of the great challenges facing the scientific community in theories of genetic information, genetic communication and genetic coding is to determine a mathematical structure related to DNA sequences. In this paper we propose a model of an intra-cellular transmission system of genetic information, similar to a model of a power- and bandwidth-efficient digital communication system, in order to identify a mathematical structure in DNA sequences where such sequences are biologically relevant. The model of a transmission system of genetic information is concerned with the identification, reproduction and mathematical classification of the nucleotide sequence of single-stranded DNA by the genetic encoder. Hence, a genetic encoder is devised where labelings and cyclic codes are established. The establishment of the algebraic structure of the corresponding code alphabets, mappings, labelings, primitive polynomials (p(x)) and code generator polynomials (g(x)) is quite important in characterizing error-correcting code subclasses of G-linear codes. These latter codes are useful for the identification, reproduction and mathematical classification of DNA sequences. The characterization of this model may contribute to the development of a methodology that can be applied in mutational analysis and polymorphisms, production of new drugs and genetic improvement, among other things, resulting in the reduction of time and laboratory costs.
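The role of a code generator polynomial g(x) in a cyclic code can be sketched with a small binary example. The generator g(x) = x^3 + x + 1 of the (7,4) cyclic Hamming code, the binary alphabet, and the names `poly_mod`, `is_codeword` and `cyclic_shift` are assumptions chosen for illustration; they do not reproduce the alphabets, labelings or polynomials used in the paper.

```python
def poly_mod(dividend, divisor):
    """Remainder of polynomial division over GF(2).

    Polynomials are stored as ints: bit i is the coefficient of x^i.
    """
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # Cancel the leading term of the dividend with a shifted copy of the divisor.
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

G = 0b1011  # assumed generator g(x) = x^3 + x + 1, block length n = 7

def is_codeword(word):
    """A word belongs to the cyclic code iff g(x) divides it."""
    return poly_mod(word, G) == 0

def cyclic_shift(word, n=7):
    """Rotate an n-bit word by one position (multiplication by x modulo x^n + 1)."""
    return ((word << 1) | (word >> (n - 1))) & ((1 << n) - 1)
```

Because g(x) divides x^7 + 1, every cyclic shift of a codeword is again a codeword, which is the defining property of a cyclic code.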
6
Liu X, Geng X. A convolutional code-based sequence analysis model and its application. Int J Mol Sci 2013; 14:8393-405. [PMID: 23591850 PMCID: PMC3645750 DOI: 10.3390/ijms14048393] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/19/2013] [Revised: 03/28/2013] [Accepted: 04/10/2013] [Indexed: 11/16/2022]
Abstract
A new approach for encoding DNA sequences as input for DNA sequence analysis is proposed using the error correction coding theory of communication engineering. The encoder was designed as a convolutional code model whose generator matrix is designed based on the degeneracy of codons, with a codon treated in the model as an informational unit. The utility of the proposed model was demonstrated through the analysis of twelve prokaryote and nine eukaryote DNA sequences having different GC contents. Distinct differences in code distances were observed near the initiation and termination sites in the open reading frame, which provided a well-regulated characterization of the DNA sequences. Clearly distinguished period-3 features appeared in the coding regions, and the characteristic average code distances of the analyzed sequences were approximately proportional to their GC contents, particularly in the selected prokaryotic organisms, presenting the potential utility as an added taxonomic characteristic for use in studying the relationships of living organisms.
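For readers unfamiliar with convolutional codes, the following is a minimal sketch of a generic binary rate-1/2 encoder together with a Hamming distance helper. The generators (binary 111 and 101), the binary input alphabet, and the names `conv_encode` and `hamming` are assumptions for illustration; the codon-degeneracy-based generator matrix described in the abstract is not reproduced here.

```python
def conv_encode(bits, generators=(0b111, 0b101), memory=2):
    """Encode a bit list with a feed-forward convolutional encoder.

    Each input bit yields one output bit per generator: the parity of the
    register contents selected by that generator's taps.
    """
    state = 0  # the previous `memory` input bits, most recent in the high bit
    out = []
    for b in bits:
        reg = (b << memory) | state  # current bit followed by the stored bits
        for g in generators:
            out.append(bin(reg & g).count("1") % 2)
        state = reg >> 1  # drop the oldest bit
    return out

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))
```

A code distance between two encoded sequences could then be taken as `hamming(conv_encode(u), conv_encode(v))` for equal-length inputs `u` and `v`.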
Affiliation(s)
- Xiao Liu
- College of Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
- Xiaoli Geng
- College of Communication Engineering, Chongqing University, 174 ShaPingBa District, Chongqing 400044, China
7
Faria LCB, Rocha ASL, Kleinschmidt JH, Silva-Filho MC, Bim E, Herai RH, Yamagishi MEB, Palazzo R. Is a genome a codeword of an error-correcting code? PLoS One 2012; 7:e36644. [PMID: 22649495 PMCID: PMC3359345 DOI: 10.1371/journal.pone.0036644] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 12/14/2011] [Accepted: 04/04/2012] [Indexed: 11/19/2022]
Abstract
Since a genome is a discrete sequence, the elements of which belong to a set of four letters, the question as to whether or not there is an error-correcting code underlying DNA sequences is unavoidable. The most common approach to answering this question is to propose a methodology to verify the existence of such a code. However, none of the methodologies proposed so far, although quite clever, has achieved that goal. In a recent work, we showed that DNA sequences can be identified as codewords in a class of cyclic error-correcting codes known as Hamming codes. In this paper, we show that a complete intron-exon gene, and even a plasmid genome, can be identified as a Hamming code codeword as well. Although this does not constitute a definitive proof that there is an error-correcting code underlying DNA sequences, it is the first evidence in this direction.
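What it means to identify a sequence as a Hamming code codeword can be sketched on the textbook binary (7,4) Hamming code. The binary alphabet, the block length of 7, and the names `syndrome` and `correct_single_error` are assumptions for illustration; the mapping from nucleotides to code symbols used by the authors is not reproduced.

```python
def syndrome(word):
    """XOR of the 1-based positions that hold a 1.

    For the binary Hamming code whose parity-check columns are the binary
    expansions of 1..7, a zero syndrome means `word` is a codeword.
    """
    if len(word) != 7:
        raise ValueError("this sketch handles length-7 words only")
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s

def correct_single_error(word):
    """Return a copy of `word` with the single flipped bit (if any) restored."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1  # a nonzero syndrome names the erroneous position
    return fixed
```

A word that differs from a codeword in exactly one position is mapped back to that codeword; a word with a zero syndrome is accepted unchanged.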
Affiliation(s)
- Luzinete C. B. Faria
- Departamento de Telemática, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
- Andréa S. L. Rocha
- Departamento de Telemática, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
- João H. Kleinschmidt
- Centro de Engenharia, Modelagem e Ciências Sociais Aplicadas, Universidade Federal do ABC, Santo André, São Paulo, Brazil
- Márcio C. Silva-Filho
- Departamento de Genética, Escola Superior de Agricultura Luiz de Queiroz, Universidade de São Paulo, São Paulo, Brazil
- Edson Bim
- Departamento de Sistema de Controle de Energia, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
- Roberto H. Herai
- Department of Cellular & Molecular Medicine, School of Medicine, University of California San Diego, La Jolla, California, United States of America
- Michel E. B. Yamagishi
- Embrapa Informática Agropecuária, Laboratório de Bioinformática Aplicada, Campinas, São Paulo, Brazil
- Reginaldo Palazzo
- Departamento de Telemática, Universidade Estadual de Campinas, Campinas, São Paulo, Brazil
8
Nandy A. Empirical relationship between intra-purine and intra-pyrimidine differences in conserved gene sequences. PLoS One 2009; 4:e6829. [PMID: 19714250 PMCID: PMC2730015 DOI: 10.1371/journal.pone.0006829] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 05/18/2009] [Accepted: 07/24/2009] [Indexed: 11/18/2022]
Abstract
DNA sequences seen in the normal character-based representation appear to have a formidable mixing of the four nucleotides without any apparent order. Nucleotide frequencies and distributions in the sequences have been studied extensively, since the simple rule given by Chargaff almost a century ago that equates the total number of purines to the pyrimidines in a duplex DNA sequence. While it is difficult to trace any relationship between the bases from studies in the character representation of a DNA sequence, graphical representations may provide a clue. These novel representations of DNA sequences have been useful in providing an overview of base distribution and composition of the sequences and providing insights into many hidden structures. We report here our observation, based on a graphical representation, that the intra-purine and intra-pyrimidine differences in sequences of conserved genes generally follow a quadratic distribution relationship, and show that this may have arisen from mutations in the sequences over evolutionary time scales. From this hitherto undescribed relationship for the gene sequences considered in this report, we hypothesize that such relationships may be characteristic of these sequences and therefore could become a barrier to large-scale sequence alterations that override such characteristics, perhaps through some monitoring process inbuilt in the DNA sequences. Such a relationship also raises the possibility of intron sequences playing an important role in maintaining the characteristics and could be indicative of possible intron-late phenomena.
Affiliation(s)
- Ashesh Nandy
- School of Environmental Studies, Jadavpur University, Kolkata, West Bengal, India.
9
MacDónaill DA. Digital parity and the composition of the nucleotide alphabet. Shaping the alphabet with error coding. IEEE Eng Med Biol Mag 2006; 25:54-61. [PMID: 16485392 DOI: 10.1109/memb.2006.1578664] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/07/2022]
Affiliation(s)
- Dónall A MacDónaill
- School of Chemistry, University of Dublin, Trinity College, Dublin, Ireland.
10
11
Affiliation(s)
- Gail L Rosen
- Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta 30332-0250, USA.
12
Affiliation(s)
- Diego Luis Gonzalez
- Laboratorio di acustica musicale e architettonica, CNR-Fondazione Scuola di S. Giorgio, Venezia, Italy.
13
Affiliation(s)
- Manish K Gupta
- Department of Mathematics and Statistics, Queens University, Kingston, Ontario, Canada.
14
Abel DL, Trevors JT. Three subsets of sequence complexity and their relevance to biopolymeric information. Theor Biol Med Model 2005; 2:29. [PMID: 16095527 PMCID: PMC1208958 DOI: 10.1186/1742-4682-2-29] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.0] [Received: 05/23/2005] [Accepted: 08/11/2005] [Indexed: 11/24/2022]
Abstract
Genetic algorithms instruct sophisticated biological organization. Three qualitative kinds of sequence complexity exist: random (RSC), ordered (OSC), and functional (FSC). FSC alone provides algorithmic instruction. Random and Ordered Sequence Complexities lie at opposite ends of the same bi-directional sequence complexity vector. Randomness in sequence space is defined by a lack of Kolmogorov algorithmic compressibility. A sequence is compressible because it contains redundant order and patterns. Law-like cause-and-effect determinism produces highly compressible order. Such forced ordering precludes both information retention and freedom of selection so critical to algorithmic programming and control. Functional Sequence Complexity requires this added programming dimension of uncoerced selection at successive decision nodes in the string. Shannon information theory measures the relative degrees of RSC and OSC. Shannon information theory cannot measure FSC. FSC is invariably associated with all forms of complex biofunction, including biochemical pathways, cycles, positive and negative feedback regulation, and homeostatic metabolism. The algorithmic programming of FSC, not merely its aperiodicity, accounts for biological organization. No empirical evidence exists of either RSC or OSC ever having produced a single instance of sophisticated biological organization. Organization invariably manifests FSC rather than successive random events (RSC) or low-informational self-ordering phenomena (OSC).
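The contrast between compressible order and incompressible randomness can be made concrete with a rough sketch. Kolmogorov complexity itself is not computable, so the example uses the `zlib` compression ratio as a practical upper-bound proxy; that substitution, the example sequences, and the name `compression_ratio` are assumptions for illustration, not the authors' method.

```python
import random
import zlib

def compression_ratio(seq):
    """Compressed size divided by raw size; lower means more redundant order."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

# A strictly periodic ("ordered") sequence and a pseudo-random one of equal length.
ordered = "ACGT" * 1000
rng = random.Random(0)  # fixed seed so the example is repeatable
shuffled = "".join(rng.choice("ACGT") for _ in range(4000))
```

Comparing `compression_ratio(ordered)` with `compression_ratio(shuffled)` shows the periodic sequence shrinking to a small fraction of its size while the pseudo-random one stays near the two-bits-per-base limit of a four-letter alphabet. Neither number says anything about function, which is the distinction the abstract draws.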
Affiliation(s)
- David L Abel
- Director, The Gene Emergence Project, The Origin-of-Life Foundation, Inc., 113 Hedgewood Dr., Greenbelt, MD 20770-1610 USA
- Jack T Trevors
- Professor, Department of Environmental Biology, University of Guelph, Rm 3220 Bovey Building, Guelph, Ontario, N1G 2W1, Canada
15
May EE, Vouk MA, Bitzer DL, Rosnick DI. Coding theory based models for protein translation initiation in prokaryotic organisms. Biosystems 2004; 76:249-60. [PMID: 15351148 DOI: 10.1016/j.biosystems.2004.05.017] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 02/28/2003] [Revised: 07/11/2003] [Accepted: 08/01/2003] [Indexed: 11/21/2022]
Abstract
Our research explores the feasibility of using communication theory, error control (EC) coding theory specifically, for quantitatively modeling the protein translation initiation mechanism. The messenger RNA (mRNA) of Escherichia coli K-12 is modeled as a noisy (errored), encoded signal and the ribosome as a minimum Hamming distance decoder, where the 16S ribosomal RNA (rRNA) serves as a template for generating a set of valid codewords (the codebook). We tested the E. coli based coding models on 5' untranslated leader sequences of prokaryotic organisms of varying taxonomical relation to E. coli, including Salmonella typhimurium LT2, Bacillus subtilis, and Staphylococcus aureus Mu50. The model identified regions on the 5' untranslated leader where the minimum Hamming distance values of translated mRNA sub-sequences and non-translated genomic sequences differ the most. These regions correspond to the Shine-Dalgarno domain and the non-random domain. Applying the EC coding-based models to B. subtilis and S. aureus Mu50 yielded results similar to those for E. coli K-12. Contrary to our expectations, the behavior of S. typhimurium LT2, the organism most closely related to E. coli taxonomically, resembled that of the non-translated sequence group.
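The minimum Hamming distance decoding described in this abstract can be sketched as follows. The three five-base codewords in `CODEBOOK` are hypothetical placeholders rather than the codebook derived from the 16S rRNA, and the function names are likewise assumptions for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def min_distance_decode(received, codebook):
    """Return the codeword closest to `received` (first one wins on ties)."""
    return min(codebook, key=lambda cw: hamming(received, cw))

def sliding_min_distance(seq, codebook):
    """Minimum distance to the codebook for every window along `seq`."""
    k = len(codebook[0])
    return [
        min(hamming(seq[i:i + k], cw) for cw in codebook)
        for i in range(len(seq) - k + 1)
    ]

# Hypothetical codebook, for illustration only.
CODEBOOK = ["AGGAG", "GGAGG", "GAGGU"]
```

Plotting the output of `sliding_min_distance` against position along a leader sequence would give a profile of the kind the abstract compares between translated and non-translated groups.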
Affiliation(s)
- Elebeoba E May
- Computational Biology Department, Sandia National Laboratories, Albuquerque, NM 87185, USA.
16
Mac Dónaill DA. Why nature chose A, C, G and U/T: an error-coding perspective of nucleotide alphabet composition. Orig Life Evol Biosph 2003; 33:433-55. [PMID: 14604185 DOI: 10.1023/a:1025715209867] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.0] [Indexed: 11/12/2022]
Abstract
The question of whether the size and make-up of the natural nucleotide alphabet is a consequence of selection pressure, or simply a frozen accident, is one of the fundamental questions of biology. Nucleotide replication is essentially an information transmission phenomenon, and so it seems reasonable to explore the issue from the perspective of theoretical computer science, and of error-coding theory in particular. In this analysis it is shown that the essential recognition features of nucleotides may be naturally expressed as 4-digit binary numbers, capturing the hydrogen acceptor/donor patterns (3-bits) and the purine/pyrimidine feature (1-bit). Optimal alphabets consist of nucleotides in which the purine/pyrimidine feature is related to the acceptor/donor pattern as a parity bit. Numerically interpreted, such alphabets correspond to parity check codes, simple but effective error-resistant structures. The natural alphabet appears to be an adaptation of one of two optimal solutions, constrained to its present size and composition by a combination of chemical and coding-theory factors.
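The parity-check structure described in this abstract can be sketched with generic 4-bit words: three pattern bits followed by one parity bit. The use of even parity, the enumeration of all eight 3-bit patterns, and the names below are assumptions for illustration; no assignment of specific bit patterns to A, C, G or U/T is implied.

```python
from itertools import combinations, product

def with_parity(pattern):
    """Append an even-parity bit to a tuple of pattern bits."""
    return pattern + (sum(pattern) % 2,)

def is_valid(word):
    """A word passes the parity check when it holds an even number of 1s."""
    return sum(word) % 2 == 0

def min_distance(code):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(
        sum(a != b for a, b in zip(u, v)) for u, v in combinations(code, 2)
    )

# All eight 3-bit patterns, each extended to a 4-bit parity codeword.
CODE = [with_parity(p) for p in product((0, 1), repeat=3)]
```

Every pair of codewords differs in at least two positions, so flipping any single bit of a codeword yields a word that fails `is_valid`; this is the error-resistance property that a parity bit provides.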
17
Abstract
We describe a microarray design based on the concept of error-correcting codes from digital communication theory. Currently, microarrays are unable to efficiently deal with "drop-outs," when one or more spots on the array are corrupted. The resulting information loss may lead to decoding errors in which no quantitation of expression can be extracted for the corresponding genes. This issue is expected to become increasingly problematic as the number of spots on microarrays expands to accommodate the entire genome. The error-correcting approach employs multiplexing (encoding) of more than one gene onto each spot to efficiently provide robustness to drop-outs in the array. Decoding then allows fault-tolerant recovery of the expression information from individual genes. The error-correcting method is general and may have important implications for future array designs in research and diagnostics.
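The idea of multiplexing genes onto spots so that a drop-out can be tolerated can be sketched with the smallest possible layout: two genes on three spots. The layout (gene a alone, gene b alone, both together), the assumption that a multiplexed spot reads as the sum of the two expression levels, and the names `encode` and `decode` are all hypothetical choices for illustration, not the design described by the authors.

```python
def encode(a, b):
    """Spot intensities for two genes: a alone, b alone, and a plus b multiplexed."""
    return [a, b, a + b]

def decode(spots):
    """Recover (a, b) when at most one spot is None (a drop-out)."""
    if sum(s is None for s in spots) > 1:
        raise ValueError("this layout cannot recover more than one drop-out")
    s_a, s_b, s_ab = spots
    if s_a is None:
        s_a = s_ab - s_b  # rebuild a from the multiplexed spot
    if s_b is None:
        s_b = s_ab - s_a  # rebuild b from the multiplexed spot
    return (s_a, s_b)
```

Losing any one of the three spots still leaves enough information to recover both expression values, at the cost of one extra spot per gene pair.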
Affiliation(s)
- Arshad H Khan
- Department of Molecular and Medical Pharmacology, Crump Institute for Biomedical Imaging, University of California at Los Angeles School of Medicine, 90095, USA
18
Barrette IH, McKenna S, Taylor DR, Forsdyke DR. Introns resolve the conflict between base order-dependent stem-loop potential and the encoding of RNA or protein: further evidence from overlapping genes. Gene 2001; 270:181-9. [PMID: 11404015 DOI: 10.1016/s0378-1119(01)00477-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 10/17/2022]
Abstract
Many eukaryotic genes are split into exons and introns, the latter being removed post-transcriptionally so that only exon sequences appear in cytoplasmic RNAs. Since introns appear in both protein-encoding RNAs and non-protein-coding RNAs, they interrupt genetic information per se, not just protein-encoding information. A DNA sequence has the potential to carry more than one type of genetic information, but different types may conflict. Thus, it has been proposed that introns arose because sequences were unable to contain concomitantly complete information for the encoding both of stem-loops and of cytoplasmic products (protein and/or RNA). Stem-loop potential is held to be selectively advantageous since it promotes the recombination-dependent correction of genetic errors. Stem-loop potential, the best local measure of which is base order-dependent stem-loop potential, tends to be less in exons than in introns. This is particularly evident in genes evolving rapidly under positive Darwinian selection, where the protein-encoding function is dominant. Evidence is now presented that the rare regions where genes overlap also impose excessive encoding demands so that the concomitant coding of base order-dependent stem-loop potential is decreased. Our results are consistent with the hypothesis that sequences with high stem-loop potential arose in the early 'RNA world'. Ancestors of modern genes would have entered this world when sequences (exons) encoding cytoplasmic products, were interspersed with sequences (introns) encoding selectively advantageous stem-loops. Purine-loading pressure would also have favoured intron formation.
Affiliation(s)
- I H Barrette
- Department of Biochemistry, Queen's University, Kingston, K7L3N6, Ontario, Canada