1
Korotkov E, Zaytsev K, Fedorov A. Use of 6 Nucleotide Length Words to Study the Complexity of Gene Sequences from Different Organisms. Entropy (Basel) 2022; 24:632. [PMID: 35626518 PMCID: PMC9141341 DOI: 10.3390/e24050632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 04/23/2022] [Accepted: 04/27/2022] [Indexed: 12/02/2022]
Abstract
In this paper, we attempted to find a relation between the living conditions of bacteria and the algorithmic complexity of their genomes. We developed a probabilistic mathematical method for evaluating the irregularity of occurrence of k-words (6 bases long) in bacterial gene coding sequences. For this, coding sequences from different bacterial genomes were analyzed, and as an index of k-word occurrence irregularity we used W, which has a distribution similar to normal. The results show that bacterial genomes can be divided into two groups of unequal size. The first, smaller group has W in the interval from 170 to 475, while the second has W from 475 to 875. Plant, metazoan and virus genomes also have W in the same interval as the first bacterial group. We suggest that the coding sequences of the second bacterial group are much less susceptible to evolutionary changes than those of the first group. The use of the W index as a measure of biological stress is also discussed.
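The abstract does not define W, so the following is only a rough sketch of the general idea of measuring how unevenly 6-letter words occur in a coding sequence. The function names `count_kwords` and `irregularity` are hypothetical, and the chi-square-style score against a uniform expectation is an assumed stand-in, not the authors' W index.

```python
from collections import Counter
from itertools import product


def count_kwords(seq: str, k: int = 6) -> Counter:
    """Count overlapping k-letter words that consist only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip words with N or other symbols
            counts[word] += 1
    return counts


def irregularity(seq: str, k: int = 6) -> float:
    """Chi-square-style deviation of k-word counts from a uniform expectation.

    Illustrative stand-in only; this is NOT the W index from the paper.
    """
    counts = count_kwords(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    expected = total / 4 ** k  # every one of the 4**k words equally likely
    return sum(
        (counts["".join(word)] - expected) ** 2 / expected
        for word in product("ACGT", repeat=k)
    )
```

A sequence in which every word occurs equally often scores 0, and the score grows as a few words dominate. A realistic analysis would likely replace the uniform expectation with one derived from base or codon frequencies.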
Affiliation(s)
- Eugene Korotkov
- Institute of Bioengineering, Federal Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
- Konstantin Zaytsev
- Bach Institute of Biochemistry, Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
- Alexey Fedorov
- Bach Institute of Biochemistry, Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
2
Korotkov EV, Kamionskya AM, Korotkova MA. Detection of Highly Divergent Tandem Repeats in the Rice Genome. Genes (Basel) 2021; 12:473. [PMID: 33806152 PMCID: PMC8064497 DOI: 10.3390/genes12040473] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/17/2021] [Revised: 03/11/2021] [Accepted: 03/23/2021] [Indexed: 11/25/2022]
Abstract
Currently, there is a lack of bioinformatics approaches to identify highly divergent tandem repeats (TRs) in eukaryotic genomes. Here, we developed a new mathematical method to search for TRs, which uses a novel algorithm for constructing multiple alignments based on the generation of random position weight matrices (RPWMs), and applied it to detect TRs 2 to 50 nucleotides long in the rice genome. The RPWM method could find highly divergent TRs in the presence of insertions or deletions. Comparison of the RPWM algorithm with other methods of TR identification showed that RPWM could detect TRs in which the average number of base substitutions per nucleotide (x) was between 1.5 and 3.2, whereas the T-REKS and TRF methods could not detect divergent TRs with x > 1.5. When applied to the search for TRs in the rice genome, the RPWM method revealed that TRs occupied 5% of the genome and that most of them were 2 and 3 bases long. Using RPWM, we also revealed a correlation of TRs with dispersed repeats and transposons, suggesting that some transposons originated from TRs. Thus, the novel RPWM algorithm is an effective tool to search for highly divergent TRs in genomes.
Affiliation(s)
- Eugene V Korotkov
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld.2, 33 Leninsky Ave., 119071 Moscow, Russia
- MEPhI (Moscow Engineering Physics Institute), National Research Nuclear University, 31 Kashirskoye Shosse, 115409 Moscow, Russia
- Anastasiya M Kamionskya
- Institute of Bioengineering, Research Center of Biotechnology of the Russian Academy of Sciences, Bld.2, 33 Leninsky Ave., 119071 Moscow, Russia
- Maria A Korotkova
- MEPhI (Moscow Engineering Physics Institute), National Research Nuclear University, 31 Kashirskoye Shosse, 115409 Moscow, Russia
3
Pugacheva V, Korotkov A, Korotkov E. Search of latent periodicity in amino acid sequences by means of genetic algorithm and dynamic programming. Stat Appl Genet Mol Biol 2017; 15:381-400. [PMID: 27337743 DOI: 10.1515/sagmb-2015-0079] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Indexed: 11/15/2022]
Abstract
The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. A multiple alignment of periods was calculated with the help of direct optimization of the position-weight matrix, without using pairwise alignments. The developed algorithm was applied to analyze amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which latent periodicity was not previously known. The origin of latent periodicity with insertions and deletions is discussed.
4
Korotkov EV, Korotkova MA. Developing a mathematical method to search for latent periodicity in protein amino-acid sequences with deletions and insertions. Biophysics (Nagoya-shi) 2015. [DOI: 10.1134/s0006350915060159] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
5
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023]
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097
6
Simakova MN, Simakov NN. Computational methods for predicting structure of membrane proteins using amino acid sequences. Mol Biol 2013. [DOI: 10.1134/s0026893313010159] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/22/2022]
7
Sequence periodic pattern of HERV LTRs: A matrix simulation algorithm. J Biosci 2012; 37:19-24. [DOI: 10.1007/s12038-012-9182-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
8
Shelenkov A, Korotkov E. Search of regular sequences in promoters from eukaryotic genomes. Comput Biol Chem 2009; 33:196-204. [DOI: 10.1016/j.compbiolchem.2009.03.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 06/26/2008] [Revised: 02/08/2009] [Accepted: 03/18/2009] [Indexed: 12/14/2022]
9
Paar V, Pavin N, Basar I, Rosandić M, Gluncić M, Paar N. Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats. BMC Bioinformatics 2008; 9:466. [PMID: 18980673 PMCID: PMC2661002 DOI: 10.1186/1471-2105-9-466] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 03/12/2008] [Accepted: 11/03/2008] [Indexed: 11/28/2022]
Abstract
Background
Identification of approximate tandem repeats is an important task of broad significance and still remains a challenging problem of computational genomics. Often there is no single best approach to periodicity detection and a combination of different methods may improve the prediction accuracy. Discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of DFT method to identify and study alphoid higher order repeats.
Results
We used method based on DFT with mapping of symbolic into numerical sequence to identify and study alphoid higher order repeats (HOR). For HORs the power spectrum shows equidistant frequency pattern, with characteristic two-level hierarchical organization as signature of HOR. Our case study was the 16 mer HOR tandem in AC017075.8 from human chromosome 7. Very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on fundamental frequency of 16 mer HOR. Pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for nmer) and higher harmonics. In general, nmer HOR-pattern contains equidistant secondary periodicity peaks, having a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and periodicity three pattern are missing from power spectra in alphoid regions, in accordance with expectations.
Conclusion
DFT provides a robust detection method for higher order periodicity. Easily recognizable HOR power spectrum is characterized by hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR-frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of nmer HOR, i.e., the number n of monomers contained in consensus HOR.
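As a minimal sketch of how a DFT power spectrum can expose a periodicity in a DNA sequence, the example below maps the symbolic sequence onto four binary indicator sequences, one per nucleotide, and sums their squared DFT magnitudes. The abstract does not say which symbolic-to-numerical mapping was used, so the indicator mapping is an assumption, and the names `power_spectrum` and `dominant_period` are hypothetical.

```python
import cmath


def power_spectrum(seq: str, alphabet: str = "ACGT") -> list:
    """Total DFT power at frequencies k/n for k = 1 .. n//2.

    Assumed mapping: one 0/1 indicator sequence per nucleotide.
    Direct O(n^2) evaluation, meant for short sequences only.
    """
    seq = seq.upper()
    n = len(seq)
    spectrum = []
    for k in range(1, n // 2 + 1):
        total = 0.0
        for base in alphabet:
            # DFT coefficient of the indicator sequence of this base
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, ch in enumerate(seq)
                if ch == base
            )
            total += abs(coeff) ** 2
        spectrum.append(total)  # spectrum[k - 1] is the power at k/n
    return spectrum


def dominant_period(seq: str) -> float:
    """Period (in bases) of the strongest peak in the power spectrum."""
    spectrum = power_spectrum(seq)
    best_k = max(range(len(spectrum)), key=spectrum.__getitem__) + 1
    return len(seq) / best_k
```

For a perfect repeat such as `"ACG" * 20`, all power in this range sits at k = 20, so `dominant_period` gives a period of 3. Detecting the two-level pattern described in the abstract would need inspection of the whole set of equidistant peaks rather than the single strongest one.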
Affiliation(s)
- Vladimir Paar
- Faculty of Science, University of Zagreb, Bijenicka 32, Zagreb, Croatia.
10
Criteria for confirming sequence periodicity identified by Fourier transform analysis: Application to GCR2, a candidate plant GPCR? Biophys Chem 2008; 133:28-35. [DOI: 10.1016/j.bpc.2007.11.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Received: 09/26/2007] [Revised: 11/15/2007] [Accepted: 11/15/2007] [Indexed: 11/19/2022]
11
Chechetkin VR, Lobzin VV. Anticodons, frameshifts, and hidden periodicities in tRNA sequences. J Biomol Struct Dyn 2006; 24:189-202. [PMID: 16928142 DOI: 10.1080/07391102.2006.10507112] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
Fourier analysis of the short-range periodicities for the complete set of sequences coding for tRNA genes in the genome of Bacillus subtilis proves that periodicities with periods p = 2, 3, 4, and 6 sites are inherent properties of tRNAs. These periodicities should be understood in a broad statistical sense, and identifying them requires elaborate statistical methods. To improve the statistics, the analysis of significant periodicities was performed for the binary R-Y, S-W, and K-M sequences. Generally, such short-range periodicities are produced via biased positioning of particular nucleotides rather than via the tandem multiplication and subsequent modification of repeats, though the latter mechanism may also be realized. Quasi-coherently piercing long segments of tRNA, the short-range periodicities create an effective long-range structural coupling between the acceptor stem and the anticodon loop and may participate in the mechanisms of molecular recognition. The periodicities with p = 2 and 4 provide a natural basis for translation with spontaneous or programmed frameshifting and are present in tRNAs decoding the most frameshift-prone codons. The observation of short-range periodicities suggests that the mechanisms of aminoacylation of tRNAs and codon-anticodon pairing are not independent. Their study may also provide important information related to the origin and evolution of the genetic code.
Affiliation(s)
- V R Chechetkin
- Troitsk Institute of Innovation and Thermonuclear Investigations (TRINITI), Theoretical Department of Division for Perspective Investigations, 142190 Troitsk, Moscow Region, Russia.
12
Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Latent periodicity of serine-threonine and tyrosine protein kinases and other protein families. Comput Biol Chem 2005; 29:229-43. [PMID: 15979043 DOI: 10.1016/j.compbiolchem.2005.04.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 11/27/2004] [Revised: 04/18/2005] [Accepted: 04/18/2005] [Indexed: 11/22/2022]
Abstract
We identified latent periodicity in catalytic domains of approximately 85% of annotated serine-threonine and tyrosine protein kinases. Similar results were obtained for 22 other protein families and domains. We also designed a method of noise decomposition, which aims to distinguish between different periodicity types of the same period length. The method is to be used in conjunction with the method of cyclic profile alignment, and this combination is able to reveal structure-related or function-related patterns of latent periodicity. Possible origins of the periodic structure of protein kinase active sites are discussed. In summary, we presume that latent periodicity is a common property of many catalytic protein domains.
Affiliation(s)
- Andrew A Laskin
- Bioengineering Center of Russian Academy of Sciences, Prospect 60-tya Oktyabrya, 7/1, 117312 Moscow, Russia.
13
Balakirev ES, Chechetkin VR, Lobzin VV, Ayala FJ. Entropy and GC Content in the beta-esterase gene cluster of the Drosophila melanogaster subgroup. Mol Biol Evol 2005; 22:2063-72. [PMID: 15972847 DOI: 10.1093/molbev/msi197] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
We perform spectral entropy and GC content analyses in the beta-esterase gene cluster, including the Est-6 gene and the psiEst-6 putative pseudogene, in seven species of the Drosophila melanogaster species subgroup. psiEst-6 combines features of functional and nonfunctional genes. The spectral entropies show distinctly lower structural ordering for psiEst-6 than for Est-6 in all species studied. Our observations agree with previous results for D. melanogaster and provide additional support to our hypothesis that after the duplication event Est-6 retained the esterase-coding function and its role during copulation, while psiEst-6 lost that function but now operates in conjunction with Est-6 as an intergene. Entropy accumulation is not a completely random process for either gene. Structural entropy is nucleotide dependent. The relative normalized deviations for structural entropy are higher for G than for C nucleotides. The entropy values are similar for Est-6 and psiEst-6 in the case of A and T but are lower for Est-6 in the case of G and C. The GC content in synonymous positions is uniformly higher in Est-6 than in psiEst-6, which agrees with the reduced GC content generally observed in pseudogenes and nonfunctional sequences. The observed differences in entropy and GC content reflect an evolutionary shift associated with the process of pseudogenization and subsequent functional divergence of psiEst-6 and Est-6 after the duplication event.
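The GC content comparison can be illustrated with a short sketch. The abstract does not say how synonymous positions were selected, so the example below uses GC content at third codon positions (often called GC3) as an assumed proxy; the function names `gc_content` and `gc3` are hypothetical, and the input is assumed to be an in-frame coding sequence.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T symbols of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc3(cds: str) -> float:
    """GC content at third codon positions of an in-frame coding sequence.

    Assumed proxy for synonymous positions; not every third-position
    change is synonymous, so this only approximates the abstract's measure.
    """
    return gc_content(cds[2::3])
```

Comparing `gc3` for a functional gene and its putative pseudogene, aligned to the same reading frame, would be one simple way to look for the kind of GC reduction the abstract reports.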
Affiliation(s)
- Evgeniy S Balakirev
- Department of Ecology and Evolutionary Biology, University of California, Irvine, CA, USA
14
Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV. Latent Periodicity of Serine/Threonine and Tyrosine Protein Kinases and Other Protein Families. Mol Biol 2005. [DOI: 10.1007/s11008-005-0052-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
15
Simakova MN, Simakov NN. Study of the periodic arrangement of amino acid residues in fiber proteins of bacteriophage T4. Mol Biol 2005. [DOI: 10.1007/s11008-005-0040-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/30/2022]
16
Abstract
Pseudogenes have been defined as nonfunctional sequences of genomic DNA originally derived from functional genes. It is therefore assumed that all pseudogene mutations are selectively neutral and have equal probability to become fixed in the population. Rather, pseudogenes that have been suitably investigated often exhibit functional roles, such as gene expression, gene regulation, generation of genetic (antibody, antigenic, and other) diversity. Pseudogenes are involved in gene conversion or recombination with functional genes. Pseudogenes exhibit evolutionary conservation of gene sequence, reduced nucleotide variability, excess synonymous over nonsynonymous nucleotide polymorphism, and other features that are expected in genes or DNA sequences that have functional roles. We first review the Drosophila literature and then extend the discussion to the various functional features identified in the pseudogenes of other organisms. A pseudogene that has arisen by duplication or retroposition may, at first, not be subject to natural selection if the source gene remains functional. Mutant alleles that incorporate new functions may, nevertheless, be favored by natural selection and will have enhanced probability of becoming fixed in the population. We agree with the proposal that pseudogenes be considered as potogenes, i.e., DNA sequences with a potentiality for becoming new genes.
Affiliation(s)
- Evgeniy S Balakirev
- Department of Ecology and Evolutionary Biology, University of California, Irvine, California 92697-2525, USA.