1
Costa MO, Silva R, Anselmo DHAL. Superstatistical and DNA sequence coding of the human genome. Phys Rev E 2022; 106:064407. [PMID: 36671113 DOI: 10.1103/physreve.106.064407] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/06/2022] [Accepted: 11/16/2022] [Indexed: 12/14/2022]
Abstract
In this work, by considering superstatistics we investigate the short-range correlations (SRCs) and the fluctuations in the distribution of lengths of strings of nucleotides. To this end, a stochastic model provides the distributions of the size of the exons based on the q-Gamma and inverse q-Gamma distributions. Specifically, we define a time series for exon sizes to investigate the SRC and the fluctuations through the superstatistics distributions. To test the model's viability, we use the Project Ensembl database of genes to extract the time evolution of exon sizes, calculated in terms of the number of base pairs (bp) in these biological databases. Our findings show that, depending on the chromosome, both distributions are suitable for describing the length distribution of human DNA for lengths greater than 10 bp. In addition, we used Bayesian statistics to perform a model selection approach, which revealed weak evidence for the inverse q-Gamma distribution for a considerable number of chromosomes.
Affiliation(s)
- M O Costa
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil
- R Silva
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil and Programa de Pós-Graduação em Física, Universidade do Estado do Rio Grande do Norte, Mossoró - Rio Grande do Norte, 59610-210, Brasil
- D H A L Anselmo
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal - RN, 59072-970, Brasil and Programa de Pós-Graduação em Física, Universidade do Estado do Rio Grande do Norte, Mossoró - Rio Grande do Norte, 59610-210, Brasil
2
de Lima MMF, Anselmo DHAL, Silva R, Nunes GHS, Fulco UL, Vasconcelos MS, Mello VD. A Bayesian Analysis of Plant DNA Length Distribution via κ-Statistics. Entropy (Basel, Switzerland) 2022; 24:1225. [PMID: 36141111 PMCID: PMC9497530 DOI: 10.3390/e24091225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 06/16/2023]
Abstract
We report an analysis of the distribution of lengths of plant DNA (exons). Three species of Cucurbitaceae were investigated. In our study, we used two distinct κ distribution functions, namely, κ-Maxwellian and double-κ, to fit the length distributions. To determine which distribution fits best, we performed a Bayesian analysis of the models. Furthermore, we filtered the data, removing outliers, through a box plot analysis. Our findings show that the sum of κ-exponentials is the most appropriate for fitting the distribution curves and that the values of the κ parameter do not change considerably after filtering. Furthermore, for the analyzed species, the κ parameter tends to lie within the interval (0.27, 0.43).
Affiliation(s)
- Maxsuel M. F. de Lima
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Dory H. A. L. Anselmo
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Raimundo Silva
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Glauber H. S. Nunes
  - Departamento de Ciências Vegetais, Universidade Federal Rural do Semi-Árido, Mossoró 59625-900, RN, Brazil
- Umberto L. Fulco
  - Departamento de Biofísica e Farmacologia, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Manoel S. Vasconcelos
  - Departamento de Física, Universidade Federal do Rio Grande do Norte, Natal 59072-970, RN, Brazil
- Vamberto D. Mello
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Natal 59072-970, RN, Brazil
3
Korotkov E, Zaytsev K, Fedorov A. Use of 6 Nucleotide Length Words to Study the Complexity of Gene Sequences from Different Organisms. Entropy (Basel, Switzerland) 2022; 24:632. [PMID: 35626518 PMCID: PMC9141341 DOI: 10.3390/e24050632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 04/23/2022] [Accepted: 04/27/2022] [Indexed: 12/02/2022]
Abstract
In this paper, we attempted to find a relation between bacterial living conditions and the algorithmic complexity of their genomes. We developed a probabilistic mathematical method for evaluating the irregularity of occurrence of k-words (6 bases in length) in bacterial gene coding sequences. For this, the coding sequences from different bacterial genomes were analyzed and, as an index of k-word occurrence irregularity, we used W, which has a distribution similar to normal. The results for bacterial genomes show that they can be divided into two uneven groups. The first, smaller group has W in the interval from 170 to 475, while for the second it is from 475 to 875. Plant, metazoan and virus genomes also have W in the same interval as the first bacterial group. We suggest that the coding sequences of the second bacterial group are much less susceptible to evolutionary changes than those of the first group. We also discuss using the W index as a measure of biological stress.
Affiliation(s)
- Eugene Korotkov
  - Institute of Bioengineering, Federal Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
- Konstantin Zaytsev
  - Bach Institute of Biochemistry, Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
- Alexey Fedorov
  - Bach Institute of Biochemistry, Research Center of Biotechnology of the Russian Academy of Sciences, 119071 Moscow, Russia
4
Bohnsack KS, Kaden M, Abel J, Saralajew S, Villmann T. The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers. Entropy (Basel, Switzerland) 2021; 23:1357. [PMID: 34682081 PMCID: PMC8534762 DOI: 10.3390/e23101357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2021] [Revised: 10/11/2021] [Accepted: 10/14/2021] [Indexed: 11/16/2022]
Abstract
In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
Affiliation(s)
- Katrin Sophie Bohnsack
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Marika Kaden
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Julia Abel
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
- Sascha Saralajew
  - Bosch Center for Artificial Intelligence, 71272 Renningen, Germany
- Thomas Villmann
  - Saxon Institute for Computational Intelligence and Machine Learning, University of Applied Sciences Mittweida, 09648 Mittweida, Germany
5
Vondrak T, Ávila Robledillo L, Novák P, Koblížková A, Neumann P, Macas J. Characterization of repeat arrays in ultra-long nanopore reads reveals frequent origin of satellite DNA from retrotransposon-derived tandem repeats. The Plant Journal: for Cell and Molecular Biology 2020; 101:484-500. [PMID: 31559657 PMCID: PMC7004042 DOI: 10.1111/tpj.14546] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 07/03/2019] [Revised: 09/09/2019] [Accepted: 09/12/2019] [Indexed: 05/21/2023]
Abstract
Amplification of monomer sequences into long contiguous arrays is the main feature distinguishing satellite DNA from other tandem repeats, yet it is also the main obstacle in its investigation because these arrays are in principle difficult to assemble. Here we explore an alternative, assembly-free approach that utilizes ultra-long Oxford Nanopore reads to infer the length distribution of satellite repeat arrays, their association with other repeats and the prevailing sequence periodicities. Using the satellite DNA-rich legume plant Lathyrus sativus as a model, we demonstrated this approach by analyzing 11 major satellite repeats using a set of nanopore reads ranging from 30 to over 200 kb in length and representing 0.73× genome coverage. We found surprising differences between the analyzed repeats because only two of them were predominantly organized in long arrays typical for satellite DNA. The remaining nine satellites were found to be derived from short tandem arrays located within LTR-retrotransposons that occasionally expanded in length. While the corresponding LTR-retrotransposons were dispersed across the genome, this array expansion occurred mainly in the primary constrictions of the L. sativus chromosomes, which suggests that these genome regions are favourable for satellite DNA accumulation.
Affiliation(s)
- Tihana Vondrak
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Laura Ávila Robledillo
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
  - Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Petr Novák
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
- Andrea Koblížková
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
- Pavel Neumann
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
- Jiří Macas
  - Biology Centre, Czech Academy of Sciences, Branišovská 31, České Budějovice, CZ-37005, Czech Republic
6
Costa MO, Silva R, Anselmo DHAL, Silva JRP. Analysis of human DNA through power-law statistics. Phys Rev E 2019; 99:022112. [PMID: 30934358 DOI: 10.1103/physreve.99.022112] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/21/2018] [Indexed: 11/07/2022]
Abstract
We report an analysis of Homo sapiens DNA through the formalism of κ statistics, which encompasses power-law correlations and provides an optimization principle that permits us to model distinct physical systems; i.e., the power-law distribution of the length of DNA bases is calculated from a general model which follows arguments similar to those proposed in Maxwell's deduction of statistical distributions. The viability of the model is tested using a data set from a catalog of proteins collected from the Ensembl Project. The results indicate that the short-range correlations, always present in coding DNA sequences, are appropriately captured through the Kaniadakis power-law distribution, adequately describing the cumulative length distribution of DNA bases, in contrast with the case of the traditional exponential statistical model.
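For orientation, the κ-exponential at the core of Kaniadakis statistics is usually written as exp_κ(x) = (sqrt(1 + κ²x²) + κx)^(1/κ). It reduces to the ordinary exponential as κ → 0 and has a power-law tail for large negative arguments. The sketch below only illustrates that function. The names `kappa_exp` and `kappa_survival`, and the rate parameter `beta`, are assumptions made for this example and are not taken from the paper.

```python
import math


def kappa_exp(x: float, kappa: float) -> float:
    """Kaniadakis kappa-exponential; equals exp(x) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(x)
    if x < 0.0:
        # exp_kappa(-x) = 1 / exp_kappa(x); avoids cancellation for large |x|.
        return 1.0 / kappa_exp(-x, kappa)
    return (math.sqrt(1.0 + (kappa * x) ** 2) + kappa * x) ** (1.0 / kappa)


def kappa_survival(length: float, beta: float, kappa: float) -> float:
    """Hypothetical cumulative (survival) form exp_kappa(-beta * length)."""
    return kappa_exp(-beta * length, kappa)
```

With κ > 0 the survival curve decays far more slowly than exp(-βL) at large lengths, which is the qualitative contrast with the exponential model that the abstract describes.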
Affiliation(s)
- M O Costa
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Mossoró, 59610-210, Brazil
- R Silva
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Mossoró, 59610-210, Brazil; Universidade Federal do Rio Grande do Norte, Departamento de Física, Natal-RN, 59072-970, Brazil
- D H A L Anselmo
  - Universidade Federal do Rio Grande do Norte, Departamento de Física, Natal-RN, 59072-970, Brazil
- J R P Silva
  - Departamento de Física, Universidade do Estado do Rio Grande do Norte, Mossoró, 59610-210, Brazil
7
Li W, Freudenberg J, Freudenberg J. Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome. Gene 2019; 691:141-152. [PMID: 30630097 DOI: 10.1016/j.gene.2018.12.040] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 09/05/2018] [Revised: 12/07/2018] [Accepted: 12/14/2018] [Indexed: 10/27/2022]
Abstract
The nuclear human genome harbors sequences of mitochondrial origin, indicating an ancestral transfer of DNA from the mitogenome. Several Nuclear Mitochondrial Segments (NUMTs) have been detected by alignment-based sequence similarity search, as implemented in the Basic Local Alignment Search Tool (BLAST). Identifying NUMTs is important for the comprehensive annotation and understanding of the human genome. Here we explore the possibility of detecting NUMTs in the human genome by alignment-free sequence similarity search, such as k-mers (k-tuples, k-grams, oligos of length k) distributions. We find that when k=6 or larger, the k-mer approach and BLAST search produce almost identical results, e.g., detect the same set of NUMTs longer than 3 kb. However, when k=5 or k=4, certain signals are only detected by the alignment-free approach, and these may indicate yet unrecognized, and potentially more ancestral NUMTs. We introduce a "Manhattan plot" style representation of NUMT predictions across the genome, which are calculated based on the reciprocal of the Jensen-Shannon divergence between the nuclear and mitochondrial k-mer frequencies. The further inspection of the k-mer-based NUMT predictions however shows that most of them contain long-terminal-repeat (LTR) annotations, whereas BLAST-based NUMT predictions do not. Thus, similarity of the mitogenome to LTR sequences is recognized, which we validate by finding the mitochondrial k-mer distribution closer to those for transposable sequences and specifically, close to some types of LTR.
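The scoring idea in this abstract (the reciprocal of the Jensen-Shannon divergence between k-mer frequency distributions) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the windowing, and the small `eps` that guards against division by zero are all assumptions.

```python
import math
from collections import Counter


def kmer_freqs(seq: str, k: int) -> dict[str, float]:
    """Relative frequencies of overlapping k-mers in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than k
        return {}
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon(p: dict[str, float], q: dict[str, float]) -> float:
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint support)."""
    jsd = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = 0.5 * (pi + qi)
        if pi > 0.0:
            jsd += 0.5 * pi * math.log2(pi / mi)
        if qi > 0.0:
            jsd += 0.5 * qi * math.log2(qi / mi)
    return jsd


def similarity_score(window: str, reference: str, k: int, eps: float = 1e-12) -> float:
    """Hypothetical score: reciprocal of the JSD between k-mer profiles."""
    jsd = jensen_shannon(kmer_freqs(window, k), kmer_freqs(reference, k))
    return 1.0 / (jsd + eps)
```

Sliding `similarity_score` along a chromosome with the mitochondrial genome as `reference` would give the kind of per-window signal that a Manhattan-style plot displays.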
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA.
- Jerome Freudenberg
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA
- Jan Freudenberg
  - Regeneron Genetics Center, Regeneron Pharmaceuticals, Inc., Tarrytown, NY, USA
8
Cristadoro G, Degli Esposti M, Altmann EG. The common origin of symmetry and structure in genetic sequences. Sci Rep 2018; 8:15817. [PMID: 30361485 PMCID: PMC6202410 DOI: 10.1038/s41598-018-34136-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/17/2018] [Accepted: 10/09/2018] [Indexed: 12/20/2022]
Abstract
Biologists have long sought a way to explain how statistical properties of genetic sequences emerged and are maintained through evolution. On the one hand, non-random structures at different scales indicate a complex genome organisation. On the other hand, single-strand symmetry has been scrutinised using neutral models in which correlations are not considered or irrelevant, contrary to empirical evidence. Different studies investigated these two statistical features separately, reaching minimal consensus despite sustained efforts. Here we unravel previously unknown symmetries in genetic sequences, which are organized hierarchically through scales in which non-random structures are known to be present. These observations are confirmed through the statistical analysis of the human genome and explained through a simple domain model. These results suggest that domain models which account for the cumulative action of mobile elements can explain simultaneously non-random structures and symmetries in genetic sequences.
Affiliation(s)
- Giampaolo Cristadoro
  - Dipartimento di Matematica e Applicazioni, Università di Milano-Bicocca, 20125, Milano, Italy.
- Eduardo G Altmann
  - School of Mathematics and Statistics, University of Sydney, Sydney, 2006, NSW, Australia
9
Das L, Nanda S, Das JK. An integrated approach for identification of exon locations using recursive Gauss Newton tuned adaptive Kaiser window. Genomics 2018; 111:284-296. [PMID: 30342085 DOI: 10.1016/j.ygeno.2018.10.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 02/01/2018] [Revised: 09/11/2018] [Accepted: 10/11/2018] [Indexed: 11/27/2022]
Abstract
Identification of exon locations in a DNA sequence has been considered the most demanding and challenging research topic in the field of bioinformatics. This work proposes a robust approach combining trigonometric mapping with an adaptively tuned Kaiser windowing approach for locating the protein-coding regions (exons) in a genetic sequence. For better convergence as well as improved accuracy, the side lobe height control parameter (β) of the Kaiser window in the proposed algorithm is made adaptive to track the changing dynamics of the genetic sequence. This improves the tracking ability of the proposed adaptive Kaiser algorithm: it uses recursive Gauss-Newton tuning, which in turn uses the covariance of the error signal to tune the β factor, as shown through numerous simulation results under a variety of practical test conditions. A detailed comparative analysis with the existing mapping schemes, windowing techniques, and other signal processing methods like SVD, AN, DFT, STDFT, WT, and ST has also been included in the paper to highlight the strength and efficiency of the proposed approach. Moreover, some critical performance parameters have been computed using the proposed approach to investigate the effectiveness and robustness of the algorithm. In addition, the proposed approach has been successfully applied to a number of benchmark gene sets such as Mus musculus, Homo sapiens, and C. elegans, where it predicted exon locations efficiently in contrast to the other existing mapping methods.
Affiliation(s)
- Lopamudra Das
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India.
- Sarita Nanda
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India.
- J K Das
  - School of Electronics Engineering, KIIT University, Bhubaneswar, India.
10
Li W, Thanos D, Provata A. Quantifying local randomness in human DNA and RNA sequences using Erdös motifs. J Theor Biol 2018; 461:41-50. [PMID: 30336158 DOI: 10.1016/j.jtbi.2018.09.031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2018] [Revised: 08/14/2018] [Accepted: 09/25/2018] [Indexed: 10/28/2022]
Abstract
In 1932, Paul Erdös asked whether a random walk constructed from a binary sequence can achieve the lowest possible deviation (lowest discrepancy), for the sequence itself and for all its subsequences formed by homogeneous arithmetic progressions. Although maintaining low discrepancy is impossible for infinite sequences, as recently proven by Terence Tao, attempts were made to construct such sequences with finite lengths. We recognize that such constructed sequences (we call these "Erdös sequences") exhibit certain hallmarks of randomness at the local level: they show roughly equal frequencies of short subsequences, and at the same time exclude trivial periodic patterns. For the human DNA we examine the frequency of a set of Erdös motifs of length 10 using three nucleotide-to-binary mappings. The particular length-10 Erdös sequence is derived from the length-11 Mathias sequence and is identical with the first 10 digits of the Thue-Morse sequence, underscoring the fact that both are deficient in periodicities. Our calculations indicate that: (1) the purine (A and G)/pyrimidine (C and T) based Erdös motifs are greatly underrepresented in the human genome, (2) the strong (G and C)/weak (A and T) based Erdös motifs are slightly overrepresented, (3) the densities of the two are negatively correlated, (4) the Erdös motifs based on all three mappings combined are slightly underrepresented, and (5) the strong/weak based Erdös motifs are greatly overrepresented in the human messenger RNA sequences.
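As a rough illustration of the motif search described here, the sketch below generates the first digits of the Thue-Morse sequence and counts occurrences of a binary motif in DNA under a nucleotide-to-binary mapping. Which class maps to 1 or 0 is an assumption made for this example, as are the names `thue_morse`, `MAPPINGS` and `count_motif`. Only the two mappings named in the abstract are included.

```python
def thue_morse(n: int) -> str:
    """First n digits of the Thue-Morse sequence (parity of set bits of i)."""
    return "".join(str(bin(i).count("1") % 2) for i in range(n))


# Assumed encodings: the choice of which class is "1" is arbitrary here.
MAPPINGS = {
    "purine_pyrimidine": {"A": "1", "G": "1", "C": "0", "T": "0"},
    "strong_weak": {"G": "1", "C": "1", "A": "0", "T": "0"},
}


def count_motif(dna: str, motif: str, mapping: dict[str, str]) -> int:
    """Count (possibly overlapping) occurrences of a binary motif in dna."""
    # Unknown symbols such as N become "N" and therefore never match.
    binary = "".join(mapping.get(base, "N") for base in dna.upper())
    return sum(
        binary.startswith(motif, i)
        for i in range(len(binary) - len(motif) + 1)
    )
```

For example, `thue_morse(10)` gives `"0110100110"`, and `count_motif(genome, thue_morse(10), MAPPINGS["purine_pyrimidine"])` would count one orientation of the length-10 motif; a fuller analysis would presumably also count the complemented motif.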
Affiliation(s)
- Wentian Li
  - The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, Northwell Health, Manhasset, NY, USA.
- Dimitrios Thanos
  - Department of Mathematics, National and Kapodistrian University of Athens, Athens GR-15784, Greece; Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", Athens GR-15341, Greece
- Astero Provata
  - Institute of Nanoscience and Nanotechnology, National Center for Scientific Research "Demokritos", Athens GR-15341, Greece
11
Zhao J, Wang J, Jiang H. Detecting Periodicities in Eukaryotic Genomes by Ramanujan Fourier Transform. J Comput Biol 2018; 25:963-975. [PMID: 29963923 DOI: 10.1089/cmb.2017.0252] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022]
Abstract
The Ramanujan Fourier transform (RFT) is becoming a popular signal processing method. In this article, RFT is used to detect periodicities in exons and introns of eukaryotic genomes. Genomic sequences of nine species were analyzed. The highest peak in the spectrum amplitude corresponding to each exon or intron is regarded as the significant signal. Accordingly, the periodicity corresponding to the significant signal can also be regarded as a valuable periodicity. Exons and introns have different periodic phenomena. The computational results reveal that the 2-, 3-, 4-, and 6-base periodicities of exons and introns are four kinds of important periodicities based on RFT. This is the first time that the 2-base periodicity of introns has been discovered through a signal processing method. The frequencies of occurrence of the 2-base periodicity and the 3-base periodicity are polar opposites between the exons and the introns. With regard to the cyclicality of the Ramanujan sums, which are the base functions of the transformation, RFT is suggested for studying the periodic features of dinucleotides, trinucleotides, and q nucleotides.
Affiliation(s)
- Jian Zhao
  - Department of Mathematics, Nanjing Tech University, Nanjing, China; Department of Statistics, Northwestern University, Evanston, Illinois
- Jiasong Wang
  - Department of Mathematics, Nanjing University, Nanjing, China
- Hongmei Jiang
  - Department of Statistics, Northwestern University, Evanston, Illinois
12
Hota MK, Srivastava VK. A multirate DSP structure for the identification of protein-coding regions. Int J Biomath 2017. [DOI: 10.1142/s1793524517501121] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
The identification of protein-coding regions in DNA sequences using digital signal processing methods is one of the central issues in bioinformatics. In this paper, a multirate structure is proposed for the identification of protein-coding regions whose input sampling rate is the same as its output sampling rate. The multirate structure consists of a cascade combination of a decimation filter, a kernel filter and an interpolation filter. The decimation filter is a complex filter, the kernel filter is an FIR lowpass filter and the interpolation filter is a moving average filter. Polyphase decomposition is applied to both the decimation filter and the interpolation filter for computationally efficient implementation. The potential of the proposed method is evaluated in comparison with existing methods using standard datasets. The results show that the proposed method improves the identification accuracy of protein-coding regions to a great extent compared to its counterparts.
Affiliation(s)
- Malaya Kumar Hota
  - School of Electronics Engineering, VIT University, Vellore 632014, Tamilnadu, India
- Vinay Kumar Srivastava
  - Department of Electronics and Communication Engineering, Motilal Nehru National Institute of Technology, Allahabad 211004, Uttar Pradesh, India
13
George TP, Thomas T. Exon Mapping in Long Noncoding RNAs Using Digital Filters. Genomics Insights 2017; 10:1178631017732029. [PMID: 28989280 PMCID: PMC5624354 DOI: 10.1177/1178631017732029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2017] [Accepted: 08/18/2017] [Indexed: 11/16/2022]
Abstract
Long noncoding RNAs (lncRNAs) which were initially dismissed as "transcriptional noise" have become a vital area of study after their roles in biological regulation were discovered. Long noncoding RNAs have been implicated in various developmental processes and diseases. Here, we perform exon mapping of human lncRNA sequences (taken from National Center for Biotechnology Information GenBank) using digital filters. Antinotch digital filters are used to map out the exons of the lncRNA sequences analyzed. The period 3 property which is an established indicator for locating exons in genes is used here. Discrete wavelet transform filter bank is used to fine-tune the exon plots by selectively removing the spectral noise. The exon locations conform to the ranges specified in GenBank. In addition to exon prediction, G-C concentrations of lncRNA sequences are found, and the sequences are searched for START and STOP codons as these are indicators of coding potential.
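The period-3 property mentioned here can be illustrated with a plain discrete Fourier sum evaluated at frequency 1/3 on the four base-indicator sequences. Note that this is a simpler DFT-based stand-in, not the antinotch or wavelet filters the paper uses, and the name `period3_power` and the normalization by sequence length are assumptions made for this sketch.

```python
import cmath


def period3_power(seq: str) -> float:
    """Spectral power at frequency 1/3, summed over A, C, G, T indicators."""
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        # Fourier sum of the 0/1 indicator sequence of this base.
        s = sum(
            cmath.exp(-2j * cmath.pi * i / 3)
            for i, b in enumerate(seq.upper())
            if b == base
        )
        total += abs(s) ** 2
    return total / n
```

Evaluating `period3_power` over a sliding window along a sequence gives a crude exon-likelihood track: windows with strong codon periodicity score high, while windows without a 3-base rhythm score near zero.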
Affiliation(s)
- Tina P George
  - Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
- Tessamma Thomas
  - Department of Electronics, Cochin University of Science and Technology (CUSAT), Kochi, India
14
Serrano-Solís V, Cocho G, José MV. Genomic signatures in viral sequences by in-frame and out-frame mutual information. J Theor Biol 2016; 403:1-9. [PMID: 27178876 DOI: 10.1016/j.jtbi.2016.05.014] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/24/2015] [Revised: 04/25/2016] [Accepted: 05/03/2016] [Indexed: 11/28/2022]
Abstract
In order to understand the unique biology of viruses, we use the Mutual Information Function (MIF) to characterize 792 viral sequences comprising 458 viral whole genomes. A 3-base periodicity (3-bp) was observed only in DNA viruses, whereas RNA viruses showed irregular patterns. The correlation of MIF values at frequencies of 3 bp (in-frame) with frequencies of 4 and 5 bp (out-frame) turned out to be useful to distinguish viruses according to their respective taxonomic order, and whether they pertain to any of the three different kingdoms, Eubacteria, Archaea and Eukarya. The clustering of viruses was carried out by the use of a new statistic, namely, the pair of in- and out-frame values of the MIF. The clustering thus obtained turned out to be entirely consistent with the current viral taxonomy. As a result we were able to compare in a single plot both viral and cellular genomes, unlike any given phylogenetic reconstruction.
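A minimal sketch of a mutual information function at separation k, in the standard form I(k) = Σ p_ab(k) log2[p_ab(k) / (p_a p_b)], is given below. The function name and the choice to estimate the marginals from the same symbol pairs are assumptions for this example, and the paper's exact estimator may differ.

```python
import math
from collections import Counter


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (bits) between symbols separated by k positions."""
    pairs = Counter(zip(seq, seq[k:]))
    n = sum(pairs.values())
    if n == 0:
        return 0.0
    left, right = Counter(), Counter()
    for (a, b), c in pairs.items():
        left[a] += c   # marginal of the first symbol of each pair
        right[b] += c  # marginal of the second symbol of each pair
    # p_ab / (p_a * p_b) simplifies to c * n / (left[a] * right[b]).
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in pairs.items()
    )
```

An in-frame/out-frame pair in the spirit of the abstract would then be something like `(mutual_information(genome, 3), mutual_information(genome, 4))`, which can be plotted as one point per genome.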
Affiliation(s)
- Germinal Cocho
  - Instituto de Física, Universidad Nacional Autónoma de México (IFUNAM), Mexico.
- Marco V José
  - Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, México D.F. 04510, Mexico.
15
Gouveia S, Scotto MG, Weiß CH, Ferreira PJSG. Binary auto-regressive geometric modelling in a DNA context. J R Stat Soc Ser C Appl Stat 2016. [DOI: 10.1111/rssc.12172] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
16
Paci G, Cristadoro G, Monti B, Lenci M, Degli Esposti M, Castellani GC, Remondini D. Characterization of DNA methylation as a function of biological complexity via dinucleotide inter-distances. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2016; 374:rsta.2015.0227. [PMID: 26857665 DOI: 10.1098/rsta.2015.0227] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Accepted: 11/23/2015] [Indexed: 06/05/2023]
Abstract
We perform a statistical study of the distances between successive occurrences of a given dinucleotide in the DNA sequence for a number of organisms of different complexity. Our analysis highlights peculiar features of the CG dinucleotide distribution in mammalian DNA, pointing towards a connection with the role of such dinucleotide in DNA methylation. While the CG distributions of mammals exhibit exponential tails with comparable parameters, the picture for the other organisms studied (e.g. fish, insects, bacteria and viruses) is more heterogeneous, possibly because in these organisms DNA methylation has different functional roles. Our analysis suggests that the distribution of the distances between CG dinucleotides provides useful insights into characterizing and classifying organisms in terms of methylation functionalities.
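The basic quantity studied here, the distance between successive occurrences of a dinucleotide, can be computed along the following lines. This is only a sketch: the function names are hypothetical, and the paper's treatment of strands, masking and tail fitting is not reproduced.

```python
from collections import Counter


def inter_distances(seq: str, dinuc: str = "CG") -> list[int]:
    """Distances between start positions of successive dinucleotide hits."""
    s = seq.upper()
    positions = [i for i in range(len(s) - 1) if s[i:i + 2] == dinuc]
    return [b - a for a, b in zip(positions, positions[1:])]


def distance_distribution(distances: list[int]) -> dict[int, float]:
    """Empirical probability of each observed inter-distance."""
    counts = Counter(distances)
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}
```

Plotting `distance_distribution(inter_distances(genome))` on a semi-logarithmic scale is one way to inspect whether the tail looks exponential, as the abstract reports for mammalian CG distances.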
Collapse
Affiliation(s)
- Giulia Paci
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Giampaolo Cristadoro
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
| | - Barbara Monti
- Department of Pharmacy and Biotechnology, University of Bologna, Via S. Donato 15, Bologna 40127, Italy
| | - Marco Lenci
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Mirko Degli Esposti
- Department of Mathematics, University of Bologna, Piazza di Porta S. Donato 5, Bologna 40126, Italy
| | - Gastone C Castellani
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
| | - Daniel Remondini
- Department of Physics and Astronomy, University of Bologna, Viale B. Pichat 6/2, Bologna 40127, Italy Bologna Unit, INFN, Viale B. Pichat 6/2, Bologna 40127, Italy
|
17
|
Zhao L, Cao D, Gao Z, Mi B, Huang W. Label-Free DNA Sensors Based on Field-Effect Transistors with Semiconductor of Carbon Materials. CHINESE J CHEM 2015. [DOI: 10.1002/cjoc.201500254] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
18
|
Colliva A, Pellegrini R, Testori A, Caselle M. Ising-model description of long-range correlations in DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2015; 91:052703. [PMID: 26066195 DOI: 10.1103/physreve.91.052703] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/29/2014] [Indexed: 06/04/2023]
Abstract
We model long-range correlations of nucleotides in the human DNA sequence using the long-range one-dimensional (1D) Ising model. We show that, for distances between 10^3 and 10^6 bp, the correlations show a universal behavior and may be described by the non-mean-field limit of the long-range 1D Ising model. This allows us to make some testable hypotheses on the nature of the interaction between distant portions of the DNA chain which led to the DNA structure that we observe today in higher eukaryotes.
Affiliation(s)
- A Colliva
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
| | - R Pellegrini
- Physics Department, Swansea University, Singleton Park, Swansea SA2 8PP, UK
| | - A Testori
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
| | - M Caselle
- Dipartimento di Fisica dell'Università di Torino and I.N.F.N. sez. di Torino, Via Pietro Giuria 1, I-10125 Torino, Italy
|
19
|
Bacterial genomes lacking long-range correlations may not be modeled by low-order Markov chains: The role of mixing statistics and frame shift of neighboring genes. Comput Biol Chem 2014; 53 Pt A:15-25. [DOI: 10.1016/j.compbiolchem.2014.08.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
20
|
Abstract
Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of 10^3 bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies a diminishing return in converting unmappable regions/reads to become mappable with the increase of the read length, with the understanding that increasing read length will always move toward the direction of 100% mappability.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System Manhasset, NY, USA
| | - Jan Freudenberg
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System Manhasset, NY, USA
|
21
|
Suvorova YM, Korotkova MA, Korotkov EV. Comparative analysis of periodicity search methods in DNA sequences. Comput Biol Chem 2014; 53 Pt A:43-8. [PMID: 25218218 DOI: 10.1016/j.compbiolchem.2014.08.008] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/11/2014] [Indexed: 11/30/2022]
Abstract
To determine the periodicity of a DNA sequence, different spectral approaches are applied: discrete Fourier transform (DFT), autocorrelation (CORR), information decomposition (ID), hybrid method (HYB), concept of spectral envelope for spectral analysis (SE), normalized autocorrelation (CORR_N) and profile analysis (PA). In this work, we investigated how the possibility of finding the true period length with these methods depends on the average number of accumulated changes in DNA bases (PM). The results show that for periods of short length (≤4 bp), it is possible to use the hybrid method (HYB), which combines properties of autocorrelation, Fourier transform, and information decomposition (ID). For larger period lengths (>4 bp) with values of point mutation (PM) equal to 1.0 or more per nucleotide, it is preferable to use the information decomposition method (ID), as the other spectral approaches cannot correctly determine the period length present in the analyzed sequence.
Affiliation(s)
- Yulia M Suvorova
- Centre of Bioengineering Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, Moscow 117312, Russian Federation.
| | - Maria A Korotkova
- National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Kashirskoe Shosse, 31, Moscow 115522, Russian Federation.
| | - Eugene V Korotkov
- Centre of Bioengineering Russian Academy of Sciences, Prospect 60-tya Oktyabrya 7/1, Moscow 117312, Russian Federation; National Research Nuclear University MEPhI (Moscow Engineering Physics Institute), Kashirskoe Shosse, 31, Moscow 115522, Russian Federation.
|
22
|
Suvorova YM, Korotkova MA, Korotkov EV. Study of the Paired Change Points in Bacterial Genes. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:955-964. [PMID: 26356866 DOI: 10.1109/tcbb.2014.2321154] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
It is known that nucleotide sequences are not totally homogeneous and that this heterogeneity cannot be due to random fluctuations alone. Such heterogeneity poses the problem of segmenting a sequence into a set of homogeneous parts divided by points called "change points". In this work we investigated a special case of change points: paired change points (PCP). We used a well-known property of coding sequences, triplet periodicity (TP). The sequences that we are especially interested in consist of three successive parts: the first and the last parts have similar TP while the middle part has a different TP type. We aimed to find the genes with PCP and provide an explanation for this phenomenon. We developed a mathematical method for PCP detection based on a new measure of similarity between TP matrices. We investigated 66,936 bacterial genes from 17 bacterial genomes and revealed 2,700 genes with PCP and 6,459 genes with a single change point (SCP). We developed a mathematical approach to visualize the PCP cases. We suppose that PCP could be associated with double fusion or insertion events. The results of investigating the sequences with artificial insertions/fusions and the distribution of TP inside the genome support the idea that the real number of genes formed by insertion/fusion events could be 5-7 times greater than the number of genes revealed in the present work.
|
23
|
Variation and constraints in species-specific promoter sequences. J Theor Biol 2014; 363:357-66. [PMID: 25149367 DOI: 10.1016/j.jtbi.2014.08.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2014] [Revised: 07/30/2014] [Accepted: 08/04/2014] [Indexed: 11/24/2022]
Abstract
A vast literature is nowadays devoted to the search for correlations between transcription-related functions and the composition of sequences upstream of the Transcription Start Site. Little is known about the possible functional effects of nucleotide distributions on the conformational landscape of DNA in such regions. We have used suitable statistical indicators for identifying sequences that may play an important role in regulating transcription processes. In particular, we have analyzed base composition, periodicity and information content in sets of aligned promoters clustered according to functional information in order to obtain an insight into the main structural differences between promoters regulating genes with different functions. Our results show that when we select promoters according to some biological information, in a single species, at least in vertebrates, we observe structurally different classes of sequences. The highly variable and differentiated gene expression patterns may explain the great extent of structural differentiation observed in complex organisms. In fact, although our analysis is focused on Homo sapiens, we also provide a comparison with other species, selected at different positions in the phylogenetic tree.
|
24
|
Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014; 15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open
Abstract
Information theory (IT) addresses the analysis of communication systems and has been widely applied in molecular biology. In particular, alignment-free sequence analysis and comparison greatly benefited from concepts derived from IT, such as entropy and mutual information. This review covers several aspects of IT applications, ranging from genome global analysis and comparison, including block-entropy estimation and resolution-free metrics based on iterative maps, to local analysis, comprising the classification of motifs, prediction of transcription factor binding sites and sequence characterization based on linguistic complexity and entropic profiles. IT has also been applied to high-level correlations that combine DNA, RNA or protein features with sequence-independent properties, such as gene mapping and phenotype analysis, and has also provided models based on communication systems theory to describe information transmission channels at the cell level and also during evolutionary processes. While not exhaustive, this review attempts to categorize existing methods and to indicate their relation with broader transversal topics such as genomic signatures, data compression and complexity, time series analysis and phylogenetic classification, providing a resource for future developments in this promising area.
Affiliation(s)
- Susana Vinga
- IDMEC, Instituto Superior Técnico - Universidade de Lisboa (IST-UL), Av. Rovisco Pais, 1049-001 Lisboa, Portugal. Tel.: +351-218419504; Fax: +351-218498097;
|
25
|
Li W, Freudenberg J, Miramontes P. Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome. BMC Bioinformatics 2014; 15:2. [PMID: 24386976 PMCID: PMC3927684 DOI: 10.1186/1471-2105-15-2] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2013] [Accepted: 12/17/2013] [Indexed: 11/10/2022] Open
Abstract
Background: The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a longer read is more likely to be uniquely mapped to the reference genome, a quantitative analysis of the influence of read lengths on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 bp to 1000 bp. Results: We observe that the proportion of non-singleton k-mers decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different ranges of k. A slower decay at greater values of k indicates more limited gains in mappability for read lengths between 200 bp and 1000 bp. The frequency distributions of k-mers exhibit long tails with a power-law-like trend, and rank frequency plots exhibit a concave Zipf’s curve. The most frequent 1000-mers comprise 172 regions, which include four large stretches on chromosomes 1 and X, containing genes of biomedical relevance. Comparison with other databases indicates that the 172 regions can be broadly classified into two types: those containing LINE transposable elements and those containing segmental duplications. Conclusion: Read mappability as measured by the proportion of singletons increases steadily up to the length scale around 200 bp. When read length increases above 200 bp, smaller gains in mappability are expected. Moreover, the proportion of non-singletons decreases with read length much more slowly than linearly. Even a read length of 1000 bp would not allow the unique alignment of reads for many coding regions of human genes. A mix of techniques will be needed for efficiently producing high-quality data that cover the complete human genome.
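As a rough illustration of the quantity discussed in this abstract, the sketch below counts k-mers on a toy string and reports the fraction of k-mer positions whose k-mer is not a singleton. The function name is hypothetical. Normalising by positions and using the forward strand only are simplifying assumptions that may differ from the paper's definition.

```python
from collections import Counter


def non_singleton_fraction(seq, k):
    """Fraction of k-mer positions in `seq` whose k-mer occurs more than once."""
    n = len(seq) - k + 1
    if n <= 0:
        raise ValueError("k must not exceed the sequence length")
    counts = Counter(seq[i:i + k] for i in range(n))
    # Every occurrence of a repeated k-mer is a position that cannot be
    # placed uniquely by an exact-match read of length k.
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / n


toy = "ACGTACGTTT"  # toy sequence, not real genomic data
for k in (2, 4, 8):
    print(k, non_singleton_fraction(toy, k))
```

On this toy string the fraction falls as k grows, which mirrors qualitatively the trend the abstract reports for the human reference genome.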
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, USA.
|
26
|
Abstract
The power spectra of the nucleotides in the coding and noncoding sequences of the complete genomes of twenty-two archaea and bacteria are obtained. According to the intensities at the periodicity of 3 bp in the spectra, it is observed that the genomic sequences may be classified into three types. Moreover, the spectra generally have a small but broad peak in the 10–11 bp periodicities. For the archaea, the peak tends to be located at about 10 bp periodicity, while for the bacteria, it tends to be located at about 11 bp. These features suggest that the DNA sequences of archaea generally have a tighter double helical structure than those of bacteria in order to cope with harsh environmental conditions. Besides, among the archaea, A. pernix K1 is found to have the largest periodicity of about 11 bp, but has a comparatively high CG content in its genome and hence a high denaturation temperature.
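The spectral intensity at a given periodicity can be sketched as follows. This assumes the common binary-indicator construction (one 0/1 sequence per nucleotide, with the four powers summed), which the abstract does not specify. The function names and the normalisation by sequence length are likewise illustrative choices.

```python
import cmath


def indicator_power(seq, base, period):
    """Power of the 0/1 indicator sequence of `base` at frequency 1/period."""
    n = len(seq)
    total = sum(
        cmath.exp(-2j * cmath.pi * i / period)
        for i, s in enumerate(seq.upper())
        if s == base
    )
    return abs(total) ** 2 / n


def total_power(seq, period=3):
    """Sum the indicator powers of the four nucleotides at one periodicity."""
    return sum(indicator_power(seq, b, period) for b in "ACGT")


# Toy inputs: a perfectly 3-periodic string and a homogeneous one.
print(total_power("ATG" * 10, period=3))
print(total_power("A" * 30, period=3))
```

The periodic toy string gives a power of about 10 at period 3, while the homogeneous string gives a value that is zero up to rounding error. Scanning `period` over a range of values would trace out a spectrum in which a broad 10–11 bp feature could be looked for.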
Affiliation(s)
- SU-LONG NYEO
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
| | - I-CHING YANG
- Department of Natural Science Education, National Taitung Teachers College, Taitung, Taiwan 950, R.O.C
| | - CHI-HAO WU
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
|
27
|
Abstract
The distributions of codons in the DNA sequence of Escherichia coli K-12 are studied by using several statistical methods of analysis. Codons corresponding to the amino acids leucine, alanine and isoleucine are considered. The pair distributions of the codons as a function of the pair separation are evaluated and are seen to decay exponentially. The exponential decay constants have a linear relation with the numbers of the codons, indicating that the codons are randomly distributed in the sequence. The pair correlation and power spectral methods also show similar statistical behavior of codons in the sequence, with the exception that there appear very small peaks about the frequency f=0.286 in the power spectra of the amino acids leucine, alanine and isoleucine. Such a frequency reflects a periodicity of about 3.5 amino acids and a general helical structure of the proteins of the bacterium.
Affiliation(s)
- SU-LONG NYEO
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, Republic of China
| | - I-CHING YANG
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, Republic of China
|
28
|
Koester B, Rea TJ, Templeton AR, Szalay AS, Sing CF. Long-range autocorrelations of CpG islands in the human genome. PLoS One 2012; 7:e29889. [PMID: 22253817 PMCID: PMC3256200 DOI: 10.1371/journal.pone.0029889] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2011] [Accepted: 12/07/2011] [Indexed: 01/24/2023] Open
Abstract
In this paper, we use a statistical estimator developed in astrophysics to study the distribution and organization of features of the human genome. Using the human reference sequence we quantify the global distribution of CpG islands (CGI) in each chromosome and demonstrate that the organization of the CGI across a chromosome is non-random, exhibits surprisingly long range correlations (10 Mb) and varies significantly among chromosomes. These correlations of CGI summarize functional properties of the genome that are not captured when considering variation in any particular separate (and local) feature. The demonstration of the proposed methods to quantify the organization of CGI in the human genome forms the basis of future studies. The most illuminating of these will assess the potential impact on phenotypic variation of inter-individual variation in the organization of the functional features of the genome within and among chromosomes, and among individuals for particular chromosomes.
Affiliation(s)
- Benjamin Koester
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Thomas J. Rea
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Alan R. Templeton
- Department of Biology, Washington University, St Louis, Missouri, United States of America
| | - Alexander S. Szalay
- Department of Physics and Astronomy, Center for Astrophysical Sciences, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Charles F. Sing
- Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
|
29
|
Koroteev MV, Miller J. Scale-free duplication dynamics: a model for ultraduplication. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2011; 84:061919. [PMID: 22304128 DOI: 10.1103/physreve.84.061919] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/17/2010] [Revised: 07/04/2011] [Indexed: 05/31/2023]
Abstract
Empirical studies of the genome-wide length distribution of duplicated sequences have revealed an algebraic tail common to nearly all clades. The decay of the tail is often well approximated by a single exponent that takes values within a limited range. We propose and study here scale-free duplication dynamics, a class of model for genome sequence evolution that generates the observed shapes of this distribution. A transition between self-similar and non-self-similar regimes is exhibited. Our model accounts plausibly for the observed form of the algebraic tail, which is not produced by standard models for generating long-range sequence correlations.
Affiliation(s)
- M V Koroteev
- Physics and Biology Unit, Okinawa Institute of Science and Technology Suzaki 12-22, Uruma, Okinawa 904-2234, Japan
|
30
|
On the existence of wavelet symmetries in archaea DNA. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2011; 2012:673934. [PMID: 22481976 PMCID: PMC3310297 DOI: 10.1155/2012/673934] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/13/2011] [Revised: 10/27/2011] [Accepted: 10/29/2011] [Indexed: 11/19/2022]
Abstract
This paper deals with the complex unit roots representation of archaea DNA sequences and the analysis of symmetries in the wavelet coefficients of the digitized sequence. It is shown that even for extremophile archaea, the distribution of nucleotides has to fulfill some (mathematical) constraints in such a way that the wavelet coefficients are symmetrically distributed, with respect to the nucleotides distribution.
|
31
|
Hsu TH, Nyeo SL. Simple Deviation Analysis of Two-Dimensional Viral DNA Walks. J BIOL SYST 2011. [DOI: 10.1142/s0218339003000841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
We consider the method of two-dimensional DNA walks based on three independent groups of mapping rules for 21 DNA sequences of animal and plant viruses, and for the sequences of irrational and random numbers. This method provides a visualization tool for the determination of the regional abundance of nucleotides in DNA sequences. By defining a statistical deviation and a maximum-deviation ratio for a DNA walk, we find that the maximum-deviation ratios for the 21 viral DNA sequences are generally larger than those of random-number sequences of the same length. It is shown that the viral DNA sequences generally have the smallest maximum-deviations with the same mapping group, and that a greater difference between CG and AT contents is associated with a larger maximum-deviation ratio. It is also possible to distinguish a viral DNA sequence from a random-number sequence if the sequences are longer than 2000 base pairs. Other possible applications of the two-dimensional DNA walks are mentioned.
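A two-dimensional DNA walk of the kind described here can be sketched in a few lines. The mapping in `STEPS` is a hypothetical choice made for illustration and is not one of the paper's three mapping groups. The paper's deviation measures are not reproduced either, since the abstract does not define them.

```python
# Hypothetical mapping rule: A/T move along x, C/G move along y.
STEPS = {"A": (-1, 0), "T": (1, 0), "C": (0, 1), "G": (0, -1)}


def dna_walk(seq, steps=STEPS):
    """Return the list of (x, y) points visited by the walk, starting at the origin."""
    x = y = 0
    path = [(0, 0)]
    for base in seq.upper():
        # Symbols without a rule (e.g. N) leave the walker in place.
        dx, dy = steps.get(base, (0, 0))
        x += dx
        y += dy
        path.append((x, y))
    return path


print(dna_walk("ACGT"))  # [(0, 0), (-1, 0), (-1, 1), (-1, 0), (0, 0)]
```

With this mapping the end point of the walk equals (count of T minus count of A, count of C minus count of G), so a drift away from the origin reflects a regional excess of particular nucleotides.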
Affiliation(s)
- Tai-Hsin Hsu
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
| | - Su-Long Nyeo
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, R.O.C
|
32
|
Starikov EB, Hennig D, Yamada H, Gutierrez R, Nordén B, Cuniberti G. Screw motion of DNA duplex during translocation through pore I: introduction of the coarse-grained model. ACTA ACUST UNITED AC 2011. [DOI: 10.1142/s1793048009000995] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Based upon the structural properties of DNA duplexes and their counterion-water surrounding in solution, we have introduced here a screw model which may describe translocation of DNA duplexes through artificial nanopores of the proper diameter (where the DNA counterion–hydration shell can be intact) in a qualitatively correct way. This model represents DNA as a kind of "screw," whereas the counterion-hydration shell is a kind of "nut." Mathematical conditions for stable dynamics of the DNA screw model are investigated in detail. When an electrical potential is applied across an artificial membrane with a nanopore, the "screw" and "nut" begin to move with respect to each other, so that their mutual rotation is coupled with their mutual translation. As a result, there are peaks of electrical current connected with the mutual translocation of DNA and its counterion–hydration shell, if DNA is possessed of some non-regular base-pair sequence. The calculated peaks of current strongly resemble those observed in the pertinent experiments. An analogous model could in principle be applied to DNA translocation in natural DNA–protein complexes of biological interest, where the role of "nut" would be played by protein-tailored "channels." In such cases, the DNA screw model is capable of qualitatively explaining chemical-to-mechanical energy conversion in DNA–protein molecular machines via symmetry breaking in DNA–protein friction.
Affiliation(s)
- E. B. STARIKOV
- Institute for Materials Science, Technical University of Dresden, D-01062 Dresden, Germany
- Institute for Theoretical Solid State Physics, University of Karlsruhe, Wolfgang-Gaede Str.1, D-76131 Karlsruhe, Germany
| | - D. HENNIG
- Institute for Physics, Humboldt University of Berlin, Newtonstraße 15, D-12489 Berlin, Germany
| | - H. YAMADA
- Yamada Physics Research Laboratory, Aoyama 5-7-14-205, Niigata 950-2002, Japan
| | - R. GUTIERREZ
- Institute for Materials Science, Technical University of Dresden, D-01062 Dresden, Germany
| | - B. NORDÉN
- Department of Physical Chemistry, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden
| | - G. CUNIBERTI
- Institute for Materials Science, Technical University of Dresden, D-01062 Dresden, Germany
|
33
|
Bielińska-Wąż D. Graphical and numerical representations of DNA sequences: statistical aspects of similarity. JOURNAL OF MATHEMATICAL CHEMISTRY 2011; 49:2345. [PMID: 32214591 PMCID: PMC7087963 DOI: 10.1007/s10910-011-9890-8] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/18/2011] [Accepted: 07/22/2011] [Indexed: 05/10/2023]
Abstract
New approaches aiming at a detailed similarity/dissimilarity analysis of DNA sequences are formulated. Several corrections that enrich the information which may be derived from the alignment methods are proposed. The corrections take into account the distributions along the sequences of the aligned bases (neglected in the standard alignment methods). As a consequence, different aspects of similarity, such as asymmetry of the gene structure, may be studied either using new similarity measures associated with the four-component spectral representation of the DNA sequences or using alignment methods with the corrections introduced in this paper. The corrections to the alignment methods and the statistical distribution moment-based descriptors derived from the four-component spectral representation of the DNA sequences are applied to similarity/dissimilarity studies of the β-globin gene across species. The studies are supplemented by detailed similarity studies for histone H1 and H4 coding sequences. The data are described according to the latest version of the EMBL database. The work is supplemented by a concise review of the state-of-the-art graphical representations of DNA sequences.
Affiliation(s)
- Dorota Bielińska-Wąż
- Instytut Fizyki, Uniwersytet Mikołaja Kopernika, Grudziądzka 5, 87-100 Toruń, Poland
|
34
|
Abstract
Novel methods for identifying a new type of DNA latent periodicity, called latent profile periodicity or latent profility, are used to search for periodic structures in genes. These methods reveal two distinct levels of organization of genetic information encoding. It is shown that latent profility in genes may correlate with specific structural features of their encoded proteins.
Affiliation(s)
- Maria Chaley
- Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Institutskaya st., 4, 142290 Pushchino, Russia.
|
35
|
Sánchez J. 3-base periodicity in coding DNA is affected by intercodon dinucleotides. Bioinformation 2011; 6:327-9. [PMID: 21814388 PMCID: PMC3143393 DOI: 10.6026/97320630006327] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2011] [Accepted: 07/12/2011] [Indexed: 01/29/2023] Open
Abstract
All coding DNAs exhibit 3-base periodicity (TBP), which may be defined as the tendency of nucleotides and higher order n-tuples, e.g. trinucleotides (triplets), to be preferentially spaced by 3, 6, 9, etc. bases, and we have proposed an association between TBP and clustering of same-phase triplets. We here investigated whether TBP was affected by intercodon dinucleotide tendencies and whether clustering of same-phase triplets was involved. Under a constant protein sequence, intercodon dinucleotide frequencies depend on the distribution of synonymous codons. So, possible effects were revealed by randomly exchanging synonymous codons without altering protein sequences to subsequently document changes in TBP via the frequency distribution of distances (FDD) of DNA triplets. A tripartite positive correlation was found between intercodon dinucleotide frequencies, clustering of same-phase triplets and TBP. So, intercodon C|A (where "|" indicates the boundary between codons) was more frequent in native human DNA than in the codon-shuffled sequences; higher C|A frequency occurred along with more frequent clustering of C|AN triplets (where N jointly represents A, C, G and T) and with intense CAN TBP. The opposite was found for C|G, which was less frequent in native than in shuffled sequences; lower C|G frequency occurred together with reduced clustering of C|GN triplets and with less intense CGN TBP. We hence propose that intercodon dinucleotides affect TBP via same-phase triplet clustering. A possible biological relevance of our findings is briefly discussed.
Affiliation(s)
- Joaquín Sánchez
- Facultad de Medicina, Universidad Autónoma del Estado de Morelos, Cuernavaca, 62020 México
|
36
|
Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments. PLoS One 2011; 6:e18464. [PMID: 21779315 PMCID: PMC3136455 DOI: 10.1371/journal.pone.0018464] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2010] [Accepted: 03/08/2011] [Indexed: 01/25/2023] Open
Abstract
Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential distribution expected from local effects of point mutation and selection on localised function; however, the observed distributions show substantial deviation from exponential form – they are roughly algebraic instead – suggesting a novel kind of long-distance correlation that must be non-local in origin.
|
37
|
Calistri E, Livi R, Buiatti M. Evolutionary trends of GC/AT distribution patterns in promoters. Mol Phylogenet Evol 2011; 60:228-35. [PMID: 21554969 DOI: 10.1016/j.ympev.2011.04.015] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2010] [Revised: 03/25/2011] [Accepted: 04/17/2011] [Indexed: 11/18/2022]
Abstract
Nucleotide distributions in genomes are known not to be random, showing the presence of specific motifs, long- and short-range correlations, periodicities, etc. In particular, motifs are critical for recognition by specific proteins affecting chromosome organization, transcription and DNA replication, but little is known about the possible functional effects of nucleotide distributions on the conformational landscape of DNA, putatively leading to differential selective pressures throughout evolution. Promoter sequences have a fundamental role in the regulation of gene activity, and a vast literature suggests that their conformational landscapes may be a critical factor in gene expression dynamics. On these grounds, with the aim of investigating the putative existence of phylogenetic patterns of promoter base distributions, we analyzed GC/AT ratios along the 1000-nucleotide sequences upstream of the TSS in wide sets of promoters belonging to organisms ranging from bacteria to pluricellular eukaryotes. The data obtained showed very clear phylogenetic trends in promoter sequence base distributions throughout evolution. In particular, in all cases either GC-rich or AT-rich monotone gradients were observed: the former being present in eukaryotes, the latter in bacteria along with strand biases. Moreover, within eukaryotes, GC-rich gradients increased in length from unicellular organisms to plants, to vertebrates and, within them, from ancestral to more recent species. Finally, the results are discussed with particular attention to the possible correlation between nucleotide distribution patterns, evolution, and the putative existence of differential selection pressures, deriving from structural and/or functional constraints, between and within prokaryotes and eukaryotes.
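A composition gradient of the kind described here can be illustrated with a sliding-window profile. The sketch below is hedged: it computes the GC fraction GC/(GC+AT) rather than the GC/AT ratio named in the abstract (to avoid division by zero in AT-free windows), and the window size, step, and function names are assumptions rather than the authors' settings.

```python
def gc_profile(seq, window=100, step=50):
    """GC fraction in sliding windows along an upstream sequence."""
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        # Windows with no unambiguous bases yield NaN instead of an error.
        profile.append(gc / (gc + at) if gc + at else float("nan"))
    return profile

def slope(values):
    """Least-squares slope of values against their window index (needs >= 2 values)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den
```

A positive slope over the windows leading up to the TSS would indicate a GC-rich gradient towards the TSS, a negative one an AT-rich gradient; whether a gradient is monotone would need a separate check.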
Affiliation(s)
- Elisa Calistri
- Dipartimento di Biologia Evoluzionistica, Università degli Studi di Firenze, via Romana 19, 50125 Firenze, Italy.
|
38
|
Epps J, Ying H, Huttley GA. Statistical methods for detecting periodic fragments in DNA sequence data. Biol Direct 2011; 6:21. [PMID: 21527008 PMCID: PMC3111405 DOI: 10.1186/1745-6150-6-21] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 11/03/2010] [Accepted: 04/28/2011] [Indexed: 11/10/2022] Open
Abstract
Background: Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed. Results: We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~21% and ~19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS). Conclusions: For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed that a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers. Reviewers: This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.
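As a rough illustration of the integer-period idea mentioned in this abstract, the sketch below evaluates a Fourier sum at integer periods p (frequency 1/p) for a 0/1 indicator of the {AA, TT, TA} dinucleotides. It assumes that this is what IPDFT amounts to; the mean subtraction, the function names, and the candidate period range are illustrative choices, and the blockwise bootstrap and Hybrid measure are not reproduced here.

```python
import cmath

NPS_DINUCS = {"AA", "TT", "TA"}   # dinucleotide set named in the abstract

def indicator(seq, dinucs=NPS_DINUCS):
    """1.0 where a dinucleotide from `dinucs` starts, else 0.0."""
    return [1.0 if seq[i:i + 2] in dinucs else 0.0 for i in range(len(seq) - 1)]

def ipdft_magnitude(signal, period):
    """|sum_n (x[n] - mean) * exp(-2*pi*i*n/period)| at an integer period."""
    mean = sum(signal) / len(signal)
    total = sum((x - mean) * cmath.exp(-2j * cmath.pi * n / period)
                for n, x in enumerate(signal))
    return abs(total)

def dominant_period(signal, periods):
    """Candidate period with the largest magnitude (first one wins ties)."""
    return max(periods, key=lambda p: ipdft_magnitude(signal, p))
```

Note that a sharply peaked periodic signal also produces strong responses at integer divisors of its period (harmonics), so the largest magnitude alone is only an exploratory estimate; a significance procedure such as the one discussed in the abstract would still be needed for confirmatory testing.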
Affiliation(s)
- Julien Epps
- School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia.
|
39
|
CHEN RM, HOU MT, CHANG NW, CHEN YT, TSAI JJP. CUMULATIVE SPECTRAL REPEAT FINDER (CSRF): A SPECTRAL APPROACH FOR IDENTIFYING THE LENGTH OF REPEATS IN DNA SEQUENCES. INT J ARTIF INTELL T 2011. [DOI: 10.1142/s0218213011000073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Repetitive sequences of DNA are meaningful and of great importance to human functions. Previous researchers have proposed various methods to discover repetitive sequences in DNA sequences. However, the unknown lengths of repetitive sequences are usually predicted randomly or determined by rules of thumb rather than by a systematic criterion. We propose a new algorithm based on the cumulative Fourier spectral content of a DNA sequence to identify the candidate lengths of repetitive sequences, or repeats, in DNA sequences. After the candidate lengths of repeats are known, one can identify the repeats and their copy numbers using an exact method. Both simulated and real datasets are used to illustrate the performance of the proposed algorithm. The results are also compared with two well-known methods, Spectral Repeat Finder (SRF) and the Gibbs sampler. Furthermore, we demonstrate the use of CSRF in some well-known repeat-finding methods such as SRF, the Gibbs sampler, and MEME.
Affiliation(s)
- R. M. CHEN
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- M. T. HOU
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- N. W. CHANG
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- Y. T. CHEN
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- JEFFREY J. P. TSAI
- Department of Computer Science, University of Illinois, Chicago, Chicago, IL 60607, USA
- Department of Bioinformatics, Asia University, Taichung, Taiwan 41354, Taiwan
|
40
|
Guerra JCDO, Licinio P. The role played by exons in genomic DNA sequence correlations. J Theor Biol 2010; 264:830-7. [DOI: 10.1016/j.jtbi.2010.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2009] [Revised: 02/23/2010] [Accepted: 03/02/2010] [Indexed: 10/19/2022]
|
41
|
Abstract
In this report, we compared the success rate of classification of coding sequences (CDS) vs. introns by Codon Structure Factor (CSF) and by a method that we called the Universal Feature Method (UFM). UFM is based on the scoring of purine bias (Rrr) and stop codon frequency. We show that the success rate of CDS/intron classification by UFM is higher than by CSF. UFM classifies ORFs as coding or non-coding through a score based on (i) the stop codon distribution, (ii) the product of purine probabilities in the three positions of nucleotide triplets, (iii) the product of Cytosine (C), Guanine (G), and Adenine (A) probabilities in the 1st, 2nd, and 3rd positions of triplets, respectively, (iv) the probabilities of G in the 1st and 2nd positions of triplets and (v) the distance of their GC3 vs. GC2 levels to the regression line of the universal correlation. More than 80% of CDSs (true positives) of Homo sapiens (>250 bp), Drosophila melanogaster (>250 bp) and Arabidopsis thaliana (>200 bp) are successfully classified with a false positive rate lower than or equal to 5%. The method reports coding sequences in their coding strand and coding frame, which allows their automatic translation into protein sequences with 95% confidence. The method is a natural consequence of the compositional bias of nucleotides in coding sequences.
Affiliation(s)
- Nicolas Carels
- Fundação Oswaldo Cruz (FIOCRUZ), Instituto Oswaldo Cruz (IOC), Laboratório de Genômica Funcional e Bioinformática, Rio de Janeiro, RJ, Brazil
|
42
|
Li W, Freudenberg J. Two-parameter characterization of chromosome-scale recombination rate. Genome Res 2009; 19:2300-7. [PMID: 19752285 DOI: 10.1101/gr.092676.109] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 11/25/2022]
Abstract
The genome-wide recombination rate (RR) of a species is often described by one parameter, the ratio between total genetic map length (G) and physical map length (P), measured in centimorgans per megabase (cM/Mb). The value of this parameter varies greatly between species, but the cause for these differences is not entirely clear. A constraining factor of overall RR in a species, which may cause increased RR for smaller chromosomes, is the requirement of at least one chiasma per chromosome (or chromosome arm) per meiosis. In the present study, we quantify the relative excess of recombination events on smaller chromosomes by a linear regression model, which relates the genetic length of chromosomes to their physical length. We find for several species that the two-parameter regression, G = G_0 + k × P, provides a better characterization of the relationship between genetic and physical map length than the one-parameter regression that runs through the origin. A nonzero intercept (G_0) indicates a relative excess of recombination on smaller chromosomes in a genome. Given G_0, the parameter k predicts the increase of genetic map length over the increase of physical map length. The observed values of G_0 have a similar magnitude for diverse species, whereas k varies by two orders of magnitude. The implications of this strategy for the genetic maps of human, mouse, rat, chicken, honeybee, worm, and yeast are discussed.
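The contrast between the one-parameter fit through the origin and the two-parameter fit with an intercept can be sketched with ordinary least squares. This is only an illustration of the two regression forms named in the abstract; the function names are hypothetical, and no real chromosome data or model-comparison statistic from the paper is reproduced.

```python
def fit_through_origin(P, G):
    """Least-squares slope k for the one-parameter model G = k * P."""
    return sum(p * g for p, g in zip(P, G)) / sum(p * p for p in P)

def fit_with_intercept(P, G):
    """Least-squares (G0, k) for the two-parameter model G = G0 + k * P."""
    n = len(P)
    mean_p = sum(P) / n
    mean_g = sum(G) / n
    k = (sum((p - mean_p) * (g - mean_g) for p, g in zip(P, G))
         / sum((p - mean_p) ** 2 for p in P))
    return mean_g - k * mean_p, k

def residual_sum_of_squares(P, G, g0, k):
    """Sum of squared residuals of G around the line g0 + k * P."""
    return sum((g - (g0 + k * p)) ** 2 for p, g in zip(P, G))
```

With per-chromosome physical lengths `P` (Mb) and genetic lengths `G` (cM), a clearly nonzero fitted `G0` together with a much smaller residual sum of squares than the through-origin fit would correspond to the excess of recombination on small chromosomes described above. Because the two-parameter model always fits at least as well, a formal comparison should penalize the extra parameter.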
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, New York 11030, USA.
|
43
|
Knoch TA, Göker M, Lohner R, Abuseiris A, Grosveld FG. Fine-structured multi-scaling long-range correlations in completely sequenced genomes--features, origin, and classification. EUROPEAN BIOPHYSICS JOURNAL: EBJ 2009; 38:757-79. [PMID: 19533117 PMCID: PMC2701493 DOI: 10.1007/s00249-009-0489-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 02/20/2009] [Revised: 05/05/2009] [Accepted: 05/13/2009] [Indexed: 11/26/2022]
Abstract
The sequential organization of genomes, i.e. the relations between distant base pairs and regions within sequences, and its connection to the three-dimensional organization of genomes is still a largely unresolved problem. Long-range power-law correlations were found using correlation analysis on almost the entire observable scale of 132 completely sequenced chromosomes of 0.5 × 10^6 to 3.0 × 10^7 bp from Archaea, Bacteria, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster, and Homo sapiens. The local correlation coefficients show a species-specific multi-scaling behaviour: close to random correlations on the scale of a few base pairs, a first maximum from 40 to 3,400 bp (for Arabidopsis thaliana and Drosophila melanogaster divided in two submaxima), and often a region of one or more second maxima from 10^5 to 3 × 10^5 bp. Within this multi-scaling behaviour, an additional fine-structure is present and attributable to codon usage in all except the human sequences, where it is related to nucleosomal binding. Computer-generated random sequences assuming a block organization of genomes, the codon usage, and nucleosomal binding explain these results. Mutation by sequence reshuffling destroyed all correlations. Thus, the stability of correlations seems to be evolutionarily tightly controlled and connected to the spatial genome organization, especially on large scales. In summary, genomes show a complex sequential organization related closely to their three-dimensional organization.
MESH Headings
- Algorithms
- Animals
- Arabidopsis/genetics
- Chromosomes/chemistry
- Chromosomes/genetics
- Chromosomes/ultrastructure
- Chromosomes, Fungal/chemistry
- Chromosomes, Fungal/genetics
- Chromosomes, Fungal/ultrastructure
- Chromosomes, Human/chemistry
- Chromosomes, Human/genetics
- Chromosomes, Human/ultrastructure
- Chromosomes, Plant/chemistry
- Chromosomes, Plant/genetics
- Chromosomes, Plant/ultrastructure
- Codon/chemistry
- Computer Simulation
- DNA/chemistry
- Drosophila melanogaster/genetics
- Genome
- Humans
- Models, Genetic
- Mutation
- Nucleosomes/chemistry
- Saccharomyces cerevisiae/genetics
- Schizosaccharomyces/genetics
- Sequence Analysis, DNA
Affiliation(s)
- Tobias A Knoch
- Biophysical Genomics, Cell Biology and Genetics, Erasmus Medical Center, Rotterdam, The Netherlands.
|
44
|
Chaley MB, Nazipova NN, Kutyrkin VA. Statistical methods for detecting latent periodicity patterns in biological sequences: The case of small-size samples. PATTERN RECOGNITION AND IMAGE ANALYSIS 2009. [DOI: 10.1134/s1054661809020217] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
|
45
|
Halley JD, Burden FR, Winkler DA. Stem cell decision making and critical-like exploratory networks. Stem Cell Res 2009; 2:165-77. [DOI: 10.1016/j.scr.2009.03.001] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Received: 01/13/2009] [Revised: 02/24/2009] [Accepted: 03/06/2009] [Indexed: 10/21/2022] Open
|
46
|
A hybrid technique for the periodicity characterization of genomic sequence data. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2009:924601. [PMID: 19365578 DOI: 10.1155/2009/924601] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 05/29/2008] [Revised: 10/13/2008] [Accepted: 01/21/2009] [Indexed: 11/17/2022]
Abstract
Many studies of biological sequence data have examined sequence structure in terms of periodicity, and various methods for measuring periodicity have been suggested for this purpose. This paper compares two such methods, autocorrelation and the Fourier transform, using synthetic periodic sequences, and explains the differences in periodicity estimates produced by each. A hybrid autocorrelation-integer period discrete Fourier transform is proposed that combines the advantages of both techniques. Collectively, this representation and a recently proposed variant on the discrete Fourier transform offer alternatives to the widely used autocorrelation for the periodicity characterization of sequence data. Finally, these methods are compared for various tetramers of interest in C. elegans chromosome I.
|
47
|
Application of information-theoretic tests for the analysis of DNA sequences based on Markov chain models. Comput Stat Data Anal 2009. [DOI: 10.1016/j.csda.2008.07.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 11/19/2022]
|
48
|
Varadwaj P, Purohit N, Arora B. Detection of Splice Sites Using Support Vector Machine. COMMUNICATIONS IN COMPUTER AND INFORMATION SCIENCE 2009. [DOI: 10.1007/978-3-642-03547-0_47] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 01/21/2023]
|
49
|
Paar V, Pavin N, Basar I, Rosandić M, Gluncić M, Paar N. Hierarchical structure of cascade of primary and secondary periodicities in Fourier power spectrum of alphoid higher order repeats. BMC Bioinformatics 2008; 9:466. [PMID: 18980673 PMCID: PMC2661002 DOI: 10.1186/1471-2105-9-466] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Received: 03/12/2008] [Accepted: 11/03/2008] [Indexed: 11/28/2022] Open
Abstract
Background: Identification of approximate tandem repeats is an important task of broad significance and still remains a challenging problem of computational genomics. Often there is no single best approach to periodicity detection and a combination of different methods may improve the prediction accuracy. The discrete Fourier transform (DFT) has been extensively used to study primary periodicities in DNA sequences. Here we investigate the application of the DFT method to identify and study alphoid higher order repeats. Results: We used a method based on DFT, with a mapping of the symbolic sequence into a numerical sequence, to identify and study alphoid higher order repeats (HOR). For HORs the power spectrum shows an equidistant frequency pattern, with a characteristic two-level hierarchical organization as the signature of HOR. Our case study was the 16mer HOR tandem in AC017075.8 from human chromosome 7. A very long array of equidistant peaks at multiple frequencies (more than a thousand higher harmonics) is based on the fundamental frequency of the 16mer HOR. A pronounced subset of equidistant peaks is based on multiples of the fundamental HOR frequency (multiplication factor n for nmer) and higher harmonics. In general, an nmer HOR pattern contains equidistant secondary periodicity peaks, having a pronounced subset of equidistant primary periodicity peaks. This hierarchical pattern as a signature for HOR detection is robust with respect to monomer insertions and deletions, random sequence insertions, etc. For a monomeric alphoid sequence only primary periodicity peaks are present. The 1/f^β noise and the period-three pattern are missing from power spectra in alphoid regions, in accordance with expectations. Conclusion: DFT provides a robust detection method for higher order periodicity. The easily recognizable HOR power spectrum is characterized by a hierarchical two-level equidistant pattern: higher harmonics of the fundamental HOR frequency (secondary periodicity) and a subset of pronounced peaks corresponding to constituent monomers (primary periodicity). The number of lower frequency peaks (secondary periodicity) below the frequency of the first primary periodicity peak reveals the size of the nmer HOR, i.e., the number n of monomers contained in the consensus HOR.
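The two-level peak pattern described in this abstract can be illustrated on a toy sequence. The sketch below assumes a per-base 0/1 indicator mapping and sums the squared DFT magnitudes over the four bases; the abstract does not specify the symbolic-to-numerical mapping the authors used, so this mapping, the function name, and the naive O(N^2) transform are illustrative assumptions only.

```python
import cmath

def power_spectrum(seq):
    """Sum over A, C, G, T of |DFT of the base's 0/1 indicator|^2.

    Index k of the result corresponds to frequency k / len(seq), i.e. period
    len(seq) / k. Index 0 (the DC term) is left at 0.0. A naive O(N^2) DFT is
    used, which is adequate only for short toy sequences.
    """
    n = len(seq)
    spectrum = [0.0] * (n // 2 + 1)
    for base in "ACGT":
        x = [1.0 if ch == base else 0.0 for ch in seq]
        for k in range(1, n // 2 + 1):
            coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t, v in enumerate(x))
            spectrum[k] += abs(coeff) ** 2
    return spectrum
```

For a toy "3mer HOR" built from three slightly diverged 4-base monomers, such as `"ACGTACGAACTT" * 10`, peaks appear only at multiples of the HOR fundamental frequency 1/12, and the peak at the monomer frequency 1/4 stands out among them, mirroring the secondary and primary periodicities described above.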
Affiliation(s)
- Vladimir Paar
- Faculty of Science, University of Zagreb, Bijenicka 32, Zagreb, Croatia.
|
50
|
Mena-Chalco JP, Carrer H, Zana Y, Cesar RM. Identification of protein coding regions using the modified Gabor-wavelet transform. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2008; 5:198-207. [PMID: 18451429 DOI: 10.1109/tcbb.2007.70259] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.7] [Indexed: 05/26/2023]
Abstract
An important topic in genomic sequence analysis is the identification of protein coding regions. In this context, several coding DNA model-independent methods, based on the occurrence of specific patterns of nucleotides at coding regions, have been proposed. Nonetheless, these methods have not been completely suitable due to their dependence on an empirically pre-defined window length required for a local analysis of a DNA region. We introduce a method, based on a modified Gabor-wavelet transform (MGWT), for the identification of protein coding regions. This novel transform is tuned to analyze periodic signal components and presents the advantage of being independent of the window length. We compared the performance of the MGWT with other methods using eukaryote datasets. The results show that the MGWT outperforms all assessed model-independent methods with respect to identification accuracy. These results indicate that the source of at least part of the identification errors produced by the previous methods is the fixed working scale. The new method not only avoids this source of errors, but also makes available a tool for detailed exploration of the nucleotide occurrence.
Affiliation(s)
- Jesús P Mena-Chalco
- Departamento de Ciência da Computação, Instituto de Matemática e Estatística da Universidade de São Paulo, Rua do Matão, Cidade Universitária, São Paulo, SP, Brasil.
|