1
|
Totterdell JA, Nur D, Mengersen KL. Bayesian hidden Markov models in DNA sequence segmentation using R: the case of Simian Vacuolating virus (SV40). J Stat Comput Simul 2017. [DOI: 10.1080/00949655.2017.1344666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Darfiana Nur
- School of Computer Science, Engineering and Mathematics, Flinders University, Tonsley, SA, Australia
- Kerrie L. Mengersen
- School of Mathematical Sciences, Queensland University of Technology and The Australian Research Council (ARC) Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), Brisbane, QLD, Australia
|
2
|
Nikolaou C, Almirantis Y. “Word” Preference in the Genomic Text and Genome Evolution: Different Modes of n-tuplet Usage in Coding and Noncoding Sequences. J Mol Evol 2005; 61:23-35. [PMID: 16059753 DOI: 10.1007/s00239-004-0209-2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 07/07/2004] [Accepted: 02/02/2005] [Indexed: 10/25/2022]
Abstract
Extensive work on n-tuplet occurrence in genomic sequences has revealed the correlation of their usage with sequence origin. Parallel to that, there exist different restrictions in the nucleotide composition of coding and noncoding sequences that may result in distinct modes of usage of n-tuplets. The relatively simple approaches described herein focus on such differences. They are based on simple summation measures of n-tuplet frequencies, computed after filtering the background nucleotide composition. Among the main targets of this work is to draw some conclusions on the qualitative differences in the composition of genomic sequences depending on their functionality. Moreover, an evolutionary model is formulated, including simple forms of ubiquitous events of genome dynamics: genomic fusions, genome shuffling due to transpositions, replication slippage, and point mutations. This model is shown to be able to reproduce all the statistical features of genomic sequences discussed herein.
Affiliation(s)
- Christoforos Nikolaou
- Institute of Biology, National Research Center for Physical Sciences Demokritos, 15310, Athens, Greece
|
3
|
Nikolaou C, Almirantis Y. Measuring the coding potential of genomic sequences through a combination of triplet occurrence patterns and RNY preference. J Mol Evol 2005; 59:309-16. [PMID: 15553086 DOI: 10.1007/s00239-004-2626-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
The distribution of n-tuplet frequencies is shown to strongly correlate with functionality when examining a genomic sequence in a reading-frame specific manner. The approach described herein applies a coarse-graining procedure, which is able to reveal aspects of triplet usage that are related to protein coding, while at the same time remaining species independent, based on a simple summation of suitable triplet occurrence measures. These quantities are ratios of simple frequencies to suitable mononucleotide-frequency products promoting the incidence of the RNY motif, preferred in the most widely used codons. A significant distinction of coding and noncoding sequences is achieved.
Affiliation(s)
- Christoforos Nikolaou
- Institute of Biology, National Research Center for Physical Sciences Demokritos, Athens, Greece
|
4
|
Apostolico A, Bock ME, Lonardi S. Monotony of surprise and large-scale quest for unusual words. J Comput Biol 2004; 10:283-311. [PMID: 12935329 DOI: 10.1089/10665270360688020] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Indexed: 11/13/2022] Open
Abstract
The problem of characterizing and detecting recurrent sequence patterns such as substrings or motifs and related associations or rules is variously pursued in order to compress data, unveil structure, infer succinct descriptions, extract and classify features, etc. In molecular biology, exceptionally frequent or rare words in bio-sequences have been implicated in various facets of biological function and structure. The discovery, particularly on a massive scale, of such patterns poses interesting methodological and algorithmic problems and often exposes scenarios in which tables and synopses grow faster and bigger than the raw sequences they are meant to encapsulate. In a previous study, the ability to succinctly compute, store, and display unusual substrings has been linked to a subtle interplay between the combinatorics of the subwords of a word and local monotonicities of some scores used to measure the departure from expectation. In this paper, we carry out an extensive analysis of such monotonicities for a broader variety of scores. This supports the construction of data structures and algorithms capable of performing global detection of unusual substrings in time and space linear in the subject sequences, under various probabilistic models.
Affiliation(s)
- Alberto Apostolico
- Department of Computer Sciences, Purdue University, West Lafayette, IN 47907, USA.
|
5
|
Nikolaou C, Almirantis Y. Mutually symmetric and complementary triplets: differences in their use distinguish systematically between coding and non-coding genomic sequences. J Theor Biol 2003; 223:477-87. [PMID: 12875825 DOI: 10.1016/s0022-5193(03)00123-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The general property of asymmetry in word use in meaningful texts written in a variety of languages motivates a quantification of the differences in the use of mutually symmetric triplets in genomic sequences. When this is done in the three reading frames, high values found for one of them are used as an indication that the sequence is coding for a protein. Moreover, a similar quantification of the differences in the use of complementary triplets is introduced, again with predictive power of the coding character of a sequence. This method reflects the non-equivalence between sense and anti-sense strand of a coding segment. In both approaches, "linguistic asymmetry" in coding sequences is related to the form of the genetic code and to the bias in codon usage and amino acid use skews.
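As an illustration only, the idea of comparing mutually symmetric triplets frame by frame can be sketched as follows. The abstract does not give the exact quantification, so the score used here (the summed absolute difference between the frequency of each triplet and that of its mirror image) is an assumption, and the function names are hypothetical.

```python
from collections import Counter

def frame_triplet_frequencies(seq, frame):
    """Relative frequency of each non-overlapping triplet read in one frame (0, 1 or 2)."""
    triplets = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    counts = Counter(triplets)
    total = len(triplets)
    return {t: n / total for t, n in counts.items()} if total else {}

def symmetric_asymmetry(seq, frame):
    """Assumed score: sum of |f(xyz) - f(zyx)| over mirror-image triplet pairs."""
    freqs = frame_triplet_frequencies(seq, frame)
    seen = set()
    score = 0.0
    for triplet in freqs:
        mirror = triplet[::-1]
        # Self-mirrored triplets (e.g. ATA) carry no asymmetry; count each pair once.
        if triplet == mirror or triplet in seen:
            continue
        seen.update((triplet, mirror))
        score += abs(freqs[triplet] - freqs.get(mirror, 0.0))
    return score

def frame_scores(seq):
    """Asymmetry score for each of the three reading frames."""
    return [symmetric_asymmetry(seq, frame) for frame in range(3)]
```

Under this assumed score, a frame whose value stands well above the other two would be the candidate coding frame; any threshold would have to be calibrated on known coding and non-coding sequences.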
Affiliation(s)
- Christoforos Nikolaou
- National Research Center for Physical Sciences Demokritos, Institute of Biology, 15310 Athens, Greece
|
6
|
Styriak I, Pristas P, Javorský P. Lack of GATC sites in the genome of Streptococcus bovis bacteriophage F4. Res Microbiol 2000; 151:285-9. [PMID: 10875285 DOI: 10.1016/s0923-2508(00)00148-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 10/18/2022]
Abstract
A strong bias against GATC sites was observed in the genome of phage F4, a lytic Streptococcus bovis bacteriophage. Only three GATC sites were found within the 60.4-kbp genome of this phage. The comparative lack of GATC sequences within the F4 genome was probably not due to dam methylation, as no modification within this site was detected using methylation-sensitive isoschizomer pair restriction endonuclease analysis. The short oligonucleotide composition of available S. bovis DNA sequences suggested the existence of an unknown mechanism for counterselection of GATC sites in S. bovis bacteriophages.
Affiliation(s)
- I Styriak
- Institute of Animal Physiology of the Slovak Academy of Sciences, Kosice.
|
7
|
van Helden J, del Olmo M, Pérez-Ortín JE. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals. Nucleic Acids Res 2000; 28:1000-10. [PMID: 10648794 PMCID: PMC102588 DOI: 10.1093/nar/28.4.1000] [Citation(s) in RCA: 108] [Impact Index Per Article: 4.5] [Received: 09/07/1999] [Revised: 12/22/1999] [Accepted: 12/22/1999] [Indexed: 11/14/2022] Open
Abstract
The study of a few genes has permitted the identification of three elements that constitute a yeast polyadenylation signal: the efficiency element (EE), the positioning element and the actual site for cleavage and polyadenylation. In this paper we perform an analysis of oligonucleotide composition on the sequences located downstream of the stop codon of all yeast genes. Several oligonucleotide families appear over-represented with a high significance (referred to herein as 'words'). The family with the highest over-representation includes the oligonucleotides shown experimentally to play a role as EEs. The word with the highest score is TATATA, followed, among others, by a series of single-nucleotide variants (TATGTA, TACATA, TAAATA.) and one-letter shifts (ATATAT). A position analysis reveals that those words have a high preference to be in 3' flanks of yeast genes and there they have a very uneven distribution, with a marked peak around 35 bp after the stop codon. Of the predicted ORFs, 85% show one or more of those sequences. Similar results were obtained using a data set of EST sequences. Other clusters of over-represented words are also detected, namely T- and A-rich signals. Using these results and previously known data we propose a general model for the 3' trailers of yeast mRNAs.
Affiliation(s)
- J van Helden
- Unité de Conformation des Macromolécules Biologiques, Université Libre de Bruxelles, CP 160/16, 50 avenue F.D. Roosevelt, B-1050 Bruxelles, Belgium.
|
8
|
Abstract
In this paper, we give an overview about the different results existing on the statistical distribution of word counts in a Markovian sequence of letters. Results concerning the number of overlapping occurrences, the number of renewals and the number of clumps will be presented. Counts of single words and also multiple words are considered. Most of the results are approximations as the length of the sequence tends to infinity. We will see that Gaussian approximations switch to (compound) Poisson approximations for rare words. Modeling DNA sequences or proteins by stationary Markov chains, these results can be used to study the statistical frequency of motifs in a given sequence.
Affiliation(s)
- S Schbath
- Institut National de la Recherche Agronomique, Unité de Biométrie, Jouy-en-Josas, France.
|
9
|
Lobzin VV, Chechetkin VR. Order and correlations in genomic DNA sequences. The spectral approach. Uspekhi Fizicheskikh Nauk 2000. [DOI: 10.3367/ufnr.0170.200001c.0057] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 11/01/2022]
|
10
|
van Helden J, André B, Collado-Vides J. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J Mol Biol 1998; 281:827-42. [PMID: 9719638 DOI: 10.1006/jmbi.1998.1947] [Citation(s) in RCA: 409] [Impact Index Per Article: 15.7] [Indexed: 11/22/2022]
Abstract
We present here a simple and fast method allowing the isolation of DNA binding sites for transcription factors from families of coregulated genes, with results illustrated in Saccharomyces cerevisiae. Although conceptually simple, the algorithm proved efficient for extracting, from most of the yeast regulatory families analyzed, the upstream regulatory sequences which had been previously found by experimental analysis. Furthermore, putative new regulatory sites are predicted within upstream regions of several regulons. The method is based on the detection of over-represented oligonucleotides. A specificity of this approach is to define the statistical significance of a site based on tables of oligonucleotide frequencies observed in all non-coding sequences from the yeast genome. In contrast with heuristic methods, this oligonucleotide analysis is rigorous and exhaustive. Its range of detection is however limited to relatively simple patterns: short motifs with a highly conserved core. These features seem to be shared by a good number of regulatory sites in yeast. This, and similar methods, should be increasingly required to identify unknown regulatory elements within the numerous new coregulated families resulting from measurements of gene expression levels at the genomic scale. All tools described here are available on the web at the site http://copan.cifn.unam.mx/Computational_Biology/yeast-tools.
Affiliation(s)
- J van Helden
- Centro de Investigación sobre Fijación de Nitrógeno, Universidad Nacional Autónoma de México, AP565A Cuernavaca, Morelos, 62100, México.
|
11
|
Pesole G, Attimonelli M, Saccone C. Linguistic analysis of nucleotide sequences: algorithms for pattern recognition and analysis of codon strategy. Methods Enzymol 1996; 266:281-94. [PMID: 8743690 DOI: 10.1016/s0076-6879(96)66019-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 02/01/2023]
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
12
|
Schbath S, Prum B, de Turckheim E. Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J Comput Biol 1995; 2:417-37. [PMID: 8521272 DOI: 10.1089/cmb.1995.2.417] [Citation(s) in RCA: 69] [Impact Index Per Article: 2.4] [Indexed: 01/31/2023] Open
Abstract
Identifying exceptional motifs is often used for extracting information from long DNA sequences. The two difficulties of the method are the choice of the model that defines the expected frequencies of words and the approximation of the variance of the difference T(W) between the number of occurrences of a word W and its estimation. We consider here different Markov chain models, either with stationary or periodic transition probabilities. We estimate the variance of the difference T(W) by the conditional variance of the number of occurrences of W given the oligonucleotides counts that define the model. Two applications show how to use asymptotically standard normal statistics associated with the counts to describe a given sequence in terms of its outlying words. Sequences of Escherichia coli and of Bacillus subtilis are compared with respect to their exceptional tri- and tetranucleotides. For both bacteria, exceptional 3-words are mainly found in the coding frame. E. coli palindrome counts are analyzed in different models, showing that many overabundant words are one-letter mutations of avoided palindromes.
Affiliation(s)
- S Schbath
- INRA, Département de Biométrie et Intelligence Artificielle, Jouy-en-Josas, France
|
13
|
Pesole G, Attimonelli M, Saccone C. Linguistic approaches to the analysis of sequence information. Trends Biotechnol 1994; 12:401-8. [PMID: 7765386 DOI: 10.1016/0167-7799(94)90028-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 01/27/2023]
Abstract
Biological macromolecules have many features that resemble modern languages. Thus, linguistic approaches to the analysis of sequence information are becoming powerful tools for deciphering genetic texts. The methodologies used, to date, to determine the global parameters of the genetic language and meaningful patterns within it are described.
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
14
|
Abstract
Nucleotide and amino acid sequences can be analyzed and compared by their oligomer compositions. Such methods are fundamentally different from comparison methods based on sequence alignment. They are analogous to the linguistic analysis of human texts. The methods have a wide range of sensitivity and can identify homologous as well as functionally and taxonomically related sequences. Significant sequence dissimilarity can also be identified enabling detection of foreign DNA sequences in genomes, genetic libraries and databases. The simplicity and speed of linguistic methods make them very suitable for database searching and maintenance and as a preliminary step to more specific and time-consuming analysis methods.
Affiliation(s)
- S Pietrokovski
- Department of Structural Biology, Weizmann Institute of Science, Rehovot, Israel
|
15
|
|
16
|
Abstract
We improved an already existing formula for calculating the probability of occurrence of specific oligomers (Grob & Stüber, 1987) by taking into account unequal base distribution. This method identifies specific oligomers in a given sequence as candidates for biological signals.
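The authors' formula is not reproduced in this abstract. As a hedged illustration of what taking unequal base distribution into account can mean, the sketch below computes the expected number of occurrences of an oligomer when bases are assumed independent with their observed frequencies; the function names are hypothetical and this is not necessarily the published formula.

```python
from collections import Counter

def base_frequencies(seq):
    """Observed relative frequency of each base in seq."""
    counts = Counter(seq)
    total = len(seq)
    return {base: n / total for base, n in counts.items()}

def expected_oligomer_count(seq, oligomer):
    """Expected overlapping occurrences of oligomer, assuming independent bases
    drawn with the frequencies observed in seq (an assumption of this sketch)."""
    freqs = base_frequencies(seq)
    prob = 1.0
    for base in oligomer:
        prob *= freqs.get(base, 0.0)
    positions = max(len(seq) - len(oligomer) + 1, 0)
    return positions * prob

def observed_oligomer_count(seq, oligomer):
    """Number of overlapping occurrences of oligomer in seq."""
    k = len(oligomer)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == oligomer)
```

Oligomers whose observed count departs strongly from the expected count are the candidates for biological signals that the abstract refers to.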
Affiliation(s)
- E E Stückle
- Max-Planck-Institut für Immunbiologie, Freiburg, Germany
|
17
|
Abstract
Statistical approaches help in the determination of significant configurations in protein and nucleic acid sequence data. Three recent statistical methods are discussed: (i) score-based sequence analysis that provides a means for characterizing anomalies in local sequence text and for evaluating sequence comparisons; (ii) quantile distributions of amino acid usage that reveal general compositional biases in proteins and evolutionary relations; and (iii) r-scan statistics that can be applied to the analysis of spacings of sequence markers.
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, CA 94305
|
18
|
Pesole G, Prunella N, Liuni S, Attimonelli M, Saccone C. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res 1992; 20:2871-5. [PMID: 1614873 PMCID: PMC336935 DOI: 10.1093/nar/20.11.2871] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.2] [Indexed: 12/27/2022] Open
Abstract
We present here a fast and sensitive method designed to isolate short nucleotide sequences which have non-random statistical properties and may thus be biologically active. It is based on a first order Markov analysis and allows us to detect statistically significant sequence motifs from six to ten nucleotides long which are significantly shared (or avoided) in the sequences under investigation. This method has been tested on a set of 521 sequences extracted from the Eukaryotic Promoter Database (2). Our results demonstrate the accuracy and the efficiency of the method in that the sequence motifs which are known to act as eukaryotic promoters, such as the TATA-box and the CAAT-box, were clearly identified. In addition we have found other statistically significant motifs, the biological roles of which are yet to be clarified.
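A first-order Markov analysis predicts how often a word should occur from mono- and dinucleotide counts alone. The sketch below shows that textbook estimate; it is not taken from the WORDUP source code, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Counts of all overlapping k-mers in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def markov1_expected_count(seq, word):
    """Expected count of word under a first-order Markov model fitted to seq:
    N(w1w2) * N(w2w3) * ... * N(w[k-1]w[k]) / (N(w2) * ... * N(w[k-1]))."""
    if len(word) < 2:
        raise ValueError("word must have at least two letters")
    mono = kmer_counts(seq, 1)
    di = kmer_counts(seq, 2)
    expected = float(di[word[:2]])
    for i in range(1, len(word) - 1):
        if mono[word[i]] == 0:
            return 0.0
        # Multiply by the estimated transition probability word[i] -> word[i + 1].
        expected *= di[word[i:i + 2]] / mono[word[i]]
    return expected
```

Comparing this expectation with the observed count of each word, across the set of sequences under study, is what allows shared or avoided motifs to be flagged; the significance test itself is not shown here.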
Affiliation(s)
- G Pesole
- Dipartimento di Biochimica e Biologia Molecolare, Università di Bari, Italy
|
19
|
Nussinov R. DNA sequences at and between the GC and TATA boxes: potential DNA looping and spatial juxtapositioning of the protein factors. J Biomol Struct Dyn 1992; 9:1213-37. [PMID: 1637510 DOI: 10.1080/07391102.1992.10507988] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/28/2022]
Abstract
Regulation of gene expression in eukaryotes involves a complex assembly of DNA recognition sequence elements and their respective protein factors. The upstream promoter/enhancer sequences are position and orientation independent. Despite their variable distances from the TATA box and transcription start site, interaction between the protein activators and TATA general transcription factors takes place, enabling induced levels of transcription initiation. Here the intervening sequences between the GC and TATA boxes are examined as functions of their lengths. Regardless of the substantial differences in the spacer sizes, similar mono- and dinucleotide distributions are noted. Purine-purine base pair steps, except for AA, are more frequent at and near the GC box in the 5' ends of the loops than in their 3' ends. Pyrimidine-pyrimidine base pair steps, except for TT, behave similarly. AT and TA (as well as AA and TT) are more frequent in the 3' ends of the loops near the TATA. Examination of these distributions, as well as of the sequences composing the GC and TATA boxes, indicates that the DNA in the upstream part of the loop is more rigid, whereas the downstream regions are far more flexible. The flexibility of the general TATA region may afford correct spatial juxtapositioning of the proteins with respect to each other, enabling interactions between the activators and the general transcription factors.
Affiliation(s)
- R Nussinov
- Laboratory of Mathematical Biology, NCI-Frederick Cancer Research and Developmental Center, Maryland 21702-1201
|
20
|
|
21
|
Karlin S, Burge C, Campbell AM. Statistical analyses of counts and distributions of restriction sites in DNA sequences. Nucleic Acids Res 1992; 20:1363-70. [PMID: 1313968 PMCID: PMC312184 DOI: 10.1093/nar/20.6.1363] [Citation(s) in RCA: 94] [Impact Index Per Article: 2.9] [Indexed: 12/26/2022] Open
Abstract
Counts and spacings of all 4- and 6-bp palindromes in DNA sequences from a broad range of organisms were investigated. Both 4- and 6-bp average palindrome counts were significantly low in all bacteriophages except one, probably as a means of avoiding restriction enzyme cleavage. The exception, T4 of normal 4- and 6-palindrome counts, putatively derives protection from modification of cytosine to hydroxymethylcytosine plus glycosylation. The counts and distributions of 4-bp and of 6-bp restriction sites in bacterial species are variable. Bacterial cells with multiple restriction systems for 4-bp or 6-bp target specificities are low in aggregate 4- or 6-bp palindrome counts/kb, respectively, but bacterial cells lacking exact 4-cutter enzymes generally show normal or high counts of 4-bp palindromes when compared with random control sequences of comparable nucleotide frequencies. For example, E. coli, apparently without an exact 4-bp target restriction endonuclease (see text), contains normal aggregate 4-palindrome counts/kb, while B. subtilis, which abounds with 4-bp restriction systems, shows a significant under-representation of 4-palindrome counts. Both E. coli and B. subtilis have many 6-bp restriction enzymes and concomitantly diminished aggregate 6-palindrome counts/kb. Eukaryote, viral, and organelle sequences generally have aggregate 4- and 6-palindromic counts/kb in the normal range. Interpretations of these results are given in terms of restriction/methylation regimes, recombination and transcription processes, and possible structural and regulatory roles of 4- and 6-bp palindromes.
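The aggregate palindrome counts per kb discussed above can be computed along these lines. This is a minimal sketch, assuming a palindrome means a word equal to its own reverse complement and that overlapping occurrences on one strand are counted; the comparison with random control sequences is not shown, and the function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(word):
    """Reverse complement of a DNA word over the alphabet ACGT."""
    return "".join(COMPLEMENT[base] for base in reversed(word))

def palindromes(k):
    """All k-bp words that equal their own reverse complement."""
    words = ("".join(letters) for letters in product("ACGT", repeat=k))
    return {w for w in words if w == reverse_complement(w)}

def palindrome_count_per_kb(seq, k):
    """Aggregate count of k-bp palindromes per 1000 bases of seq."""
    if not seq:
        return 0.0
    pals = palindromes(k)
    hits = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in pals)
    return 1000.0 * hits / len(seq)
```

There are 16 palindromes of length 4 and 64 of length 6 under this definition, so the same function covers both cases examined in the abstract.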
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, CA 94305
|
22
|
Pizzi E, Attimonelli M, Liuni S, Frontali C, Saccone C. A simple method for global sequence comparison. Nucleic Acids Res 1992; 20:131-6. [PMID: 1738591 PMCID: PMC310336 DOI: 10.1093/nar/20.1.131] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.2] [Indexed: 12/28/2022] Open
Abstract
A simple method of sequence comparison, based on a correlation analysis of oligonucleotide frequency distributions, is here shown to be a reliable test of overall sequence similarity. The method does not involve sequence alignment procedures and permits the rapid screening of large amounts of sequence data. It identifies those sequences which deserve more careful analysis of sequence similarity at the level of resolution of the single nucleotide. It uses observed quantities only and does not involve the adoption of any theoretical model.
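An alignment-free comparison of this kind can be sketched as a correlation between oligonucleotide frequency vectors. The choice of Pearson correlation, the default word length and the function names below are assumptions made for illustration; the abstract does not state which correlation measure or word length the authors used.

```python
from collections import Counter
from itertools import product
from math import sqrt

def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k oligonucleotides, in a fixed order.
    seq must contain at least k letters."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return [counts["".join(w)] / total for w in product("ACGT", repeat=k)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors.
    Undefined (division by zero) if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def sequence_similarity(seq_a, seq_b, k=2):
    """Correlation between the k-mer frequency distributions of two sequences."""
    return pearson(kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k))
```

Because only observed frequencies enter the calculation, no alignment and no theoretical model are involved, which matches the screening role the abstract describes.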
Affiliation(s)
- E Pizzi
- Centro Studi Mitocondri e Metabolismo Energetico CNR, Dipartimento di Biochimica e Biologia Molecolare, University of Bari, Italy
|
23
|
Abstract
A statistical method based on r-fragments, sums of distances between (r + 1) consecutive restriction enzyme sites, is introduced for detecting nonrandomness in the distribution of markers in sequence data. The technique is applicable whenever large numbers of markers are available and will detect clumping, excessive dispersion or too much evenness of spacing of the markers. It is particularly adapted to varying the scale on which inhomogeneities can be detected, from nearest neighbor interactions to more distant interactions. The r-fragment procedure is applied primarily to the Kohara et al. (1) physical map of E. coli. Other applications to DAM methylation sites in E. coli and NotI sites in human chromosome 21 are presented. Restriction sites for the eight enzymes used in (1) appear to be randomly distributed, although at widely differing densities. These conclusions are substantially in agreement with the analysis of Churchill et al. (3). Extreme variability in the density of the eight restriction enzyme sites cannot be explained by variability in mono-, di- or trinucleotide frequencies.
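Following the definition given in the abstract, an r-fragment is the distance spanned by r + 1 consecutive sites. The sketch below computes these lengths from a list of site positions; the function names are hypothetical, and the significance assessment against a random-placement model is not shown.

```python
def r_fragments(positions, r):
    """Lengths spanned by every run of r + 1 consecutive marker positions,
    i.e. the sum of r successive inter-site distances."""
    if r < 1:
        raise ValueError("r must be at least 1")
    pos = sorted(positions)
    return [pos[i + r] - pos[i] for i in range(len(pos) - r)]

def extreme_r_fragments(positions, r):
    """Shortest and longest r-fragment, or None if there are too few sites."""
    fragments = r_fragments(positions, r)
    return (min(fragments), max(fragments)) if fragments else None
```

Unusually short minima point to clumping and unusually long maxima to excessive dispersion; raising r changes the scale at which such inhomogeneities are examined.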
Affiliation(s)
- S Karlin
- Department of Mathematics, Stanford University, CA 94305
|