51
Menconi G, Benci V, Buiatti M. Data compression and genomes: a two-dimensional life domain map. J Theor Biol 2008; 253:281-8. [PMID: 18430439 DOI: 10.1016/j.jtbi.2008.03.011] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 09/27/2007] [Revised: 03/10/2008] [Accepted: 03/11/2008] [Indexed: 10/22/2022]
Abstract
We define the complexity of DNA sequences as the information content per nucleotide, calculated by means of a Lempel-Ziv data compression algorithm. The statistics of the complexity values of the functional regions of different complete genomes can be used to distinguish among genomes of different domains of life (Archaea, Bacteria and Eukarya). We focus on the distribution function of the complexity of non-coding regions. We show that the three domains may be plotted in separate regions within the two-dimensional space where the axes are the skewness coefficient and the kurtosis coefficient of the aforementioned distribution. Preliminary results on 15 genomes are introduced.
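The two-step idea in this abstract (a compression-based complexity per nucleotide, then the skewness and kurtosis of its distribution over regions) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `zlib` (DEFLATE, which builds on LZ77) stands in for the unspecified Lempel-Ziv compressor, the region sequences are synthetic, and the function names are hypothetical.

```python
import random
import zlib


def complexity_per_nucleotide(seq: str) -> float:
    """Compressed size in bits divided by sequence length.

    Assumption: zlib is used as a stand-in for a Lempel-Ziv compressor.
    Its fixed header overhead inflates the value for very short sequences.
    """
    if not seq:
        raise ValueError("empty sequence")
    compressed = zlib.compress(seq.encode("ascii"), 9)
    return 8.0 * len(compressed) / len(seq)


def skewness_and_kurtosis(values):
    """Return the standardized third and fourth central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Synthetic "regions": 20 random sequences plus one highly repetitive one.
rng = random.Random(0)
regions = ["".join(rng.choice("ACGT") for _ in range(2000)) for _ in range(20)]
regions.append("AT" * 1000)

values = [complexity_per_nucleotide(r) for r in regions]
skew, kurt = skewness_and_kurtosis(values)
print(f"skewness={skew:.2f} kurtosis={kurt:.2f}")
```

Each genome would contribute one (skewness, kurtosis) point to the two-dimensional map; here the single low-complexity outlier pulls the skewness negative.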
Affiliation(s)
- Giulia Menconi
- Dipartimento di Matematica Applicata, Università di Pisa, Via Buonarroti 1C-56127, Pisa, Italy.
52
Model of perfect tandem repeat with random pattern and empirical homogeneity testing poly-criteria for latent periodicity revelation in biological sequences. Math Biosci 2008; 211:186-204. [DOI: 10.1016/j.mbs.2007.10.008] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 03/09/2007] [Revised: 10/19/2007] [Accepted: 10/26/2007] [Indexed: 11/23/2022]
53
Abstract
Fast-sequencing throughput methods have increased the number of completely sequenced bacterial genomes to about 400 by December 2006, with the number increasing rapidly. These include several strains. In silico methods of comparative genomics are of use in categorizing and phylogenetically sorting these bacteria. Various word-based tools have been used for quantifying the similarities and differences between entire genomes. The simple di-nucleotide frequency comparison, codon specificity and k-mer repeat detection are among some of the well-known methods. In this paper, we show that the Mutual Information function, which is a measure of correlations and a concept from Information Theory, is very effective in determining the similarities and differences among genome sequences of various strains of bacteria such as the plant pathogen Xylella fastidiosa, marine Cyanobacteria Prochlorococcus marinus or animal and human pathogens such as species of Ehrlichia and Legionella. The short-range three-base periodicity, small sequence repeats and long-range correlations taken together constitute a genome signature that can be used as a technique for identifying new bacterial strains with the help of strains already catalogued in the database. There have been several applications of using the Mutual Information function as a measure of correlations in genomics but this is the first whole genome analysis done to detect strain similarities and differences.
Affiliation(s)
- D Swati
- Department of Physics, MMV, Banaras Hindu University, Varanasi 221005, India.
54
Martignetti L, Caselle M. Universal power law behaviors in genomic sequences and evolutionary models. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 76:021902. [PMID: 17930060 DOI: 10.1103/physreve.76.021902] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 02/20/2007] [Indexed: 05/25/2023]
Abstract
We study the length distribution of a particular class of DNA sequences known as the 5' untranslated regions exons. These exons belong to the messenger RNA of protein coding genes, but they are not coding (they are located upstream of the coding portion of the mRNA) and are thus less constrained from an evolutionary point of view. We show that in both mice and humans these exons show a very clean power law decay in their length distribution and suggest a simple evolutionary model, which may explain this finding. We conjecture that this power law behavior could indeed be a general feature of higher eukaryotes.
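A power-law decay in a length distribution is commonly quantified by estimating its exponent. The sketch below uses the standard continuous maximum-likelihood estimator for p(x) ∝ x^(-alpha) above a cutoff; the abstract does not say how the authors fitted their data, so the estimator, the cutoff `x_min`, and the synthetic sample are all assumptions made for illustration.

```python
import math
import random


def power_law_exponent(lengths, x_min):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min."""
    tail = [x for x in lengths if x >= x_min]
    if len(tail) < 2:
        raise ValueError("need at least two observations above x_min")
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("all observations equal x_min")
    return 1.0 + len(tail) / log_sum


# Synthetic lengths drawn from a power law by inverse-transform sampling.
rng = random.Random(1)
alpha_true = 2.5
x_min = 10.0
sample = [
    x_min * (1.0 - rng.random()) ** (-1.0 / (alpha_true - 1.0))
    for _ in range(20000)
]

alpha_hat = power_law_exponent(sample, x_min)
print(f"estimated exponent: {alpha_hat:.3f}")
```

For real exon lengths, which are integers, a discrete estimator and a goodness-of-fit check would be more appropriate than this continuous approximation.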
Affiliation(s)
- Loredana Martignetti
- Dipartimento di Fisica Teorica, Università di Torino and INFN, Via Pietro Giuria 1, I-10125 Torino, Italy.
55
Oiwa NN. The nucleotide sequence and the local electronic structure. JOURNAL OF PHYSICS. CONDENSED MATTER : AN INSTITUTE OF PHYSICS JOURNAL 2007; 19:181001. [PMID: 21690977 DOI: 10.1088/0953-8984/19/18/181001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Affiliation(s)
- Nestor Norio Oiwa
- Núcleo de Neurociências e Comportamento, Departamento de Psicologia Experimental, Instituto de Psicologia, Universidade de São Paulo, Brazil. Departamento de Física Geral, Instituto de Física, Universidade de São Paulo, Brazil
56
Nicolay S, Brodie Of Brodie EB, Touchon M, Audit B, d'Aubenton-Carafa Y, Thermes C, Arneodo A. Bifractality of human DNA strand-asymmetry profiles results from transcription. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2007; 75:032902. [PMID: 17500744 DOI: 10.1103/physreve.75.032902] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 07/08/2006] [Indexed: 05/15/2023]
Abstract
We use the wavelet transform modulus maxima method to investigate the multifractal properties of strand-asymmetry DNA walk profiles in the human genome. This study reveals the bifractal nature of these profiles, which involve two competing scale-invariant (up to repeat-masked distances ≲ 40 kbp) components characterized by Hölder exponents h_1 = 0.78 and h_2 = 1, respectively. The former corresponds to the long-range-correlated homogeneous fluctuations previously observed in DNA walks generated with structural codings. The latter is associated with the presence of jumps in the original strand-asymmetry noisy signal S. We show that a majority of upward (downward) jumps co-locate with gene transcription start (end) sites. Here 7228 human gene transcription start sites from the refGene database are found within 2 kbp from an upward jump of amplitude ΔS ≥ 0.1, which suggests that about 36% of annotated human genes present significant transcription-induced strand asymmetry and very likely high expression rate.
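The strand-asymmetry signal S, its cumulative walk, and the upward jumps of amplitude ΔS ≥ 0.1 can be illustrated with a short sketch. The abstract does not define S; the code assumes the commonly used total skew S = (T-A)/(T+A) + (G-C)/(G+C) computed in non-overlapping windows, and detects jumps by simple differencing rather than by the wavelet method used in the paper. The window size and the toy sequence are hypothetical.

```python
from itertools import accumulate


def window_skew(window: str) -> float:
    """Assumed definition: S = (T-A)/(T+A) + (G-C)/(G+C) for one window."""
    a, c, g, t = (window.count(base) for base in "ACGT")
    ta = (t - a) / (t + a) if (t + a) else 0.0
    gc = (g - c) / (g + c) if (g + c) else 0.0
    return ta + gc


def skew_profile(seq: str, window: int):
    """Skew in consecutive non-overlapping windows."""
    return [
        window_skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


def upward_jumps(profile, threshold=0.1):
    """Indices i where the skew rises by at least `threshold` between windows i and i+1."""
    return [
        i for i in range(len(profile) - 1)
        if profile[i + 1] - profile[i] >= threshold
    ]


# Toy sequence: a compositionally balanced half followed by a G/T-rich half.
seq = "ACGT" * 50 + "GGTT" * 50
profile = skew_profile(seq, window=40)
walk = list(accumulate(profile))  # cumulative strand-asymmetry walk
jumps = upward_jumps(profile)
print(profile, jumps)
```

On real data the jump positions would then be compared with annotated transcription start sites, which this sketch does not attempt.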
Affiliation(s)
- S Nicolay
- Laboratoire Joliot-Curie and Laboratoire de Physique, UMR 5672, CNRS, ENS-Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France
57
Cattani C, D'Auria CR. Correlations in DNA sequences. JOURNAL OF INFORMATION & OPTIMIZATION SCIENCES 2007. [DOI: 10.1080/02522667.2007.10699728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
58
Li W, Miramontes P. Large-scale oscillation of structure-related DNA sequence features in human chromosome 21. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2006; 74:021912. [PMID: 17025477 DOI: 10.1103/physreve.74.021912] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 03/24/2006] [Indexed: 05/12/2023]
Abstract
Human chromosome 21 is the only chromosome in the human genome that exhibits oscillation of the (G+C) content with a cycle length of hundreds of kilobases (kb) (approximately 500 kb near the right telomere). We aim at establishing the existence of a similar periodicity in structure-related sequence features in order to relate this (G+C)% oscillation to other biological phenomena. The following quantities are shown to oscillate with the same 500 kb periodicity in human chromosome 21: binding energy calculated by two sets of dinucleotide-based thermodynamic parameters, AA/TT and AAA/TTT bi- and tri-nucleotide density, 5'-TA-3' dinucleotide density, and signal for 10- or 11-base periodicity of AA/TT or AAA/TTT. These intrinsic quantities are related to structural features of the double helix of DNA molecules, such as base-pair binding, untwisting or unwinding, stiffness, and a putative tendency for nucleosome formation.
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, Feinstein Institute for Medical Research, North Shore LIJ Health System, 350 Community Drive, Manhasset, New York 11030, USA.
59
Shih CT. Characteristic length scale of electric transport properties of genomes. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2006; 74:010903. [PMID: 16907054 DOI: 10.1103/physreve.74.010903] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/04/2005] [Indexed: 05/11/2023]
Abstract
A tight-binding model together with a statistical method is used to investigate the relation between the sequence-dependent electric transport properties and the sequences of protein-coding regions of complete genomes. A correlation parameter Omega is defined to analyze the relation. For some particular propagation length w_max, the transport behaviors of the coding and noncoding sequences are very different and the correlation reaches its maximal value Omega_max. w_max and Omega_max are characteristic values for each species. A possible reason for the difference between the features of transport properties in the coding and noncoding regions is the mechanism of DNA damage repair processes together with natural selection.
Affiliation(s)
- C T Shih
- Department of Physics, Tunghai University, Taichung, Taiwan
60
Bailly-Bechet M, Danchin A, Iqbal M, Marsili M, Vergassola M. Codon usage domains over bacterial chromosomes. PLoS Comput Biol 2006; 2:e37. [PMID: 16683018 PMCID: PMC1447655 DOI: 10.1371/journal.pcbi.0020037] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.1] [Received: 11/21/2005] [Accepted: 03/13/2006] [Indexed: 11/19/2022] Open
Abstract
The geography of codon bias distributions over prokaryotic genomes and its impact upon chromosomal organization are analyzed. To this aim, we introduce a clustering method based on information theory, specifically designed to cluster genes according to their codon usage, and apply it to the coding sequences of Escherichia coli and Bacillus subtilis. One of the clusters identified in each of the organisms is found to be related to expression levels, as expected, but other groups feature an over-representation of genes belonging to different functional groups, namely horizontally transferred genes, motility, and intermediary metabolism. Furthermore, we show that genes with a similar bias tend to be close to each other on the chromosome and organized in coherent domains, more extended than operons, demonstrating a role of translation in structuring bacterial chromosomes. It is argued that a sizeable contribution to this effect comes from the dynamical compartmentalization induced by the recycling of tRNAs, leading to gene expression rates dependent on their genomic and expression context.
Affiliation(s)
- Marc Bailly-Bechet
- CNRS URA 2171, Institute Pasteur, Unité Génétique in silico, Paris, France
- Antoine Danchin
- CNRS URA 2171, Institute Pasteur, Unité Génétique des Génomes Bactériens, Paris, France
- Mudassar Iqbal
- Abdus Salam International Center for Theoretical Physics, Trieste, Italy
- Computing Laboratory, University of Kent, Canterbury, Kent, United Kingdom
- Matteo Marsili
- Abdus Salam International Center for Theoretical Physics, Trieste, Italy
- Massimo Vergassola
- CNRS URA 2171, Institute Pasteur, Unité Génétique in silico, Paris, France
- * To whom correspondence should be addressed.
61
Ouyang Z, Liu JK, She ZS. Hierarchical structure analysis describing abnormal base composition of genomes. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2005; 72:041915. [PMID: 16383428 DOI: 10.1103/physreve.72.041915] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/08/2004] [Indexed: 05/05/2023]
Abstract
Abnormal base compositional patterns of genomic DNA sequences are studied in the framework of a hierarchical structure (HS) model originally proposed for the study of fully developed turbulence [She and Lévêque, Phys. Rev. Lett. 72, 336 (1994)]. The HS similarity law is verified over scales between 10^3 bp and 10^5 bp, and the HS parameter beta is proposed to describe the degree of heterogeneity in the base composition patterns. More than one hundred bacteria, archaea, virus, yeast, and human genome sequences have been analyzed and the results show that the HS analysis efficiently captures abnormal base composition patterns, and the parameter beta is a characteristic measure of the genome. Detailed examination of the values of beta reveals an intriguing link to the evolutionary events of genetic material transfer. Finally, a sequence complexity (S) measure is proposed to characterize the gradual increase of organizational complexity of the genome during evolution. The present study raises several interesting issues in the evolutionary history of genomes.
Affiliation(s)
- Zhengqing Ouyang
- State Key Lab for Turbulence and Complex Systems and Center for Theoretical Biology, Peking University, Beijing 100871, People's Republic of China
62
Abstract
Using the complete genome of Plasmodium falciparum 3D7, which has 14 chromosomes, as an example, we have examined the distribution functions for the amount of C or G and A or T in consecutive, non-overlapping blocks of m bases in this system. The function P(S) for the number of consecutive C-G or A-T content clusters conforms to the relation P(S) ∝ e^(-alpha S); values of the scaling exponent alpha(CG) are much larger than alpha(AT); and the values of alpha(AT) hardly change across the 14 chromosomes, whereas those of alpha(CG) fluctuate. We found that the maximum A-T cluster size is much larger than the maximum C-G cluster size, which implies the existence of large A-T clusters. Our study of the width function xi(m) of the cluster C-G content showed that it follows a good power law, xi(m) ∝ m^(-gamma). The average gamma for the 14 chromosomes is 0.931. These investigations provide some insight into the nucleotide clusters of DNA sequences, and help us understand other properties of DNA sequences.
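The cluster-size statistics described above start from counting maximal runs of consecutive C/G or A/T bases. A minimal sketch of that counting step follows; it assumes a cluster is a maximal uninterrupted run, that the input contains only the letters A, C, G and T, and the function name is hypothetical. Fitting the exponents alpha and gamma is not shown.

```python
from collections import Counter
from itertools import groupby


def cluster_sizes(seq: str):
    """Count maximal runs of C/G and of A/T by run length.

    Assumption: the sequence contains only A, C, G, T; any other
    letter would be counted with the A/T class.
    Returns (cg_counts, at_counts), each mapping run length -> occurrences.
    """
    cg_counts, at_counts = Counter(), Counter()
    for is_cg, run in groupby(seq.upper(), key=lambda base: base in "CG"):
        size = sum(1 for _ in run)
        if is_cg:
            cg_counts[size] += 1
        else:
            at_counts[size] += 1
    return cg_counts, at_counts


# Runs: AATTT (A/T, 5), CG (C/G, 2), A (A/T, 1), GGCC (C/G, 4), T (A/T, 1)
cg, at = cluster_sizes("AATTTCGAGGCCT")
print(dict(cg), dict(at))
```

Normalizing each counter by its total gives an empirical P(S), whose logarithm could then be regressed on S to estimate the decay rate.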
Affiliation(s)
- Jun Cheng
- Department of Physics, Jinhua University, Jinhua 321017, China.
63
Discussion of "A Bayesian Approach to DNA Sequence Segmentation". Biometrics 2005. [DOI: 10.1111/j.0006-341x.2005.040701_2.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
64
Messer PW, Arndt PF, Lässig M. Solvable sequence evolution models and genomic correlations. PHYSICAL REVIEW LETTERS 2005; 94:138103. [PMID: 15904043 DOI: 10.1103/physrevlett.94.138103] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 09/24/2004] [Indexed: 05/02/2023]
Abstract
We study a minimal model for genome evolution whose elementary processes are single site mutation, duplication and deletion of sequence regions, and insertion of random segments. These processes are found to generate long-range correlations in the composition of letters as long as the sequence length is growing; i.e., the combined rates of duplications and insertions are higher than the deletion rate. For constant sequence length, on the other hand, all initial correlations decay exponentially. These results are obtained analytically and by simulations. They are compared with the long-range correlations observed in genomic DNA, and the implications for genome evolution are discussed.
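The four elementary processes named in this abstract (single-site mutation, duplication, deletion, and insertion of random segments) are easy to simulate. The sketch below is a toy version only: the fixed segment length, the relative event weights, and the function name are assumptions, and it does not reproduce the authors' rates or their analytical results.

```python
import random


def evolve(seq: str, steps: int, rates: dict, rng: random.Random, segment: int = 5) -> str:
    """Apply `steps` random events to `seq`.

    `rates` maps 'mutation', 'duplication', 'deletion', 'insertion'
    to relative weights (hypothetical values, not taken from the paper).
    """
    bases = list(seq)
    events = list(rates)
    weights = [rates[e] for e in events]
    for _ in range(steps):
        event = rng.choices(events, weights)[0]
        pos = rng.randrange(len(bases)) if bases else 0
        if event == "mutation" and bases:
            bases[pos] = rng.choice("ACGT")  # may redraw the same base
        elif event == "duplication" and bases:
            bases[pos:pos] = bases[pos:pos + segment]  # copy placed next to the original
        elif event == "deletion":
            del bases[pos:pos + segment]
        elif event == "insertion":
            bases[pos:pos] = [rng.choice("ACGT") for _ in range(segment)]
    return "".join(bases)


rng = random.Random(2)
start = "".join(rng.choice("ACGT") for _ in range(100))

# Duplication plus insertion outweigh deletion, so the sequence grows.
grown = evolve(start, 2000, {"mutation": 1.0, "duplication": 0.3,
                             "deletion": 0.2, "insertion": 0.1}, rng)
# Mutation only: the length stays constant.
fixed = evolve(start, 2000, {"mutation": 1.0, "duplication": 0.0,
                             "deletion": 0.0, "insertion": 0.0}, rng)
print(len(start), len(grown), len(fixed))
```

A correlation measure applied to `grown` and to a length-conserving run would be the natural next step for comparing the two regimes the abstract contrasts.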
Affiliation(s)
- Philipp W Messer
- Institute for Theoretical Physics, University of Cologne, Köln, Germany
65
Ouyang Z, Zhu H, Wang J, She ZS. Multivariate entropy distance method for prokaryotic gene identification. J Bioinform Comput Biol 2004; 2:353-73. [PMID: 15297987 DOI: 10.1142/s0219720004000624] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Revised: 07/10/2003] [Indexed: 11/18/2022]
Abstract
A new simple method is found for efficient and accurate identification of coding sequences in prokaryotic genome. The method employs a Shannon description of artificial language for DNA sequences. It consists in translating a DNA sequence into a pseudo-amino acid sequence with 20 fundamental words according to the universal genetic code. With an entropy-density profile (EDP), the method maps a sequence of finite length to a vector and then analyzes its position in the 20-dimensional phase space depending on its nature. It is found that the ratio of the relative distance to an averaged coding and non-coding EDP over a small number (up to one) of open reading frames (ORFs) can serve as a good coding potential. An iterative algorithm is designed for finding a set of "root" sequences using this coding potential. A multivariate entropy distance (MED) algorithm is then proposed for the identification of prokaryotic genes; it has a feature to combine the use of a coding potential and an EDP-based sequence similarity analysis. The current version of MED is unsupervised, parameter-free and simple to implement. It is demonstrated to be able to detect 95-99% genes with 10-30% of additional genes when tested against the RefSeq database of NCBI and to detect 97.5-99.8% of confirmed genes with known functions. It is also shown to be able to find a set of (functionally known) genes that are missed by other well-known gene finding algorithms. All measurements show that the MED algorithm reaches a similar performance level as the algorithms like GeneMark and Glimmer for prokaryotic gene prediction.
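The mapping from a sequence to a 20-dimensional vector can be sketched as follows. The abstract does not give the formula for the entropy-density profile; the code assumes the form s_i = -(1/H) p_i log p_i with H = -Σ_j p_j log p_j, where p_i is the frequency of amino acid i in the translated sequence. Treat that definition, and the function name, as assumptions. The codon table is the standard genetic code.

```python
import math
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
AMINO_ACIDS = sorted(set(AMINO) - {"*"})  # the 20 "words"


def entropy_density_profile(orf: str):
    """Return a 20-component vector under the assumed EDP definition."""
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(CODON_TABLE[c] for c in codons if c in CODON_TABLE)
    counts.pop("*", None)  # ignore stop codons
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no translatable codons")
    terms = []
    for aa in AMINO_ACIDS:
        p = counts[aa] / total
        terms.append(-p * math.log(p) if p > 0 else 0.0)
    entropy = sum(terms)
    if entropy == 0:
        raise ValueError("profile undefined for a single-amino-acid sequence")
    return [t / entropy for t in terms]


profile = entropy_density_profile("ATGGCTGCTAAA")  # Met, Ala, Ala, Lys
print([round(x, 3) for x in profile if x > 0])
```

Distances between such vectors and averaged coding and non-coding profiles would then serve as the coding potential; that step is not shown here.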
Affiliation(s)
- Zhengqing Ouyang
- State Key Lab for Turbulence and Complex Systems and Center for Theoretical Biology, Peking University, Beijing 100871, China
66
Afreixo V, Ferreira PJSG, Santos D. Spectrum and symbol distribution of nucleotide sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2004; 70:031910. [PMID: 15524552 DOI: 10.1103/physreve.70.031910] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 03/26/2004] [Indexed: 05/24/2023]
Abstract
This paper explores the connection between the size of the spectral coefficients of a nucleotide or any other symbolic sequence and the distribution of nucleotides along certain subsequences. It explains the connection between the nucleotide distribution and the size of the spectral coefficients, and gives a necessary and sufficient condition for a coefficient to have a prescribed magnitude. Furthermore, it gives a fast algorithm for computing the value of a given spectral coefficient of a nucleotide sequence, discussing periods 3 and 4 as examples. Finally, it shows that the spectrum of a symbolic sequence is redundant, in the sense that there exists a linear recursion that determines the values of all the coefficients from those of a subset.
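The link between a spectral coefficient and the distribution of a nucleotide along subsequences can be made concrete for period 3. For the indicator sequence of one symbol, the Fourier sum at frequency 1/3 depends only on how many times the symbol occurs at positions congruent to 0, 1 and 2 modulo 3. The sketch below checks this elementary identity numerically; it is not claimed to be the paper's algorithm, and the function names are hypothetical.

```python
import cmath


def coefficient_direct(seq: str, symbol: str, period: int = 3) -> complex:
    """Sum of exp(-2*pi*i*n/period) over positions n where seq[n] == symbol."""
    w = -2j * cmath.pi / period
    return sum(cmath.exp(w * n) for n, base in enumerate(seq) if base == symbol)


def coefficient_from_counts(seq: str, symbol: str, period: int = 3) -> complex:
    """Same coefficient, computed from per-residue-class counts only."""
    counts = [0] * period
    for n, base in enumerate(seq):
        if base == symbol:
            counts[n % period] += 1
    w = -2j * cmath.pi / period
    return sum(c * cmath.exp(w * r) for r, c in enumerate(counts))


seq = "ATGGCAATGGCTATGAAA"
for symbol in "ACGT":
    direct = coefficient_direct(seq, symbol)
    fast = coefficient_from_counts(seq, symbol)
    print(symbol, round(abs(direct), 6), round(abs(fast), 6))
```

Two limiting cases illustrate the magnitude condition: a symbol spread evenly over the three residue classes gives a coefficient of zero, while a symbol confined to one class gives a magnitude equal to its count.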
Affiliation(s)
- Vera Afreixo
- Departamento de Electrónica e Telecomunicações/IEETA, Universidade de Aveiro, 3810-193 Aveiro, Portugal
67
Ouyang Z, Wang C, She ZS. Scaling and hierarchical structures in DNA sequences. PHYSICAL REVIEW LETTERS 2004; 93:078103. [PMID: 15324281 DOI: 10.1103/physrevlett.93.078103] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 08/08/2003] [Indexed: 05/24/2023]
Abstract
A method of analyzing DNA correlation structure is introduced. Density fluctuations of nucleotides are shown to display an extended self-similarity scaling when the scale varies between 100 and 8000 base pairs. The scaling is accurately described by a hierarchical structure model of She and Leveque [Phys. Rev. Lett. 72, 336 (1994)]. The derived model parameter beta is able to quantify moderately large-scale correlations which exist in a true DNA sequence but are absent in its randomly shuffled sequence and in a simulated model sequence by an evolution model of Hsieh et al. [Phys. Rev. Lett. 90, (2003)]. Finally, it is shown that beta varies with the evolution category and measures the organizational complexity of the genome.
Affiliation(s)
- Zhengqing Ouyang
- State Key Lab for Turbulence and Complex Systems and Center for Theoretical Biology, Peking University, Beijing 100871, China
68
Allahverdyan AE, Gevorkian ZS, Hu CK, Wu MC. Unzipping of DNA with correlated base sequence. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2004; 69:061908. [PMID: 15244618 DOI: 10.1103/physreve.69.061908] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Received: 09/17/2002] [Revised: 01/07/2004] [Indexed: 05/24/2023]
Abstract
We consider force-induced unzipping transition for a heterogeneous DNA model with a correlated base sequence. Both finite-range and long-range correlated situations are considered. It is shown that finite-range correlations increase stability of DNA with respect to the external unzipping force. Due to long-range correlations the number of unzipped base pairs displays two widely different scenarios depending on the details of the base sequence: either there is no unzipping phase transition at all, or the transition is realized via a sequence of jumps with magnitude comparable to the size of the system. Both scenarios are different from the behavior of the average number of unzipped base pairs (non-self-averaging). The results can be relevant for explaining the biological purpose of correlated structures in DNA.
Affiliation(s)
- A E Allahverdyan
- Institute for Theoretical Physics, Valckenierstraat 65, 1018 XE Amsterdam, The Netherlands.
69
Audit B, Vaillant C, Arnéodo A, d'Aubenton-Carafa Y, Thermes C. Wavelet Analysis of DNA Bending Profiles reveals Structural Constraints on the Evolution of Genomic Sequences. J Biol Phys 2004; 30:33-81. [PMID: 23345861 PMCID: PMC3456503 DOI: 10.1023/b:jobp.0000016438.86794.8e] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.3] [Indexed: 11/12/2022] Open
Abstract
Analyses of genomic DNA sequences have shown in previous works that base pairs are correlated at large distances with scale-invariant statistical properties. We show in the present study that these correlations between nucleotides (letters) result in fact from long-range correlations (LRC) between sequence-dependent DNA structural elements (words) involved in the packaging of DNA in chromatin. Using the wavelet transform technique, we perform a comparative analysis of the DNA text and of the corresponding bending profiles generated with curvature tables based on nucleosome positioning data. This exploration through the optics of the so-called "wavelet transform microscope" reveals a characteristic scale of 100-200 bp that separates two regimes of different LRC. We focus here on the existence of LRC in the small-scale regime (≲ 200 bp). Analysis of genomes in the three kingdoms reveals that this regime is specifically associated with the presence of nucleosomes. Indeed, small-scale LRC are observed in eukaryotic genomes and to a lesser extent in archaeal genomes, in contrast with their absence in eubacterial genomes. Similarly, this regime is observed in eukaryotic but not in bacterial viral DNA genomes. There is one exception for genomes of Poxviruses, the only animal DNA viruses that do not replicate in the cell nucleus and do not present small-scale LRC. Furthermore, no small-scale LRC are detected in the genomes of all examined RNA viruses, with one exception in the case of retroviruses. Altogether, these results strongly suggest that small-scale LRC are a signature of the nucleosomal structure. Finally, we discuss possible interpretations of these small-scale LRC in terms of the mechanisms that govern the positioning, the stability and the dynamics of the nucleosomes along the DNA chain.
This paper is mainly devoted to a pedagogical presentation of the theoretical concepts and physical methods which are well suited to perform a statistical analysis of genomic sequences. We review the results obtained with the so-called wavelet-based multifractal analysis when investigating the DNA sequences of various organisms in the three kingdoms. Some of these results have been announced in B. Audit et al. [1, 2].
Affiliation(s)
- Benjamin Audit
- Centre de Recherche Paul Pascal, avenue Schweitzer, 33600 Pessac, France
70
Nikolaou C, Almirantis Y. Mutually symmetric and complementary triplets: differences in their use distinguish systematically between coding and non-coding genomic sequences. J Theor Biol 2003; 223:477-87. [PMID: 12875825 DOI: 10.1016/s0022-5193(03)00123-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
The general property of asymmetry in word use in meaningful texts written in a variety of languages motivates a quantification of the differences in the use of mutually symmetric triplets in genomic sequences. When this is done in the three reading frames, high values found for one of them are used as an indication that the sequence is coding for a protein. Moreover, a similar quantification of the differences in the use of complementary triplets is introduced, again with predictive power for the coding character of a sequence. This method reflects the non-equivalence between the sense and anti-sense strands of a coding segment. In both approaches, "linguistic asymmetry" in coding sequences is related to the form of the genetic code and to the bias in codon usage and amino acid use skews.
Affiliation(s)
- Christoforos Nikolaou
- National Research Center for Physical Sciences Demokritos, Institute of Biology, 15310 Athens, Greece
71
Holste D, Grosse I, Beirer S, Schieg P, Herzel H. Repeats and correlations in human DNA sequences. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2003; 67:061913. [PMID: 16241267 DOI: 10.1103/physreve.67.061913] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.2] [Received: 01/22/2003] [Indexed: 05/04/2023]
Abstract
We study the nucleotide-nucleotide mutual information function I(k) of the DNA sequences of the three completely sequenced human chromosomes 20, 21, and 22. We find in each human chromosome (i) the absence of the k=3 base pair (bp) sequence periodicity characteristic for protein coding regions, (ii) the absence of the k=10-11 bp sequence periodicity characteristic for both protein secondary structure and DNA bendability, and (iii) the presence of significant statistical dependencies at about k=135 bp and at about k=165 bp. We investigate to which degree the density and composition of interspersed repeats might explain these observed statistical patterns in all three human chromosomes. We use simple stochastic models to substitute known interspersed repeats and find by numerical studies that (iv) the presence of interspersed repeats dominates short-range correlations as measured by I(k) on the scale of several hundred base pairs in human chromosomes 20, 21, and 22. On the other hand, we find that (v) interspersed repeats contribute only weakly to long-range correlations due to the clustering of highly abundant Alu repeats.
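The nucleotide-nucleotide mutual information function I(k) measures the statistical dependence between symbols separated by k positions. A minimal sketch using the standard plug-in estimate from empirical pair frequencies is given below; the choice of base-2 logarithm, the absence of any finite-sample bias correction, and the test sequences are assumptions of this illustration.

```python
import math
import random
from collections import Counter


def mutual_information(seq: str, k: int) -> float:
    """Plug-in estimate of I(k) in bits between symbols k positions apart."""
    pairs = Counter(zip(seq, seq[k:]))
    n = sum(pairs.values())
    if n == 0:
        raise ValueError("sequence shorter than or equal to the lag k")
    left, right = Counter(), Counter()
    for (a, b), count in pairs.items():
        left[a] += count
        right[b] += count
    return sum(
        (count / n) * math.log2(count * n / (left[a] * right[b]))
        for (a, b), count in pairs.items()
    )


# A strictly periodic sequence: each symbol fully determines the one 3 steps later.
periodic = "ACG" * 100
# An independent, uniformly random sequence: I(k) should be close to zero.
rng = random.Random(3)
shuffled = "".join(rng.choice("ACGT") for _ in range(20000))

i_periodic = mutual_information(periodic, 3)
i_random = mutual_information(shuffled, 3)
print(f"periodic: {i_periodic:.3f} bits, random: {i_random:.5f} bits")
```

Plotting I(k) against k for a genomic sequence is what reveals features such as the k = 3 periodicity of coding regions or the dependencies near k = 135 and k = 165 reported here.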
Affiliation(s)
- Dirk Holste
- Department of Biology, Massachusetts Institute of Technology, Cambridge 02139, USA.
72
Hsu TH, Nyeo SL. Diffusion coefficients of two-dimensional viral DNA walks. PHYSICAL REVIEW E 2003; 67:051911. [PMID: 12786182 DOI: 10.1103/physreve.67.051911] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 11/19/2002] [Indexed: 11/07/2022]
Abstract
DNA sequences are represented as two-dimensional walkers based on groups of mapping rules for the nucleotides in the DNA sequences. Digital sequences from irrational and random numbers in base 4 are generated and their diffusion properties are then compared with those of 21 nucleotide sequences of animal and plant viruses. By defining the diffusion coefficient as a function of the number of steps taken in a walk, we show that the coefficients for the viral DNA sequences generally have maximum values considerably larger than those for the random-number sequences of the same lengths. Moreover, using the walker diagrams generated by different mapping groups, we can study the dominance of any of the nucleotide pairs (AG or CT), (AC or GT), or (AT or CG) in a DNA sequence. Other possible studies of this approach are mentioned.
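A two-dimensional DNA walk assigns each nucleotide a unit step in the plane. The sketch below uses one hypothetical mapping (A and T along the x axis, G and C along the y axis) and, as a stand-in for the diffusion coefficient, the squared displacement divided by the number of steps. The abstract specifies neither the mapping groups nor the exact definition of the coefficient, so both are assumptions.

```python
# Hypothetical mapping rule: one of several possible assignments.
STEPS = {"A": (1, 0), "T": (-1, 0), "G": (0, 1), "C": (0, -1)}


def walk(seq: str):
    """Return the list of visited lattice points, starting at the origin."""
    x = y = 0
    path = [(0, 0)]
    for base in seq:
        dx, dy = STEPS.get(base, (0, 0))  # unknown symbols do not move the walker
        x += dx
        y += dy
        path.append((x, y))
    return path


def squared_displacement_per_step(path):
    """Assumed proxy for a step-dependent diffusion coefficient: r^2(n) / n."""
    return [(x * x + y * y) / n for n, (x, y) in enumerate(path) if n > 0]


path = walk("AAGC")
coefficients = squared_displacement_per_step(path)
print(path[-1], coefficients)
```

Swapping the pairs assigned to each axis changes which nucleotide pair dominates the drift of the walker, which is the idea behind comparing different mapping groups.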
Affiliation(s)
- Tai-Hsin Hsu
- Department of Physics, National Cheng Kung University, Tainan, Taiwan 701, Republic of China
73
Abstract
The base distributions in coding DNA sequences (CDS) are investigated. We explore the scaling properties of the 4-dimensional directed random walk and compare them with those of the DNA sequences. Inferences from these observations are, however, contradicted by an alternate analysis using factorial moments. To resolve this conflict we look directly at the nucleotide base distributions. In all cases the base distributions change from Gaussian to non-Gaussian as the scale size is increased. The CDS, therefore, have nucleotide distributions different from the random.
Affiliation(s)
- A Som
- Department of Theoretical Physics, Indian Association for the Cultivation of Science, Jadavpur, Calcutta 700 032, India.
74
Som A, Sahoo S, Mukhopadhyay I, Chakrabarti J, Chaudhury R. Scaling violations in coding DNA. EUROPHYSICS LETTERS (EPL) 2003; 62:271-277. [DOI: 10.1209/epl/i2003-00341-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 07/19/2023]
75
Abstract
The isochore concept in the human genome sequence was challenged in an analysis by the International Human Genome Sequencing Consortium (IHGSC). We argue here that a statement in the IHGSC's analysis concerning the existence of isochores is misleading, because the homogeneity was not examined at a large enough length scale and consequently an inappropriate statistical test was applied. A test of the existence of isochores should be equivalent to a test of homogeneity or equality of windowed GC%. The statistical test applied in the IHGSC's analysis, the binomial test, is a test of whether individual bases are independent and identically-distributed (iid). For testing the existence of isochores, or homogeneity in windowed GC%, we propose to use another statistical test: the analysis of variance (ANOVA). It can be shown that DNA sequences that are rejected by the binomial test may not be rejected by the ANOVA test.
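Testing homogeneity of windowed GC% with an analysis of variance can be sketched as follows. The sequence is cut into windows (the groups), each window into subwindows (the observations), and a one-way ANOVA F statistic compares between-window and within-window variance of GC content. Window sizes, the synthetic sequences, and the function names are assumptions; converting F to a p-value is left out.

```python
import math
import random


def gc_fraction(segment: str) -> float:
    return sum(base in "GC" for base in segment) / len(segment)


def windowed_gc(seq: str, size: int):
    """GC fraction in consecutive non-overlapping subwindows of `size` bases."""
    return [gc_fraction(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        return math.inf
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def random_dna(length: int, gc: float, rng: random.Random) -> str:
    weights = [gc / 2, gc / 2, (1 - gc) / 2, (1 - gc) / 2]
    return "".join(rng.choices("GCAT", weights=weights, k=length))


rng = random.Random(4)
# Two 5000-bp windows, each sampled in 100-bp subwindows.
homogeneous = [windowed_gc(random_dna(5000, 0.5, rng), 100) for _ in range(2)]
heterogeneous = [windowed_gc(random_dna(5000, gc, rng), 100) for gc in (0.4, 0.6)]

f_hom = anova_f(homogeneous)
f_het = anova_f(heterogeneous)
print(f"F homogeneous: {f_hom:.2f}, F heterogeneous: {f_het:.2f}")
```

The point made in the abstract is that this test operates on window means, so a sequence whose individual bases fail an independence test can still pass it.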
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA.
|
76
|
Fukushima A, Ikemura T, Kanaya S. Comparative Genome Analysis Focused on Periodicity from Prokaryote to Higher Eukaryote Genomes Based on Power Spectrum. JOURNAL OF COMPUTER CHEMISTRY-JAPAN 2003. [DOI: 10.2477/jccj.2.95] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 01/23/2023]
|
77
|
Abstract
Three statistical/mathematical analyses are carried out on isochore sequences: spectral analysis, analysis of variance, and segmentation analysis. Spectral analysis shows that there are GC content fluctuations at different length scales in isochore sequences. The analysis of variance shows that the null hypothesis (the mean value of a group of GC contents remains the same along the sequence) may or may not be rejected for an isochore sequence, depending on the subwindow sizes at which GC contents are sampled, and the window size within which group members are defined. The segmentation analysis shows that there are stronger indications of GC content changes at isochore borders than within an isochore. These analyses support the notion of isochore sequences, but reject the assumption that isochore sequences are homogeneous at the base level. An isochore sequence may pass a homogeneity test when GC content fluctuations at smaller length scales are ignored or averaged out.
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore-LIJ Research Institute, 350 Community Drive, Manhasset, NY 11030, USA.
|
78
|
Li W, Bernaola-Galván P, Haghighi F, Grosse I. Applications of recursive segmentation to the analysis of DNA sequences. COMPUTERS & CHEMISTRY 2002; 26:491-510. [PMID: 12144178 DOI: 10.1016/s0097-8485(02)00010-4] [Citation(s) in RCA: 64] [Impact Index Per Article: 2.9] [Indexed: 11/23/2022]
Abstract
Recursive segmentation is a procedure that partitions a DNA sequence into domains with a homogeneous composition of the four nucleotides A, C, G and T. This procedure can also be applied to any sequence converted from a DNA sequence, such as to a binary strong(G + C)/weak(A + T) sequence, to a binary sequence indicating the presence or absence of the dinucleotide CpG, or to a sequence indicating both the base and the codon position information. We apply various conversion schemes in order to address the following five DNA sequence analysis problems: isochore mapping, CpG island detection, locating the origin and terminus of replication in bacterial genomes, finding complex repeats in telomere sequences, and delineating coding and noncoding regions. We find that the recursive segmentation procedure can successfully detect isochore borders, CpG islands, and the origin and terminus of replication, but it needs improvement for detecting complex repeats as well as borders between coding and noncoding regions.
Affiliation(s)
- Wentian Li
- Center for Genomics and Human Genetics, North Shore-LIJ Research Institute, Manhasset, NY 11030, USA.
|
79
|
Wang J, Zhang Q, Ren K, She Z. Multi-scaling hierarchical structure analysis on the sequence of E. coli complete genome. ACTA ACUST UNITED AC 2001. [DOI: 10.1007/bf02901913] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
|
80
|
Abstract
The concept of homogeneity of G+C content is always relative and subjective. This point is emphasized and quantified in this paper using a simple example of one sequence segmented into two subsequences. Whether the sequence is homogeneous or not can be answered by whether the two-subsequence model describes the DNA sequence better than the one-sequence model. There are at least three equivalent ways of looking at the 1-to-2 segmentation: Jensen-Shannon divergence measure, log likelihood ratio test, and model selection using Bayesian information criterion. Once a criterion is chosen, a DNA sequence can be recursively segmented into multiple domains. We use one subjective criterion called segmentation strength based on the Bayesian information criterion. Whether or not a sequence is homogeneous and how many domains it has depend on this criterion. We compare six different genome sequences (yeast S. cerevisiae chromosome III and IV, bacterium M. pneumoniae, human major histocompatibility complex sequence, longest contigs in human chromosome 21 and 22) by recursive segmentations at different strength criteria. Results by recursive segmentation confirm that yeast chromosome IV is more homogeneous than yeast chromosome III, human chromosome 21 is more homogeneous than human chromosome 22, and bacterial genomes may not be homogeneous due to short segments with distinct base compositions. The recursive segmentation also provides a quantitative criterion for identifying isochores in human sequences. Some features of our recursive segmentation, such as the possibility of delineating domain borders accurately, are superior to those of the moving-window approach commonly used in such analyses.
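The 1-to-2 segmentation step described in this abstract can be sketched as a search for the cut point that maximizes the Jensen-Shannon divergence between the compositions of the left and right subsequences. The Python below is a minimal sketch under that reading; the function names are hypothetical, and the paper's segmentation-strength threshold and recursion are not reproduced.

```python
import math
from collections import Counter

def entropy(counts, n):
    """Shannon entropy in bits of a composition with total count n."""
    return -sum((c / n) * math.log2(c / n)
                for c in counts.values() if c > 0)

def best_split(seq):
    """Return (cut position, Jensen-Shannon divergence) for the single
    cut that maximizes the divergence between left and right parts."""
    n = len(seq)
    total = Counter(seq)
    h_total = entropy(total, n)
    left = Counter()
    best_pos, best_js = None, -1.0
    for i in range(1, n):
        left[seq[i - 1]] += 1
        right = total - left  # Counter subtraction keeps positive counts only
        js = (h_total
              - (i / n) * entropy(left, i)
              - ((n - i) / n) * entropy(right, n - i))
        if js > best_js:
            best_pos, best_js = i, js
    return best_pos, best_js
```

A recursive segmentation would accept the cut only if the divergence passes a chosen criterion, and would then call `best_split` again on each half. The choice of that criterion is what the abstract calls subjective.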
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, The Rockefeller University, 1230 York Avenue, Box 192, New York, NY 10021, USA.
|
81
|
Abstract
In a DNA sequence that exhibits long-range correlations, standard deviations among the GC levels of its segments can be up to an order of magnitude higher than in a sequence consisting of independent, identically distributed nucleotides. Conversely, plots of inter-segment standard deviations vs. segment length reveal quantitative information about the correlations present in a sequence. We present and discuss formulae that relate long-range (power-law) correlations between the nucleotides of a sequence to the expected standard deviations of the GC levels of its segments, and to the correlations between them.
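To make the comparison in this abstract concrete, the sketch below computes the observed standard deviation of GC levels among fixed-length segments and the value expected for independent, identically distributed nucleotides, which for GC probability p and segment length n is sqrt(p(1-p)/n). The function names are illustrative assumptions; the paper's formulae for power-law correlations are not reproduced here.

```python
import math
import statistics

def segment_gc_levels(seq, seg_len):
    """GC fraction of each non-overlapping segment of length seg_len."""
    return [sum(ch in "GC" for ch in seq[i:i + seg_len]) / seg_len
            for i in range(0, len(seq) - seg_len + 1, seg_len)]

def observed_sd(seq, seg_len):
    """Population standard deviation of segment GC levels."""
    return statistics.pstdev(segment_gc_levels(seq, seg_len))

def iid_sd(p, seg_len):
    """Expected SD of segment GC level for i.i.d. bases with GC probability p."""
    return math.sqrt(p * (1 - p) / seg_len)
```

Plotting `observed_sd` against `seg_len` alongside `iid_sd` gives the kind of standard deviation versus segment length plot the abstract refers to. A slower decay than the i.i.d. curve points to correlations between nucleotides.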
Affiliation(s)
- O Clay
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Naples, Italy.
|
82
|
Clay O, Carels N, Douady C, Macaya G, Bernardi G. Compositional heterogeneity within and among isochores in mammalian genomes. I. CsCl and sequence analyses. Gene 2001; 276:15-24. [PMID: 11591467 DOI: 10.1016/s0378-1119(01)00667-9] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.0] [Indexed: 11/30/2022]
Abstract
GC level distributions of a species' nuclear genome, or of its compositional fractions, encode key information on structural and functional properties of the genome and on its evolution. They can be calculated either from absorbance profiles of the DNA in CsCl density gradients at sedimentation equilibrium, or by scanning long contigs of largely sequenced genomes. In the present study, we address the quantitative characterization of the compositional heterogeneity of genomes, as measured by the GC distributions of fixed-length fragments. Special attention is given to mammalian genomes, since their compartmentalization into isochores implies two levels of heterogeneity, intra-isochore (local) and inter-isochore (global). This partitioning is a natural one, since large-scale compositional properties vary much more among isochores than within them. Intra-isochore GC distributions become roughly Gaussian for long fragments, and their standard deviations decrease only slowly with increasing fragment length, unlike random sequences. This effect can be explained by 'long-range' correlations, often overlooked, that are present along isochores.
Affiliation(s)
- O Clay
- Laboratory of Molecular Evolution, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Naples, Italy
|
83
|
Almirantis Y, Provata A. An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome. Bioessays 2001; 23:647-56. [PMID: 11462218 DOI: 10.1002/bies.1090] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
We present a model for genome evolution, comprising biologically plausible events such as transpositions inside the genome and insertions of exogenous sequences. This model attempts to formulate a minimal proposition accounting for key statistical properties of genomes, avoiding, as far as possible, unsupportable hypotheses for the remote evolutionary past. The statistical properties that are observed in genomic sequences and are reproduced by the proposed model are: (i) deviations from randomness at different length scales, measured by suitable algorithms, (ii) a special form of size distribution (power law distribution) characterising different levels of genome organisation in the non-coding, and (iii) extensive resemblance in the alternation of coding and non-coding regions at several length scales (self-similarity) in long genomic sequences of higher eukaryotes.
Affiliation(s)
- Y Almirantis
- Institute of Biology, National Research Centre for Physical Sciences "Demokritos", Athens, Greece.
|
84
|
Li W. New stopping criteria for segmenting DNA sequences. PHYSICAL REVIEW LETTERS 2001; 86:5815-5818. [PMID: 11415365 DOI: 10.1103/physrevlett.86.5815] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.8] [Received: 09/15/2000] [Indexed: 05/23/2023]
Abstract
We propose a solution to the problem of choosing a stopping criterion when segmenting inhomogeneous DNA sequences with complex statistical patterns. This new stopping criterion is based on the Bayesian information criterion in the model selection framework. When this criterion is applied to a telomere of S. cerevisiae and the complete sequence of E. coli, borders of biologically meaningful units were identified, and a more reasonable number of domains was obtained. We also introduce a measure called segmentation strength which can be used to control the delineation of large domains. The relationship between the average domain size and the threshold of segmentation strength is determined for several genome sequences.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Box 192, Rockefeller University, 1230 York Avenue, New York, New York 10021, USA
|
85
|
Oiwa NN, Goldman C. Phylogenetic study of the spatial distribution of protein-coding and control segments in DNA chains. PHYSICAL REVIEW LETTERS 2000; 85:2396-2399. [PMID: 10978019 DOI: 10.1103/physrevlett.85.2396] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/1999] [Indexed: 05/23/2023]
Abstract
We examine the size and spatial distributions of the protein-coding and control segments of genes in DNA nucleotide sequences from GenBank. Phylogenetic analysis of these data suggests the presence of spatial order in sequences of higher organisms, irrespective of the nature of nucleotide base content. This is characterized by defined two-point correlation functions and measured by fractal dimensions and singularity spectrum.
Affiliation(s)
- N N Oiwa
- Instituto de Física, Universidade de São Paulo, CP 66318, 05315-970, São Paulo, Brazil
|
86
|
Dokholyan NV, Buldyrev SV, Havlin S, Stanley HE. Distributions of dimeric tandem repeats in non-coding and coding DNA sequences. J Theor Biol 2000; 202:273-82. [PMID: 10666360 DOI: 10.1006/jtbi.1999.1052] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.3] [Indexed: 11/22/2022]
Abstract
We study the length distribution functions for the 16 possible distinct dimeric tandem repeats in DNA sequences of diverse taxonomic partitions of GenBank (known human and mouse genomes, and complete genomes of Caenorhabditis elegans and yeast). For coding DNA, we find that all 16 distribution functions are exponential. For non-coding DNA, the distribution functions for most of the dimeric repeats have surprisingly long tails, that fit a power-law function. We hypothesize that: (i) the exponential distributions of dimeric repeats in protein coding sequences indicate strong evolutionary pressure against tandem repeat expansion in coding DNA sequences; and (ii) long tails in the distributions of dimers in non-coding DNA may be a result of various mutational mechanisms. These long, non-exponential tails in the distribution of dimeric repeats in non-coding DNA are hypothesized to be due to the higher tolerance of non-coding DNA to mutations. By comparing genomes of various phylogenetic types of organisms, we find that the shapes of the distributions are not universal, but rather depend on the specific class of species and the type of a dimer.
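The length distribution studied in this abstract can be approximated by counting maximal runs of a given dimer. The Python sketch below is one simple way to build such a histogram. The function name and the non-overlapping, left-to-right scanning rule are assumptions for illustration, and the paper's exact definition of a repeat may differ.

```python
from collections import Counter

def repeat_lengths(seq, dimer):
    """Histogram mapping run length (number of consecutive copies of
    `dimer`) to the number of such runs found scanning left to right."""
    hist = Counter()
    i = 0
    while i < len(seq) - 1:
        run = 0
        # Extend the run while the next two bases match the dimer.
        while seq[i:i + 2] == dimer:
            run += 1
            i += 2
        if run:
            hist[run] += 1
        else:
            i += 1
    return hist
```

On a log-linear plot an exponential distribution of run lengths appears as a straight line, whereas a power-law tail is straight on a log-log plot. This is the distinction the abstract draws between coding and non-coding DNA.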
Affiliation(s)
- N V Dokholyan
- Center for Polymer Studies, Boston University, Boston, MA 02215, USA.
|
87
|
Abstract
We present a new approach to DNA segmentation into compositionally homogeneous blocks. The Bayesian estimator, which is applicable for both short and long segments, is used to obtain the measure of homogeneity. An exact optimal segmentation is found via the dynamic programming technique. After completion of the segmentation procedure, the sequence composition on different scales can be analyzed with filtration of boundaries via the partition function approach.
Affiliation(s)
- V E Ramensky
- Engelhardt Institute of Molecular Biology, Vavilova, Russia.
|
88
|
Lobzin VV, Chechetkin VR. Order and correlations in genomic DNA sequences. The spectral approach. ACTA ACUST UNITED AC 2000. [DOI: 10.3367/ufnr.0170.200001c.0057] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.0] [Indexed: 11/01/2022]
|
89
|
de Sousa Vieira M. Statistics of DNA sequences: a low-frequency analysis. PHYSICAL REVIEW. E, STATISTICAL PHYSICS, PLASMAS, FLUIDS, AND RELATED INTERDISCIPLINARY TOPICS 1999; 60:5932-7. [PMID: 11970495 DOI: 10.1103/physreve.60.5932] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.1] [Received: 04/12/1999] [Revised: 06/07/1999] [Indexed: 04/18/2023]
Abstract
We study statistical properties of DNA chains of thirteen microbial complete genomes. We find that the power spectrum of several of the sequences studied flattens off in the low frequency limit. This implies the correlation length in those sequences is much smaller than the entire DNA chain. Consequently, in contradiction with previous studies, we show that the fractal behavior of DNA chains does not always prevail through the entire DNA molecule.
Affiliation(s)
- M de Sousa Vieira
- Department of Biochemistry and Biophysics, University of California, San Francisco, California 94143-0448, USA.
|
90
|
Tino P. Spatial representation of symbolic sequences through iterative function systems. ACTA ACUST UNITED AC 1999. [DOI: 10.1109/3468.769757] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.2] [Indexed: 11/10/2022]
|
91
|
Li W. Statistical properties of open reading frames in complete genome sequences. COMPUTERS & CHEMISTRY 1999; 23:283-301. [PMID: 10404621 DOI: 10.1016/s0097-8485(99)00014-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
Some statistical properties of open reading frames in all currently available complete genome sequences are analyzed (seventeen prokaryotic genomes, and 16 chromosome sequences from the yeast genome). The size distribution of open reading frames is characterized by various techniques, such as quantile tables, QQ-plots, rank-size plots (Zipf's plots), and spatial densities. The issue of the influence of CG% on the size distribution is addressed. When yeast chromosomes are compared with archaeal and eubacterial genomes, they tend to have more long open reading frames. There is little or no evidence to reject the null hypothesis that open reading frames on six different reading frames and two strands distribute similarly. A topic of current interest, the base composition asymmetry in open reading frames between the two strands, is studied using regression analysis. The base composition asymmetry at three codon positions is analyzed separately. It was shown in these genome sequences that the first codon position is G- and A-rich (i.e. purine-rich); there is a co-existence of A- and T-rich branches at the second codon position; and the third codon position is weakly T-rich.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10021, USA.
|
92
|
Frontali C, Pizzi E. Similarity in oligonucleotide usage in introns and intergenic regions contributes to long-range correlation in the Caenorhabditis elegans genome. Gene 1999; 232:87-95. [PMID: 10333525 DOI: 10.1016/s0378-1119(99)00111-0] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.6] [Indexed: 10/18/2022]
Abstract
A method is presented which allows detection of a sequence correlation effect not related to patchiness in base composition or to preferences in codon usage. Recurrence plots providing local views of oligonucleotide recurrence regimen show that introns and intergenic regions are often characterised by a highly recurrent use of oligonucleotides. By window analysis it is possible to score a long sequence for the recurrence of a given subset of oligos while filtering away the effects of short-range correlations. Long-range exploration of chromosome III from Caenorhabditis elegans reveals that consistent use of recurrent oligonucleotides in introns and intergenic regions generates a correlation effect that extends over several megabases.
Affiliation(s)
- C Frontali
- Laboratory of Cell Biology, Istituto Superiore di Sanità, Viale Regina Elena 299, 00161, Rome, Italy.
|
93
|
Tino P, Koteles M. Extracting finite-state representations from recurrent neural networks trained on chaotic symbolic sequences. ACTA ACUST UNITED AC 1999; 10:284-302. [DOI: 10.1109/72.750555] [Citation(s) in RCA: 23] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
|
94
|
Li W, Stolovitzky G, Bernaola-Galván P, Oliver JL. Compositional heterogeneity within, and uniformity between, DNA sequences of yeast chromosomes. Genome Res 1998; 8:916-28. [PMID: 9750191 DOI: 10.1101/gr.8.9.916] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.1] [Indexed: 11/24/2022]
Abstract
The heterogeneity within, and similarities between, yeast chromosomes are studied. For the former, we show by the size distribution of domains, coding density, size distribution of open reading frames, spatial power spectra, and deviation from binomial distribution for C + G% in large moving windows that there is a strong deviation of the yeast sequences from random sequences. For the latter, not only do we graphically illustrate the similarity for the above mentioned statistics, but we also carry out a rigorous analysis of variance (ANOVA) test. The hypothesis that all yeast chromosomes are similar cannot be rejected by this test. We examine the two possible explanations of this interchromosomal uniformity: a common origin, such as genome-wide duplication (polyploidization), and a concerted evolutionary process.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Rockefeller University, New York, New York 10021 USA.
|