1
Rosandić M, Vlahović I, Pilaš I, Glunčić M, Paar V. An Explanation of Exceptions from Chargaff's Second Parity Rule/Strand Symmetry of DNA Molecules. Genes (Basel) 2022; 13:1929. [PMID: 36360166 PMCID: PMC9689577 DOI: 10.3390/genes13111929] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/18/2022] [Revised: 10/12/2022] [Accepted: 10/17/2022] [Indexed: 11/04/2022] Open
Abstract
In this article, we show that mono/oligonucleotide quadruplets, as basic structures of DNA, along with our classification of trinucleotides, disclose an organization of genomes based on purine-pyrimidine symmetry. Moreover, the structure and stability of DNA are influenced by the Watson-Crick pairing and the natural law of DNA creation and conservation, according to which the same mono- or oligonucleotide insertion must be inserted simultaneously into both strands of DNA. Taken together, they lead to quadruplets with central mirror symmetry and bidirectional DNA strand orientation and are incorporated into Chargaff's second parity rule (CSPR). Performing our quadruplet frequency analysis of all human chromosomes and of Neuroblastoma BreakPoint Family (NBPF) genes, which code Olduvai protein domains in the human genome, we show that the coding part of DNA violates CSPR. This may shed new light and give rise to a novel hypothesis on DNA creation and its evolution. In this framework, the logarithmic relationship between oligonucleotide order and minimal DNA sequence length, to establish the validity of CSPR, automatically follows from the quadruplet structure of the genomic sequence. The problem of the violation of CSPR in rare symbionts is discussed.
Affiliation(s)
- Marija Rosandić
- University Hospital Centre Zagreb (Ret.), 10000 Zagreb, Croatia
- Croatian Academy of Sciences and Arts, 10000 Zagreb, Croatia
- Ines Vlahović
- Faculty of Science, Algebra University College, 10000 Zagreb, Croatia
- Ivan Pilaš
- Forest Research Institute, 10450 Jastrebarsko, Croatia
- Matko Glunčić
- Physics Department, Faculty of Science, University of Zagreb, 10000 Zagreb, Croatia
- Vladimir Paar
- Croatian Academy of Sciences and Arts, 10000 Zagreb, Croatia
- Physics Department, Faculty of Science, University of Zagreb, 10000 Zagreb, Croatia
2
Affinity and Correlation in DNA. J 2022. [DOI: 10.3390/j5020016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
Abstract
A statistical analysis of important DNA sequences and related proteins has been performed to study the relationships between monomers, and some general considerations about these macromolecules can be provided from the results. First, the most important relationship between sites in all the DNA sequences examined is that between two consecutive base pairs. This is an indication of an energetic stabilization due to the stacking interaction of these couples of base pairs. Secondly, the difference between human chromosome sequences and their coding parts is relevant both in the relationships between sites and in some specific compositional rules, such as the second Chargaff rule. Third, the evidence of the relationship in two successive triplets of DNA coding sequences generates a relationship between two successive amino acids in the proteins. This is obviously impossible if all the relationships between the sites are statistical evidence and do not involve causes; therefore, in this article, due to stacking interactions and this relationship in coding sequences, we will divide the concept of the relationship between sites into two concepts: affinity and correlation, the first with physical causes and the second without. Finally, from the statistical analyses carried out, it will emerge that the human genome is uniform, with the only significant exception being the Y chromosome.
3
Neutralism versus selectionism: Chargaff's second parity rule, revisited. Genetica 2021; 149:81-88. [PMID: 33880685 PMCID: PMC8057000 DOI: 10.1007/s10709-021-00119-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/04/2021] [Accepted: 04/09/2021] [Indexed: 11/03/2022]
Abstract
Of Chargaff's four "rules" on DNA base frequencies, the functional interpretation of his second parity rule (PR2) is the most contentious. Thermophile base compositions (GC%) were taken by Galtier and Lobry (1997) as favoring Sueoka's neutral PR2 hypothesis over Forsdyke's selective PR2 hypothesis, namely that mutations improving local within-species recombination efficiency had generated a genome-wide potential for the strands of duplex DNA to separate and initiate recombination through the "kissing" of the tips of stem-loops. However, following Chargaff's GC rule, base composition mainly reflects a species-specific, genome-wide, evolutionary pressure. GC% could not have consistently followed the dictates of temperature, since it plays fundamental roles in both sustaining species integrity and, through primarily neutral genome-wide mutation, fostering speciation. Evidence for a local within-species recombination-initiating role of base order was obtained with a novel technology that masked the contribution of base composition to nucleic acid folding energy. Forsdyke's results were consistent with his PR2 hypothesis, appeared to resolve some root problems in biology and provided a theoretical underpinning for alignment-free taxonomic analyses using relative oligonucleotide frequencies (k-mer analysis). Moreover, consistent with Chargaff's cluster rule, discovery of the thermoadaptive role of the "purine-loading" of open reading frames made less tenable the Galtier-Lobry anti-selectionist arguments.
4
Revisiting the Relationships Between Genomic G + C Content, RNA Secondary Structures, and Optimal Growth Temperature. J Mol Evol 2020; 89:165-171. [PMID: 33216148 DOI: 10.1007/s00239-020-09974-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/16/2020] [Accepted: 11/09/2020] [Indexed: 10/23/2022]
Abstract
Over twenty years ago Galtier and Lobry published a manuscript entitled "Relationships between Genomic G + C Content, RNA Secondary Structure, and Optimal Growth Temperature" in the Journal of Molecular Evolution that showcased the lack of a relationship between genomic G + C content and optimal growth temperature (OGT) in a set of about 200 prokaryotes. Galtier and Lobry also assessed the relationship between RNA secondary structures (rRNA stems, tRNAs) and OGT, and in this case a clear relationship emerged. Increasing structured RNA G + C content (particularly in regions that are double-stranded) correlates with increased OGT. Both of these fundamental relationships have withstood the test of many additional sequences and spawned a variety of different applications that include prediction of OGT from rRNA sequence and computational ncRNA identification approaches. In this work, I present the motivation behind Galtier and Lobry's original paper and the larger questions addressed by the work, how these questions have evolved over the last two decades, and the impact of Galtier and Lobry's manuscript in fields beyond these questions.
5
Cristadoro G, Degli Esposti M, Altmann EG. The common origin of symmetry and structure in genetic sequences. Sci Rep 2018; 8:15817. [PMID: 30361485 PMCID: PMC6202410 DOI: 10.1038/s41598-018-34136-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 04/17/2018] [Accepted: 10/09/2018] [Indexed: 12/20/2022] Open
Abstract
Biologists have long sought a way to explain how statistical properties of genetic sequences emerged and are maintained through evolution. On the one hand, non-random structures at different scales indicate a complex genome organisation. On the other hand, single-strand symmetry has been scrutinised using neutral models in which correlations are not considered or irrelevant, contrary to empirical evidence. Different studies investigated these two statistical features separately, reaching minimal consensus despite sustained efforts. Here we unravel previously unknown symmetries in genetic sequences, which are organized hierarchically through scales in which non-random structures are known to be present. These observations are confirmed through the statistical analysis of the human genome and explained through a simple domain model. These results suggest that domain models which account for the cumulative action of mobile elements can explain simultaneously non-random structures and symmetries in genetic sequences.
Affiliation(s)
- Giampaolo Cristadoro
- Dipartimento di Matematica e Applicazioni, Università di Milano-Bicocca, 20125, Milano, Italy.
- Eduardo G Altmann
- School of Mathematics and Statistics, University of Sydney, Sydney, 2006, NSW, Australia
6
Petoukhov SV. Genetic coding and united-hypercomplex systems in the models of algebraic biology. Biosystems 2017; 158:31-46. [DOI: 10.1016/j.biosystems.2017.05.002] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 03/28/2017] [Revised: 05/10/2017] [Accepted: 05/10/2017] [Indexed: 11/26/2022]
7
The Matrix Method of Representation, Analysis and Classification of Long Genetic Sequences. INFORMATION 2017. [DOI: 10.3390/info8010012] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/16/2022] Open
8
Apostolou-Karampelis K, Nikolaou C, Almirantis Y. A novel skew analysis reveals substitution asymmetries linked to genetic code GC-biases and PolIII α-subunit isoforms. DNA Res 2016; 23:353-63. [PMID: 27345720 PMCID: PMC4991834 DOI: 10.1093/dnares/dsw021] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/11/2016] [Accepted: 05/09/2016] [Indexed: 11/30/2022] Open
Abstract
Strand biases reflect deviations from a null expectation of DNA evolution that assumes strand-symmetric substitution rates. Here, we present strong evidence that nearest-neighbour preferences are a strand-biased feature of bacterial genomes, indicating neighbour-dependent substitution asymmetries. To detect such asymmetries we introduce an alignment free index (relative abundance skews). The profiles of relative abundance skews along coding sequences can trace the phylogenetic relations of bacteria, suggesting that the patterns of neighbour-dependent substitution strand-biases are not common among different lineages, but are rather species-specific. Analysis of neighbour-dependent and codon-site skews sheds light on the origins of substitution asymmetries. Via a simple model we argue that the structure of the genetic code imposes position-dependent substitution strand-biases along coding sequences, as a response to GC mutation pressure. Thus, the organization of the genetic code per se can lead to an uneven distribution of nucleotides among different codon sites, even when requirements for specific codons and amino-acids are not accounted for. Moreover, our results suggest that strand-biases in replication fidelity of PolIII α-subunit induce substitution asymmetries, both neighbour-dependent and independent, on a genome scale. The role of DNA repair systems, such as transcription-coupled repair, is also considered.
Affiliation(s)
- Christoforos Nikolaou
- Computational Genomics Group, Department of Biology, University of Crete, 71409 Heraklion, Greece
- Yannis Almirantis
- Institute of Biosciences and Applications, National Center for Scientific Research "Demokritos", 15310 Athens, Greece
9
A stationary distribution associated to a set of laws whose initial states are grouped into classes. An application in genomics. J Appl Probab 2016. [DOI: 10.1017/jpr.2016.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
Let I be a finite set and S be a nonempty strict subset of I which is partitioned into classes, and let C(s) be the class containing s ∈ S. Let (P_s : s ∈ S) be a family of distributions on I^ℕ, where each P_s applies to sequences starting with the symbol s. To this family, we associate a class of distributions P(π) on I^ℕ which depends on a probability vector π. Our main results assume that, for each s ∈ S, P_s regenerates with distribution P_{s'} when it encounters s' ∈ S ∖ C(s). From semiregenerative theory, we determine a simple condition on π for P(π) to be time stationary. We give a similar result for the following more complex model. Once a symbol s' ∈ S ∖ C(s) has been encountered, there is a decision to be made: either a new region of type C(s') governed by P_{s'} starts or the region continues to be a C(s) region. This decision is modeled as a random event and its probability depends on s and s'. The aim in studying these kinds of models is to attain a deeper statistical understanding of bacterial DNA sequences. Here I is the set of codons and the classes (C(s) : s ∈ S) identify codons that initiate similar genomic regions. In particular, there are two classes corresponding to the start and stop codons which delimit coding and noncoding regions in bacterial DNA sequences. In addition, the random decision to continue the current region or begin a new region of a different class reflects the well-known fact that not every appearance of a start codon marks the beginning of a new coding region.
10
Rosandić M, Vlahović I, Glunčić M, Paar V. Trinucleotide's quadruplet symmetries and natural symmetry law of DNA creation ensuing Chargaff's second parity rule. J Biomol Struct Dyn 2016; 34:1383-94. [PMID: 26524490 DOI: 10.1080/07391102.2015.1080628] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 10/22/2022]
Abstract
For almost 50 years the conclusive explanation of Chargaff's second parity rule (CSPR), the equality of frequencies of nucleotides A=T and C=G or the equality of direct and reverse complement trinucleotides in the same DNA strand, has not been determined yet. Here, we relate CSPR to the interstrand mirror symmetry in 20 symbolic quadruplets of trinucleotides (direct, reverse complement, complement, and reverse) mapped to double-stranded genome. The symmetries of Q-box corresponding to quadruplets can be obtained as a consequence of Watson-Crick base pairing and CSPR together. Alternatively, assuming Natural symmetry law for DNA creation that each trinucleotide in one strand of DNA must simultaneously appear also in the opposite strand automatically leads to Q-box direct-reverse mirror symmetry which in conjunction with Watson-Crick base pairing generates CSPR. We demonstrate quadruplet's symmetries in chromosomes of wide range of organisms, from Escherichia coli to Neanderthal and human genomes, introducing novel quadruplet-frequency histograms and 3D-diagrams with combined interstrand frequencies. These "landscapes" are mutually similar in all mammals, including extinct Neanderthals, and somewhat different in most of older species. In human chromosomes 1-12, and X, Y the "landscapes" are almost identical and slightly different in the remaining smaller and telocentric chromosomes. Quadruplet frequencies could provide a new robust tool for characterization and classification of genomes and their evolutionary trajectories.
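The quantities named in this abstract can be illustrated with a short sketch. It is not the authors' pipeline: the function names (`quadruplet`, `quadruplet_classes`, `cspr_deviation`) and the normalised deviation score are assumptions introduced here for illustration. The sketch builds the four members of a trinucleotide quadruplet (direct, reverse complement, complement, reverse), confirms that the 64 trinucleotides fall into 20 such groups, and measures how far a single strand departs from equal counts of each k-mer and its reverse complement.

```python
from collections import Counter
from itertools import product

ALPHABET = set("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement every base, then reverse the word."""
    return word.translate(COMPLEMENT)[::-1]


def quadruplet(word):
    """Return (direct, reverse complement, complement, reverse)."""
    return (word, reverse_complement(word), word.translate(COMPLEMENT), word[::-1])


def quadruplet_classes(k=3):
    """Group all k-mers into classes closed under the four operations."""
    return {frozenset(quadruplet("".join(p))) for p in product("ACGT", repeat=k)}


def kmer_counts(seq, k=3):
    """Count overlapping k-mers on one strand, skipping ambiguous bases."""
    seq = seq.upper()
    words = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(w for w in words if set(w) <= ALPHABET)


def cspr_deviation(seq, k=3):
    """Hypothetical score: 0.0 when every k-mer count equals the count of
    its reverse complement on the same strand, up to 1.0 when no k-mer
    has a matching reverse complement."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    words = set(counts) | {reverse_complement(w) for w in counts}
    diff = sum(abs(counts[w] - counts[reverse_complement(w)]) for w in words)
    return diff / (2 * total)  # each mismatched pair is visited twice
```

For example, `quadruplet("AAC")` gives `("AAC", "GTT", "TTG", "CAA")`, and the toy strand `"AACGTT"` has a trinucleotide deviation of 0.0 because AAC/GTT and ACG/CGT each occur once.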
Affiliation(s)
- Marija Rosandić
- Croatian Academy of Sciences and Arts, HAZU, Bioinformatics and Biological Physics, Zrinski trg 11, 10000 Zagreb, Croatia
- Ines Vlahović
- Faculty of Science, University of Zagreb, Bijenicka 32, 10000 Zagreb, Croatia
- Matko Glunčić
- Faculty of Science, University of Zagreb, Bijenicka 32, 10000 Zagreb, Croatia
- Vladimir Paar
- Croatian Academy of Sciences and Arts, HAZU, Bioinformatics and Biological Physics, Zrinski trg 11, 10000 Zagreb, Croatia
- Faculty of Science, University of Zagreb, Bijenicka 32, 10000 Zagreb, Croatia
11
Chargaff’s Cluster Rule. Evol Bioinform Online 2016. [DOI: 10.1007/978-3-319-28755-3_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022] Open
12
Implications of human genome structural heterogeneity: functionally related genes tend to reside in organizationally similar genomic regions. BMC Genomics 2014; 15:252. [PMID: 24684786 PMCID: PMC4234528 DOI: 10.1186/1471-2164-15-252] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/15/2012] [Accepted: 03/21/2014] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND In an earlier study, we hypothesized that genomic segments with different sequence organization patterns (OPs) might display functional specificity despite their similar GC content. Here we tested this hypothesis by dividing the human genome into 100 kb segments, classifying these segments into five compositional groups according to GC content, and then characterizing each segment within the five groups by oligonucleotide counting (k-mer analysis; also referred to as compositional spectrum analysis, or CSA), to examine the distribution of sequence OPs in the segments. We performed the CSA on the entire DNA, i.e., its coding and non-coding parts, the latter being much more abundant in the genome than the former. RESULTS We identified 38 OP-type clusters of segments that differ in their compositional spectrum (CS) organization. Many of the segments that shared the same OP type were enriched with genes related to the same biological processes (developmental, signaling, etc.), components of biochemical complexes, or organelles. Thirteen OP-type clusters showed significant enrichment in genes connected to specific gene-ontology terms. Some of these clusters seemed to reflect certain events during periods of horizontal gene transfer and genome expansion, and subsequent evolution of genomic regions requiring coordinated regulation. CONCLUSIONS There may be a tendency for genes that are involved in the same biological process, complex or organelle to use the same OP, even at a distance of ~100 kb from the genes. Although the intergenic DNA is non-coding, the general pattern of sequence organization (e.g., reflected in over-represented oligonucleotide “words”) may be important and may have been protected, to some extent, in the course of evolution.
13
Rapoport AE, Trifonov EN. Compensatory nature of Chargaff’s second parity rule. J Biomol Struct Dyn 2013; 31:1324-36. [DOI: 10.1080/07391102.2012.736757] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.0] [Indexed: 10/27/2022]
14
Arakawa K, Tomita M. Measures of compositional strand bias related to replication machinery and its applications. Curr Genomics 2012; 13:4-15. [PMID: 22942671 PMCID: PMC3269016 DOI: 10.2174/138920212799034749] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Received: 07/23/2011] [Revised: 09/10/2011] [Accepted: 09/20/2011] [Indexed: 11/22/2022] Open
Abstract
The compositional asymmetry of complementary bases in nucleotide sequences implies the existence of a mutational or selectional bias in the two strands of the DNA duplex, which is commonly shaped by strand-specific mechanisms in transcription or replication. Such strand bias in genomes, frequently visualized by GC skew graphs, is used for the computational prediction of transcription start sites and replication origins, as well as for comparative evolutionary genomics studies. The use of measures of compositional strand bias in order to quantify the degree of strand asymmetry is crucial, as it is the basis for determining the applicability of compositional analysis and comparing the strength of the mutational bias in different biological machineries in various species. Here, we review the measures of strand bias that have been proposed to date, including the ∆GC skew, the B1 index, the predictability score of linear discriminant analysis for gene orientation, the signal-to-noise ratio of the oligonucleotide bias, and the GC skew index. These measures have been predominantly designed for and applied to the analysis of replication-related mutational processes in prokaryotes, but we also give research examples in eukaryotes.
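The GC skew graphs mentioned in this abstract are usually built from the windowed ratio (G − C)/(G + C). The sketch below shows that basic quantity and its running sum; it does not implement the ∆GC skew, the B1 index, or the GC skew index reviewed in the paper. The function names and the default window size are assumptions chosen for illustration.

```python
from itertools import accumulate


def gc_skew(window):
    """Return (G - C) / (G + C) for one window, or 0.0 if it has no G or C."""
    g = window.count("G")
    c = window.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def windowed_gc_skew(seq, window=1000, step=None):
    """GC skew in consecutive windows along one strand.

    `step` defaults to `window`, giving non-overlapping windows; a trailing
    fragment shorter than `window` is ignored.
    """
    step = step or window
    seq = seq.upper()
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


def cumulative_gc_skew(skews):
    """Running sum of the windowed skews."""
    return list(accumulate(skews))
```

On the toy strand `"GGGGCCCC"` with `window=4`, the windowed skews are `[1.0, -1.0]` and the cumulative skew is `[1.0, 0.0]`. In real prokaryotic genomes, the extremes of the cumulative curve are commonly read as candidate replication origin and terminus positions, subject to confirmation by other evidence.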
Affiliation(s)
- Kazuharu Arakawa
- Institute for Advanced Biosciences, Keio University, Fujisawa 252-8520, Japan
15
Frenkel S, Kirzhner V, Korol A. Organizational heterogeneity of vertebrate genomes. PLoS One 2012; 7:e32076. [PMID: 22384143 PMCID: PMC3288070 DOI: 10.1371/journal.pone.0032076] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 08/15/2011] [Accepted: 01/23/2012] [Indexed: 01/06/2023] Open
Abstract
Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as "texts" using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter--GDM) allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences.
Affiliation(s)
- Abraham Korol
- Department of Evolutionary and Environmental Biology and Institute of Evolution, University of Haifa, Mount Carmel, Haifa, Israel
16
Mahale KN, Kempraj V, Dasgupta D. Does the growth temperature of a prokaryote influence the purine content of its mRNAs? Gene 2012; 497:83-9. [PMID: 22305982 DOI: 10.1016/j.gene.2012.01.040] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 11/22/2011] [Accepted: 01/19/2012] [Indexed: 11/20/2022]
Abstract
The formation and breaking of hydrogen bonds between nucleic acid bases are dependent on temperature. The high G+C content of organisms was surmised to be an adaptation for high temperature survival because of the thermal stability of G:C pairs. However, a survey of genomic GC% and optimum growth temperature (OGT) of several prokaryotes refuted any direct relation between them. Significantly high purine (R=A or G) content in mRNAs is also seen as a selective response for survival among thermophiles. Nevertheless, the biological relevance of thermophiles loading their unstable mRNAs with excess purines (purine-loading or R-loading) is not persuasive. Here, we analysed the mRNA sequences from the genomes of 168 prokaryotes (as obtained from NCBI Genome database) with their OGTs ranging from -5 °C to 100 °C to verify the relation between R-loading and OGT. Our analysis fails to demonstrate any correlation between R-loading of the mRNA pool and OGT of a prokaryote. The percentage of purine-loaded mRNAs in prokaryotes is found to be in a rough negative correlation with the genomic GC% (r² = 0.655, slope = -1.478, P < 0.001). We conclude that genomic GC% and bias against certain combinations of nucleotides drive the mRNA-synonymous (sense) strands of DNA towards variations in R-loading.
17
Forsdyke DR, Zhang C, Wei JF. Chromosomes as interdependent accounting units: the assigned orientation of C. elegans chromosomes minimizes the total W-base Chargaff difference. J Biol Syst 2011. [DOI: 10.1142/s0218339010003202] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 01/14/2023]
Abstract
DNAs of individual chromosomes violate, albeit perhaps by only one in a thousand bases, Chargaff's second parity rule, which is that Chargaff's first parity rule for duplex DNA (A = T, G = C) applies, to a close approximation, to single stranded DNA. If the "top" strand of one chromosome has A > T and the "top" strand of another has T > A, can they complement to approach even parity (A = T)? Assignment of orientation to the six chromosomes of Caenorhabditis elegans is said to have been arbitrary and, of 2^6 (= 64) possible combinations of top (T) and bottom (B) strands, the GenBank orientation (designated "TTTTTT") is but one. Yet, for the W bases (A and T) the chromosomes in the GenBank orientation complement to reduce the Chargaff difference (A–T) to only 200 bases (i.e. only one in 323,658 bases does not have a potential Watson-Crick pairing partner). This suggests that the assignment was not arbitrary. However, the GenBank orientation for the S bases (G and C) allows an approach to even parity less well than many other orientations, the best of which is BBBBTT (indicating a disparity between the GenBank orientations of the first four autosomes and those of chromosomes V and X). Although only the euchromatic regions of Drosophila melanogaster chromosomes have been sequenced, there are orientations that allow an approach to even parity. We conclude that, with respect to their Chargaff differences, the chromosomes of C. elegans have the potential to engage in interdependent base accounting. Since this might also apply to D. melanogaster, even when heterochromatin-associated DNA rich in tandem repeats (microsatellite DNA) is excluded, then heterochromatic DNA might not normally participate in the hypothetical accounting process.
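The orientation search described in this abstract can be sketched as a brute-force enumeration. This is an illustrative reconstruction, not the authors' code: `all_orientations` and `best_orientation` are hypothetical names, and the base counts in the usage note are made up. The one biological fact it relies on is that reading the bottom strand swaps the A and T counts, so each chromosome's Chargaff difference (A − T) changes sign when its orientation is flipped.

```python
from itertools import product


def all_orientations(n_chromosomes):
    """Every assignment of top (T) or bottom (B) strand to n chromosomes."""
    return ["".join(p) for p in product("TB", repeat=n_chromosomes)]


def best_orientation(at_counts):
    """Find the orientation minimising the total |A - T| difference.

    `at_counts` holds one (A, T) pair per chromosome, counted on the top
    strand. Returns (absolute total difference, orientation string).
    Ties are broken by the alphabetically first orientation string.
    """
    best = None
    for orientation in all_orientations(len(at_counts)):
        total = 0
        for (a, t), strand in zip(at_counts, orientation):
            # The bottom strand is the reverse complement, so A and T swap.
            total += (a - t) if strand == "T" else (t - a)
        candidate = (abs(total), orientation)
        if best is None or candidate < best:
            best = candidate
    return best
```

With the made-up counts `[(10, 7), (5, 9), (8, 7)]`, the per-chromosome differences are 3, −4 and 1, which cancel exactly, so `best_orientation` returns `(0, "BBB")`. Flipping every chromosome at once only changes the sign of the total, so the 64 orientations of six chromosomes form 32 equivalent pairs.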
Affiliation(s)
- Donald R. Forsdyke
- Department of Biochemistry, Botterell Hall, Queen's University, Kingston, Ontario K7L 3N6, Canada
- Chiyu Zhang
- Institute of Life Sciences, Jiangsu University, Zhenjiang, Jiangsu 212013, China
- Ji-Fu Wei
- The Clinical Experiment Center, The First Affiliated Hospital of Nanjing Medical University, Nanjing, Jiangsu 210029, China
18
19
Chattopadhyay S, Sahoo S, Kanner WA, Chakrabarti J. Pressures in archaeal protein coding genes: a comparative study. Comp Funct Genomics 2010; 4:56-65. [PMID: 18629113 PMCID: PMC2447400 DOI: 10.1002/cfg.246] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 09/13/2002] [Accepted: 11/25/2002] [Indexed: 11/06/2022] Open
Abstract
Our studies on the bases of codons from 11 completely sequenced archaeal genomes show that, as we move from GC-rich to AT-rich protein-coding gene-containing species, the differences between G and C and between A and T, the purine load (AG content), and also the overall persistence (i.e. the tendency of a base to be followed by the same base) within codons, all increase almost simultaneously, although the extent of increase is different over the three positions within codons. These findings suggest that the deviations from the second parity rule (through the increasing differences between complementary base contents) and the increasing purine load hinder the chance of formation of the intra-strand Watson-Crick base-paired secondary structures in mRNAs (synonymous with the protein-coding genes we dealt with), thereby increasing the translational efficiency. We hypothesize that the AT-rich protein-coding gene-containing archaeal species might have better translational efficiency than their GC-rich counterparts.
Affiliation(s)
- Sujay Chattopadhyay
- Department of Theoretical Physics, Indian Association for the Cultivation of Science, Jadavpur, Calcutta 700 032, India.
20
Zhang SH, Huang YZ. Limited contribution of stem-loop potential to symmetry of single-stranded genomic DNA. Bioinformatics 2009; 26:478-85. [PMID: 20031973 DOI: 10.1093/bioinformatics/btp703] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 01/09/2023]
Abstract
MOTIVATION The phenomenon of strand symmetry, which may provide clues to genome evolution, exists in all prokaryotic and eukaryotic genomes studied. Several possible mechanisms for its origins have been proposed, including: no strand biases for mutation and selection, strand inversion and selection of stem-loop structures. However, the relative contributions of these mechanisms to strand symmetry are not clear. In this article, we studied specifically the role of stem-loop potential of single-stranded DNA in strand symmetry. RESULTS We analyzed the complete genomes of 90 prokaryotes. We found that most oligonucleotides (pentanucleotides and higher) do not have a reverse complement in close proximity in the genomic sequences. Combined with further analysis, we conclude that the contribution of the widespread stem-loop potential of single-stranded genomic DNA to the formation and maintenance of strand symmetry would be very limited, at least for higher-order oligonucleotides. Therefore, other possible causes for strand symmetry must be taken into account to a deeper degree.
Affiliation(s)
- Shang-Hong Zhang
- The Key Laboratory of Gene Engineering of Ministry of Education, and Biotechnology Research Center, Sun Yat-sen University, Guangzhou 510275, China.
21
Acquisti C, Elser JJ, Kumar S. Ecological nitrogen limitation shapes the DNA composition of plant genomes. Mol Biol Evol 2009; 26:953-6. [PMID: 19255140 DOI: 10.1093/molbev/msp038] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.7] [Indexed: 12/25/2022] Open
Abstract
Phenotypes and behaviors respond to resource constraints via adaptation, but the influence of ecological limitations on the composition of eukaryotic genomes is still unclear. We trace connections between plant ecology and genomes through their elemental composition. Inorganic sources of nitrogen (N) are severely limiting to plants in natural ecosystems. This constraint would favor the use of N-poor nucleotides in plant genomes. We show that the transcribed segments of undomesticated plant genomes are the most N poor, with genomes and proteomes bearing signatures of N limitation. Consistent with the predictions of natural selection for N conservation, the precursors of transcriptome show the greatest deviations from Chargaff's second parity rule. Furthermore, crops show higher N contents than undomesticated plants, likely due to the relaxation of natural selection owing to the use of N-rich fertilizers. These findings indicate a fundamental role of N limitation in the evolution of plant genomes, and they link the genomes with the ecosystem context within which biota evolve.
|
22
|
Ricard G, de Graaf RM, Dutilh BE, Duarte I, van Alen TA, van Hoek AH, Boxma B, van der Staay GWM, Moon-van der Staay SY, Chang WJ, Landweber LF, Hackstein JHP, Huynen MA. Macronuclear genome structure of the ciliate Nyctotherus ovalis: single-gene chromosomes and tiny introns. BMC Genomics 2008; 9:587. [PMID: 19061489 PMCID: PMC2633312 DOI: 10.1186/1471-2164-9-587] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 08/27/2008] [Accepted: 12/05/2008] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND Nyctotherus ovalis is a single-celled eukaryote that has hydrogen-producing mitochondria and lives in the hindgut of cockroaches. Like all members of the ciliate taxon, it has two types of nuclei, a micronucleus and a macronucleus. N. ovalis generates its macronuclear chromosomes by forming polytene chromosomes that subsequently develop into macronuclear chromosomes by DNA elimination and rearrangement. RESULTS We examined the structure of these gene-sized macronuclear chromosomes in N. ovalis. We determined the telomeres, subtelomeric regions, UTRs, coding regions and introns by sequencing a large set of macronuclear DNA sequences (4,242) and cDNAs (5,484) and comparing them with each other. The telomeres consist of repeats CCC(AAAACCCC)n, similar to those in spirotrichous ciliates such as Euplotes, Sterkiella (Oxytricha) and Stylonychia. Per sequenced chromosome we found evidence for either a single protein-coding gene, a single tRNA, or the complete ribosomal RNAs cluster. Hence the chromosomes appear to encode single transcripts. In the short subtelomeric regions we identified a few overrepresented motifs that could be involved in gene regulation, but there is no consensus polyadenylation site. The introns are short (21-29 nucleotides), and a significant fraction (1/3) of the tiny introns is conserved in the distantly related ciliate Paramecium tetraurelia. As has been observed in P. tetraurelia, the N. ovalis introns tend to contain in-frame stop codons or have a length that is not dividable by three. This pattern causes premature termination of mRNA translation in the event of intron retention, and potentially degradation of unspliced mRNAs by the nonsense-mediated mRNA decay pathway. CONCLUSION The combination of short leaders, tiny introns and single genes leads to very minimal macronuclear chromosomes. The smallest we identified contained only 150 nucleotides.
Affiliation(s)
- Guénola Ricard
- Centre for Molecular and Biomolecular Informatics, Nijmegen Centre for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Geert Nijmegen, the Netherlands.
|
23
|
Sorimachi K, Okayasu T. Codon evolution is governed by linear formulas. Amino Acids 2008; 34:661-8. [PMID: 18180868 DOI: 10.1007/s00726-007-0024-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 11/26/2007] [Accepted: 12/17/2007] [Indexed: 10/22/2022]
Abstract
When nucleotide (G, C, T and A) contents were plotted against each nucleotide, their relationships were clearly expressed by a linear formula, y = αx + β, in the coding and non-coding regions. This linear relationship was obtained from the complete single-stranded DNA. Similarly, nucleotide contents at all three codon positions were expressed by linear regression lines based on the content of each nucleotide. In addition, 64 codon usages were also expressed by linear formulas against nucleotide content. Thus, the nucleotide content not only in coding sequence but also in non-coding sequence can be expressed by a linear formula, y = αx + β, in 145 organisms (112 bacteria, 15 archaea and 18 eukaryotes). Based on these results, from the ratio C/T, G/T, C/A or G/A one can essentially estimate all four nucleotide contents in the complete single-stranded DNA, and the determination of any ratio of two kinds of nucleotides can essentially estimate the four nucleotide contents, the nucleotide contents at the three different codon positions and the codon distributions at 64 codons in the coding region. The maximum and minimum values of G content were approximately 0.35 and approximately 0.15, respectively, among various organisms examined. Codon evolution occurs according to linear formulas between these two values.
Affiliation(s)
- K Sorimachi
- Educational Support Center, Dokkyo Medical University, Mibu, Tochigi 321-0293, Japan.
|
24
|
Hu J, Zhao X, Yu J. Replication-associated purine asymmetry may contribute to strand-biased gene distribution. Genomics 2007; 90:186-94. [PMID: 17532183 DOI: 10.1016/j.ygeno.2007.04.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.4] [Received: 10/22/2006] [Revised: 03/09/2007] [Accepted: 04/02/2007] [Indexed: 11/19/2022]
Abstract
Among prokaryotic genomes, the distribution of genes on the leading and lagging strands of the replication fork is known to be biased. Several hypotheses explaining this strand-biased gene distribution (SGD) have been proposed, but none have been tested or supported by sufficient data analyses. In this work we have analyzed 211 prokaryotic genomes in terms of compositional strand asymmetries and the presence or absence of polC and have found that SGD correlates not only with polC, but also with purine asymmetry (PAS). Furthermore, SGD, PAS, and polC are all features associated with a group of low-GC, gram-positive bacteria (Firmicutes). We conclude that PAS is a characteristic of organisms with a heterodimeric DNA polymerase III alpha-subunit constituted by polC and dnaE, which may play a direct role in the maintenance of SGD.
Affiliation(s)
- Jianfei Hu
- College of Life Sciences, Peking University, Beijing 100871, China.
|
25
|
Evolutionary implications of inversions that have caused intra-strand parity in DNA. BMC Genomics 2007; 8:160. [PMID: 17562011 PMCID: PMC1913523 DOI: 10.1186/1471-2164-8-160] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Received: 01/25/2007] [Accepted: 06/11/2007] [Indexed: 11/22/2022] Open
Abstract
Background Chargaff's rule of DNA base composition, stating that DNA comprises equal amounts of adenine and thymine (%A = %T) and of guanine and cytosine (%C = %G), is well known because it was fundamental to the conception of the Watson-Crick model of DNA structure. His second parity rule stating that the base proportions of double-stranded DNA are also reflected in single-stranded DNA (%A = %T, %C = %G) is more obscure, likely because its biological basis and significance are still unresolved. Within each strand, the symmetry of single nucleotide composition extends even further, being demonstrated in the balance of di-, tri-, and multi-nucleotides with their respective complementary oligonucleotides. Results Here, we propose that inversions are sufficient to account for the symmetry within each single-stranded DNA. Human mitochondrial DNA does not demonstrate such intra-strand parity, and we consider how its different functional drivers may relate to our theory. This concept is supported by the recent observation that inversions occur frequently. Conclusion Along with chromosomal duplications, inversions must have been shaping the architecture of genomes since the origin of life.
|
26
|
Paz A, Mester D, Nevo E, Korol A. Looking for organization patterns of highly expressed genes: purine-pyrimidine composition of precursor mRNAs. J Mol Evol 2007; 64:248-60. [PMID: 17211550 DOI: 10.1007/s00239-006-0135-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/21/2006] [Accepted: 11/19/2006] [Indexed: 01/05/2023]
Abstract
We analyzed precursor messenger RNAs (pre-mRNAs) of 12 eukaryotic species. In each species, three groups of highly expressed genes, ribosomal proteins, heat shock proteins, and amino-acyl tRNA synthetases, were compared with a control group (randomly selected genes). The purine-pyrimidine (R-Y) composition of pre-mRNAs of the three targeted gene groups proved to differ significantly from the control. The exons of the three groups tested have higher purine contents and R-tract abundance and lower abundance of Y-tracts compared to the control (R-tract: a tract of sequential purines, Rn with n ≥ 5; Y-tract: a tract of sequential pyrimidines, Yn with n ≥ 5). In species widely employing "intron definition" in the splicing process, the Y content of introns of the three targeted groups appeared to be higher compared to the control group. Furthermore, in all examined species, the introns of the targeted genes have a lower abundance of R-tracts compared to the control. We hypothesized that the R-Y composition of the targeted gene groups contributes to high rate and efficiency of both splicing and translation, in addition to the mRNA coding role. This is presumably achieved by (1) reducing the possibility of the formation of secondary structures in the mRNA, (2) using the R-tracts and R-biased sequences as exonic splicing enhancers, (3) lowering the amount of targets for pyrimidine tract binding protein in the exons, and (4) reducing the amount of target sequences for binding of serine/arginine-rich (SR) proteins in the introns, thereby allowing SR proteins to bind to proper (exonic) targets.
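The tract definition in this abstract (runs of at least five sequential purines or pyrimidines) can be illustrated with a short sketch. This is not the authors' pipeline; the function names `find_tracts`, `r_tracts` and `y_tracts` are hypothetical, and treating U as a pyrimidine alongside C and T is an assumption made so that the same code accepts RNA.

```python
import re

def find_tracts(seq: str, bases: str, min_len: int = 5) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans of runs built only from `bases`.

    Illustrative helper (hypothetical name): a run qualifies when it is at
    least `min_len` bases long, matching the n >= 5 threshold in the abstract.
    """
    pattern = re.compile(f"[{re.escape(bases)}]{{{min_len},}}")
    return [m.span() for m in pattern.finditer(seq.upper())]

def r_tracts(seq: str, min_len: int = 5) -> list[tuple[int, int]]:
    """Purine (A/G) tracts."""
    return find_tracts(seq, "AG", min_len)

def y_tracts(seq: str, min_len: int = 5) -> list[tuple[int, int]]:
    """Pyrimidine (C/T, plus U for RNA) tracts."""
    return find_tracts(seq, "CTU", min_len)

if __name__ == "__main__":
    example = "CCAGGAGATTTTTTC"
    print(r_tracts(example))  # one purine run, AGGAGA
    print(y_tracts(example))  # one pyrimidine run, TTTTTTC
```

Counting the returned spans per kilobase of exon or intron would give a tract abundance comparable in spirit to the one discussed above, although the normalisation used in the paper is not stated in the abstract.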
Affiliation(s)
- A Paz
- Institute of Evolution, Haifa University, Mount Carmel, Haifa, 31905, Israel
|
27
|
Albrecht-Buehler G. Asymptotically increasing compliance of genomes with Chargaff's second parity rules through inversions and inverted transpositions. Proc Natl Acad Sci U S A 2006; 103:17828-33. [PMID: 17093051 PMCID: PMC1635160 DOI: 10.1073/pnas.0605553103] [Citation(s) in RCA: 64] [Impact Index Per Article: 3.6] [Indexed: 11/18/2022] Open
Abstract
Chargaff's second parity rules for mononucleotides and oligonucleotides (CIImono and CIIoligo rules) state that a sufficiently long (> 100 kb) strand of genomic DNA that contains N copies of a mono- or oligonucleotide, also contains N copies of its reverse complementary mono- or oligonucleotide on the same strand. There is very strong support in the literature for the validity of the rules in coding and noncoding regions, especially for the CIImono rule. Because the experimental support for the CIIoligo rule is much less complete, the present article, focusing on the special case of trinucleotides (triplets), examined several gigabases of genome sequences from a wide range of species and kingdoms including organelles such as mitochondria and chloroplasts. I found that all genomes, with the only exception of certain mitochondria, complied with the CIItriplet rule at a very high level of accuracy in coding and noncoding regions alike. Based on the growing evidence that genomes may contain up to millions of copies of interspersed repetitive elements, I propose in this article a quantitative formulation of the hypothesis that inversions and inverted transposition could be a major contributing if not dominant factor in the almost universal validity of the rules.
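The rule as stated here (N copies of an oligonucleotide imply N copies of its reverse complement on the same strand) lends itself to a direct check. The following is a minimal sketch, not the method used in the article: `parity_deviation` is a hypothetical summary score, defined here as the summed count mismatch between each k-mer and its reverse complement, normalised so that 0.0 means perfect compliance and 1.0 means no reverse complements are present at all.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (non-ACGT symbols are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_counts(seq: str, k: int) -> Counter:
    """Count overlapping k-mers on the given strand only."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def parity_deviation(seq: str, k: int = 3) -> float:
    """Hypothetical score: 0.0 when every k-mer count equals that of its
    reverse complement, up to 1.0 when no k-mer has its partner present."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Include partners that never occur, so each unordered pair is seen twice.
    keys = set(counts) | {reverse_complement(kmer) for kmer in counts}
    diff = sum(abs(counts[kmer] - counts[reverse_complement(kmer)]) for kmer in keys)
    return diff / (2 * total)

if __name__ == "__main__":
    print(parity_deviation("AATT", k=2))  # AA pairs with TT, AT is its own partner
    print(parity_deviation("AAAA", k=1))  # no T on this strand
```

On real data the abstract's length condition matters: the score is only expected to approach zero for strands well beyond 100 kb, and short test strings such as the ones above are purely illustrative.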
Affiliation(s)
- Guenter Albrecht-Buehler
- Department of Cell and Molecular Biology, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA.
|
28
|
Forsdyke DR. Chargaff’s Cluster Rule. Evol Bioinform Online 2006. [DOI: 10.1007/978-0-387-33419-6_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open
|
29
|
Cavalcanti ARO, Dunn DM, Weiss R, Herrick G, Landweber LF, Doak TG. Sequence features of Oxytricha trifallax (class Spirotrichea) macronuclear telomeric and subtelomeric sequences. Protist 2005; 155:311-22. [PMID: 15552058 DOI: 10.1078/1434461041844196] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/18/2022]
Abstract
We sequenced and analyzed the subtelomeric regions of 1356 macronuclear "nanochromosomes" of the spirotrichous ciliate Oxytricha trifallax. We show that the telomeres in this species have a length of 20 nt, with minor deviations; there is no correlation between telomere lengths at the two ends of the molecule. A search for open reading frames revealed that the 3' and 5' untranslated regions are short, with a median length of approximately 130 nt, and that surprisingly there are no detectable differences between sequences upstream and downstream of genes. Our results confirm a previously reported purine bias in the first approximately 80 nucleotides of the subtelomeric regions, but with this larger data set we curiously detected a 10 bp periodicity in the bias; we relate this finding to the possible regulatory and structural functions these regions must serve. Palindromic sequences in opposing subtelomeric regions, although present in most sequences, are not statistically significant.
Affiliation(s)
- Andre R O Cavalcanti
- Department of Ecology and Evolutionary Biology, Princeton University, NJ 08544, USA
|
30
|
Da Silva M, Upton C. Using purine skews to predict genes in AT-rich poxviruses. BMC Genomics 2005; 6:22. [PMID: 15720717 PMCID: PMC552303 DOI: 10.1186/1471-2164-6-22] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 09/29/2004] [Accepted: 02/18/2005] [Indexed: 11/13/2022] Open
Abstract
Background Clusters or runs of purines on the mRNA synonymous strand have been found in many different organisms including orthopoxviruses. The purine bias that is exhibited by these clusters can be observed using a purine skew and in the case of poxviruses, these skews can be used to help determine the coding strand of a particular segment of the genome. Combined with previous findings that minor ORFs have lower than average aspartate and glutamate composition and higher than average serine composition, purine content can be used to predict the likelihood of a poxvirus ORF being a "real gene". Results Using purine skews and a "quality" measure designed to incorporate previous findings about minor ORFs, we have found that in our training case (vaccinia virus strain Copenhagen), 59 of 65 minor (small and unlikely to be real genes) ORFs were correctly classified as being minor. Of the 201 major (large and likely to be real genes) vaccinia ORFs, 192 were correctly classified as being major. Performing a similar analysis with the entomopoxvirus Amsacta moorei (AMEV), it was found that 4 major ORFs were incorrectly classified as minor and 9 minor ORFs were incorrectly classified as major. The purine abundance observed for major ORFs in vaccinia virus was found to stem primarily from the first codon position with both the second and third codon positions containing roughly equal amounts of purines and pyrimidines. Conclusion Purine skews and a "quality" measure can be used to predict functional ORFs and purine skews in particular can be used to determine which of two overlapping ORFs is most likely to be the real gene if neither of the two ORFs has orthologs in other poxviruses.
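The abstract does not define how its purine skew is computed, so the sketch below uses one common convention as an assumption: a cumulative walk that steps +1 for each purine and -1 for each pyrimidine. The function name `cumulative_purine_skew` is hypothetical, and the "quality" measure mentioned above is not reproduced.

```python
from itertools import accumulate

def cumulative_purine_skew(seq: str) -> list[int]:
    """Running total of (+1 per A/G, -1 per C/T) along one strand.

    Assumed convention, not taken from the paper: ambiguous symbols such
    as N contribute 0 so they do not distort the walk.
    """
    steps = (1 if base in "AG" else -1 if base in "CT" else 0 for base in seq.upper())
    return list(accumulate(steps))

if __name__ == "__main__":
    walk = cumulative_purine_skew("AAGCT")
    print(walk)  # rises over the purines, falls over the pyrimidines
```

Under this convention a rising segment of the walk indicates a purine-rich stretch on the strand being read, which is the kind of signal the abstract describes using to infer the coding strand.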
Affiliation(s)
- Melissa Da Silva
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8W 3P6, Canada
- Chris Upton
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, V8W 3P6, Canada
|
31
|
Yagil G. The over-representation of binary DNA tracts in seven sequenced chromosomes. BMC Genomics 2004; 5:19. [PMID: 15113401 PMCID: PMC407849 DOI: 10.1186/1471-2164-5-19] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 10/18/2003] [Accepted: 03/03/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND DNA tracts composed of only two bases are possible in six combinations: A+G (purines, R), C+T (pyrimidines, Y), G+T (Keto, K), A+C (Imino, M), A+T (Weak, W) and G+C (Strong, S). It has long been known that all-pyrimidine tracts, complemented by all-purine tracts ("R.Y tracts"), are excessively present in analyzed DNA. We have previously shown that R.Y tracts are in vast excess in yeast promoters, and presented evidence for their role in gene regulation. Here we report the systematic mapping of all six binary combinations on the level of complete sequenced chromosomes, as well as in their different subregions. RESULTS DNA tracts composed of the above binary base combinations have been mapped in seven sequenced chromosomes: Human chromosomes 21 and 22 (the major contigs); Drosophila melanogaster chr. 2R; Caenorhabditis elegans chr. I; Arabidopsis thaliana chr. II; Saccharomyces cerevisiae chr. IV and M. jannaschii. A huge over-representation, reaching a million-fold, has been found for very long tracts of all binary motifs except S, in each of the seven organisms. Long R.Y tracts are the most excessive, except in D. melanogaster, where the K.M motif predominates. S (G, C rich) tracts are in excess mainly in CpG islands; the W motif predominates in bacteria. Many excessively long W tracts are nevertheless found also in the archaeon and in the eukaryotes. The survey of complete chromosomes enables us, for the first time, to map systematically the intergenic regions. In human and other chromosomes we find the highest over-representation of the binary DNA tracts in the intergenic regions. These over-representations are only partly explainable by the presence of interspersed elements. CONCLUSIONS The over-representation of long DNA tracts composed of five of the above motifs is the largest deviation from randomness so far established for DNA, and this in a wide range of eukaryotic and archaeal chromosomes. A propensity for ready DNA unwinding is proposed as the functional role, explaining the evolutionary conservation of the huge excesses observed.
Affiliation(s)
- Gad Yagil
- Dept of Molecular Cell Biology, The Weizmann Institute of Biology, Rehovot, Israel 76100.
|
32
|
Conde J. Twofold symmetries in nucleotide distribution in large domains of Saccharomyces cerevisiae Chromosome I. Mol Genet Genomics 2003; 270:287-95. [PMID: 14600830 DOI: 10.1007/s00438-003-0871-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2003] [Accepted: 05/27/2003] [Indexed: 11/26/2022]
Abstract
Single stranded chains of biological DNA show a widespread occurrence of parity for complementary nucleotides, i.e., A=T, G=C. This has been referred to as A-T, G-C symmetry. A distinction must be made between this, which this paper calls mirror symmetry, and twofold symmetry, where complementary nucleotide parity occurs between two segments, of the same length and equidistant from a symmetry center, along a single-stranded DNA chain. I have analysed the sequence of Chromosome I of Saccharomyces cerevisiae for the occurrence of complementary nucleotide symmetry. Open reading frame (ORF) sequences made up 63% of the total chromosome length and most of them were asymmetric for both A-T and G-C. The sign of A-T asymmetry was correlated with transcriptional orientation (A>T for sense and A<T for antisense ORFs), whereas G-C asymmetry was not. However, long single-stranded segments of Chromosome I were A-T mirror symmetric because they contained similar frequencies of ORFs in both transcriptional orientations. The same results were obtained with the AA-TT pair of complementary dinucleotides. Profiling of AA-TT symmetry along Chromosome I showed this chromosome to be organized as a succession of five domains that were twofold symmetric for AA-TT, placed between two subtelomeric regions without clear symmetry properties. This pattern was destroyed when ORF sequences were randomly repositioned along the chromosome. Based on the above findings, an architectural model is proposed for Chromosome I, in which the twofold symmetric domains, from 30 to 50 kb long, correspond to chromosome loops.
|
33
|
Lambros RJ, Mortimer JR, Forsdyke DR. Optimum growth temperature and the base composition of open reading frames in prokaryotes. Extremophiles 2003; 7:443-50. [PMID: 14666404 DOI: 10.1007/s00792-003-0353-4] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.0] [Received: 02/12/2003] [Accepted: 06/20/2003] [Indexed: 11/27/2022]
Abstract
The purine-loading index (PLI) is the difference between the numbers of purines (A+G) and pyrimidines (T+C) per kilobase of single-stranded nucleic acid. By purine-loading their mRNAs organisms may minimize unnecessary RNA-RNA interactions and prevent inadvertent formation of "self" double-stranded RNA. Since RNA-RNA interactions have a strong entropy-driven component, this need to minimize should increase as temperature increases. Consistent with this, we report for 550 prokaryotic species that optimum growth temperature is related to the average PLI of open reading frames. With increasing temperature prokaryotes tend to acquire base A and lose base C, while keeping bases T and G relatively constant. Accordingly, while the PLI increases, the (G+C)% decreases. The previously observed positive correlation between (G+C)% and optimum growth temperature, which applies to RNA species whose structure is of major importance for their function (ribosomal and transfer RNAs) does not apply to mRNAs, and hence is unlikely to apply generally to genomic DNA.
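The definition of the purine-loading index given in the first sentence of this abstract translates directly into code. The sketch below is an illustration under stated assumptions: the function name `purine_loading_index` is hypothetical, U is counted with the pyrimidines so that RNA input works, and ambiguous symbols are excluded from the length used for the per-kilobase scaling.

```python
def purine_loading_index(seq: str) -> float:
    """(purines - pyrimidines) per kilobase of single-stranded sequence.

    Assumptions not stated in the abstract: U counts as a pyrimidine and
    symbols other than A, G, C, T, U are ignored when measuring length.
    """
    seq = seq.upper()
    purines = seq.count("A") + seq.count("G")
    pyrimidines = seq.count("C") + seq.count("T") + seq.count("U")
    length = purines + pyrimidines
    if length == 0:
        raise ValueError("sequence contains no A, C, G, T or U")
    return 1000.0 * (purines - pyrimidines) / length

if __name__ == "__main__":
    print(purine_loading_index("AAAGC"))  # purine-rich: positive index
    print(purine_loading_index("ACGT"))   # balanced: zero
```

A positive value indicates purine loading and a negative value pyrimidine loading; averaging the index over all open reading frames of a genome would give the per-species quantity that the abstract relates to optimum growth temperature.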
Affiliation(s)
- R J Lambros
- Department of Biochemistry, Queen's University, Kingston, Ontario K7L3N6, Canada
|
34
|
Xue HY, Forsdyke DR. Low-complexity segments in Plasmodium falciparum proteins are primarily nucleic acid level adaptations. Mol Biochem Parasitol 2003; 128:21-32. [PMID: 12706793 DOI: 10.1016/s0166-6851(03)00039-2] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.1] [Indexed: 11/16/2022]
Abstract
Protein segments that contain few of the possible 20 amino acids, sometimes in tandem repeat arrays, are referred to as containing "simple" or "low-complexity" sequence. Many Plasmodium falciparum proteins are longer than their homologs in other species by virtue of their content of such low-complexity segments that have no known function; these are interspersed among segments of higher complexity to which function can often be ascribed. If there is low complexity at the protein level, there is likely to be low complexity at the corresponding nucleic acid level (departure from equifrequency of the four bases). Thus, low complexity may have been selected primarily at the nucleic acid level and low complexity at the protein level may be secondary. In this case, the amino acid composition of low-complexity segments should be more reflective than that of high complexity segments on forces operating at the nucleic acid level, which include GC-pressure and AG-pressure. Consistent with this, for amino acid determining first and second codon positions, open reading frames containing low-complexity segments show increased contributions to downward GC-pressure (revealed as decreased percentage of G+C) and to upward AG-pressure (revealed as increased percentage A+G). When not countermanded by high contributions to AG-pressure, low-complexity segments can contribute to base order-dependent fold potential; in this respect, they resemble introns. Thus, in P. falciparum, low-complexity segments appear as adaptations primarily serving nucleic acid level functions.
Affiliation(s)
- H Y Xue
- Department of Biochemistry, Queen's University, Kingston, Ont, K7L3N6, Canada
|
35
|
Barrette IH, McKenna S, Taylor DR, Forsdyke DR. Introns resolve the conflict between base order-dependent stem-loop potential and the encoding of RNA or protein: further evidence from overlapping genes. Gene 2001; 270:181-9. [PMID: 11404015 DOI: 10.1016/s0378-1119(01)00477-2] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 10/17/2022]
Abstract
Many eukaryotic genes are split into exons and introns, the latter being removed post-transcriptionally so that only exon sequences appear in cytoplasmic RNAs. Since introns appear in both protein-encoding RNAs and non-protein-coding RNAs, they interrupt genetic information per se, not just protein-encoding information. A DNA sequence has the potential to carry more than one type of genetic information, but different types may conflict. Thus, it has been proposed that introns arose because sequences were unable to contain concomitantly complete information for the encoding both of stem-loops and of cytoplasmic products (protein and/or RNA). Stem-loop potential is held to be selectively advantageous since it promotes the recombination-dependent correction of genetic errors. Stem-loop potential, the best local measure of which is base order-dependent stem-loop potential, tends to be less in exons than in introns. This is particularly evident in genes evolving rapidly under positive Darwinian selection, where the protein-encoding function is dominant. Evidence is now presented that the rare regions where genes overlap also impose excessive encoding demands so that the concomitant coding of base order-dependent stem-loop potential is decreased. Our results are consistent with the hypothesis that sequences with high stem-loop potential arose in the early 'RNA world'. Ancestors of modern genes would have entered this world when sequences (exons) encoding cytoplasmic products, were interspersed with sequences (introns) encoding selectively advantageous stem-loops. Purine-loading pressure would also have favoured intron formation.
Affiliation(s)
- I H Barrette
- Department of Biochemistry, Queen's University, Kingston, K7L3N6, Ontario, Canada
|
36
|
Cristillo AD, Mortimer JR, Barrette IH, Lillicrap TP, Forsdyke DR. Double-stranded RNA as a not-self alarm signal: to evade, most viruses purine-load their RNAs, but some (HTLV-1, Epstein-Barr) pyrimidine-load. J Theor Biol 2001; 208:475-91. [PMID: 11222051 DOI: 10.1006/jtbi.2000.2233] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.6] [Indexed: 11/22/2022]
Abstract
For double-stranded RNA (dsRNA) to signal the presence of foreign (non-self) nucleic acid, self-RNA-self-RNA interactions should be minimized. Indeed, self-RNAs appear to have been fine-tuned over evolutionary time by the introduction of purines in clusters in the loop regions of stem-loop structures. This adaptation should militate against the "kissing" interactions which initiate formation of dsRNA. Our analyses of virus base compositions suggest that, to avoid triggering the host cell's dsRNA surveillance mechanism, most viruses purine-load their RNAs to resemble host RNAs ("stealth" strategy). However, some GC-rich latent viruses (HTLV-1, EBV) pyrimidine-load their RNAs. It is suggested that when virus production begins, these RNAs suddenly increase in concentration and impair host mRNA function by virtue of an excess of complementary "kissing" interactions ("surprise" strategy). Remarkably, the only mRNA expressed in the most fundamental form of EBV latency (the "EBNA-1 program") is purine-loaded. This apparent stealth strategy is reinforced by a simple sequence repeat which prefers purine-rich codons. During latent infection the EBNA-1 protein may evade recognition by cytotoxic T-cells, not by virtue of containing a simple sequence amino acid repeat as has been proposed, but by virtue of the encoding mRNA being purine-loaded to prevent interactions with host RNAs of either genic or non-genic origin.
Affiliation(s)
- A D Cristillo
- Department of Biochemistry, Queen's University, Kingston, Ontario, K7L3N6, Canada
|
37
|
Abstract
Full-sequence data available for Plasmodium falciparum chromosomes 2 and 3 are exploited to perform a statistical analysis of the long tracts of biased amino acid composition that characterize the vast majority of P. falciparum proteins and to make a comparison with similarly defined tracts from other simple eukaryotes. When the relatively minor subset of prevalently hydrophobic segments is discarded from the set of low-complexity segments identified by current segmentation methods in P. falciparum proteins, a good correspondence is found between prevalently hydrophilic low-complexity segments and the species-specific, rapidly diverging insertions detected by multiple-alignment procedures when sequences of bona fide homologs are available. Amino acid preferences are fairly uniform in the set of hydrophilic low-complexity segments identified in the two P. falciparum chromosomes sequenced, as well as in sequenced genes from Plasmodium berghei, but differ from those observed in Saccharomyces cerevisiae and Dictyostelium discoideum. In the two plasmodial species, amino acid frequencies do not correlate with properties such as hydrophilicity, small volume, or flexibility, which might be expected to characterize residues involved in nonglobular domains but do correlate with A-richness in codons. An effect of phenotypic selection versus neutral drift, however, is suggested by the predominance of asparagine over lysine.
|
38
|
Abstract
Full-sequence data available for Plasmodium falciparum chromosomes 2 and 3 are exploited to perform a statistical analysis of the long tracts of biased amino acid composition that characterize the vast majority of P. falciparum proteins and to make a comparison with similarly defined tracts from other simple eukaryotes. When the relatively minor subset of prevalently hydrophobic segments is discarded from the set of low-complexity segments identified by current segmentation methods in P. falciparum proteins, a good correspondence is found between prevalently hydrophilic low-complexity segments and the species-specific, rapidly diverging insertions detected by multiple-alignment procedures when sequences of bona fide homologs are available. Amino acid preferences are fairly uniform in the set of hydrophilic low-complexity segments identified in the two P. falciparum chromosomes sequenced, as well as in sequenced genes from Plasmodium berghei, but differ from those observed in Saccharomyces cerevisiae and Dictyostelium discoideum. In the two plasmodial species, amino acid frequencies do not correlate with properties such as hydrophilicity, small volume, or flexibility, which might be expected to characterize residues involved in nonglobular domains but do correlate with A-richness in codons. An effect of phenotypic selection versus neutral drift, however, is suggested by the predominance of asparagine over lysine.
Affiliation(s)
- E Pizzi
- Laboratorio di Biologia Cellulare, Istituto Superiore di Sanità, 00161 Rome, Italy
|
39
|
Abstract
Of Chargaff's four rules on DNA base composition, only his first parity rule was incorporated into mainstream biology as the DNA double helix. Now, the cluster rule, the second parity rule, and the GC rule reveal the multiple levels of information in our genomes and potential conflicts between them. In these terms we can understand how double-stranded RNA became an intracellular alarm signal, how potentially recombining nucleic acids can distinguish between 'self' and 'not-self', so leading to the origin of species, how isochores evolved to facilitate gene duplication, and how unlikely it is that any mutation can ever remain truly neutral.
Affiliation(s)
- D R Forsdyke
- Department of Biochemistry, Queen's University, Kingston, Ontario K7L 3N6, Canada.
|
40
|
Lao PJ, Forsdyke DR. Crossover hot-spot instigator (Chi) sequences in Escherichia coli occupy distinct recombination/transcription islands. Gene 2000; 243:47-57. [PMID: 10675612 DOI: 10.1016/s0378-1119(99)00564-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.6] [Indexed: 11/28/2022]
Abstract
Crossover hot-spot instigator (Chi) sequences (5'-GCTGGTGG-3') are orientation-dependent, strand-specific sequences implicated in RecA-mediated DNA recombination. In Escherichia coli and Haemophilus influenzae, Chi and Chi-like sequences preferentially locate to approximately 1 kb recombination 'islands' in the mRNA-synonymous strands of open reading frames (ORFs). Since mRNA-synonymous strands follow Szybalski's transcription direction rule in being G-rich, and the average ORF is about 1 kb, then, on this basis alone, Chi sequences are seen to reside in 1 kb G-rich 'islands'. However, RecA preferentially binds GT-rich sequences, suggesting that genomic context might potentiate Chi action. Consistent with this, we report for E. coli that 1 kb sequence windows with Chi near their centres are a distinct subset of total 1 kb windows, the mRNA-synonymous strands being preferentially enriched in both G and T. Chi function might be particularly important for bacteria that survive high temperature and radiation. These often exist in habitats where recombination with E. coli DNA would be unlikely, so canonical Chi sequences might not confer a selective disadvantage in this respect. In general, Chi sequences are not more frequent in thermophilic bacteria and Deinococcus radiodurans than in E. coli and other mesophilic bacteria. Only two of five thermophilic bacteria examined showed preferential location of Chi sequences to mRNA-synonymous strands. In the thermophile Methanococcus jannaschii, windows containing the canonical Chi sequence do not form a distinct subset. We suggest that in thermophilic bacteria and D. radiodurans the Chi function may be achieved by sequences that differ from the canonical Chi sequence, or that the number of these sequences is sufficient, or that the Chi function is unnecessary.
Affiliation(s)
- P J Lao
- Department of Biochemistry, Queen's University, Kingston, Canada
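Because the Chi sequence in the abstract above is strand-specific, a strand-aware count is the basic measurement behind such an analysis. The following is only a minimal sketch, not the authors' pipeline: the function names (`reverse_complement`, `count_motif`, `chi_by_strand`) are hypothetical, and it assumes a plain DNA string with no window or ORF annotation.

```python
# Hypothetical helper names; a sketch, not the method used in the cited paper.
CHI = "GCTGGTGG"  # canonical Chi sequence (5'->3'), as given in the abstract
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT symbols are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_motif(seq, motif=CHI):
    """Count occurrences of motif in seq, allowing overlapping matches."""
    seq = seq.upper()
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def chi_by_strand(seq):
    """Return (count on the given strand, count on the complementary strand)."""
    return count_motif(seq), count_motif(reverse_complement(seq))
```

For example, `chi_by_strand("AAGCTGGTGGTTCCACCAGCAA")` counts one Chi site on each strand, because `CCACCAGC` is the reverse complement of the Chi sequence.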
|
41
|
Lao PJ, Forsdyke DR. Thermophilic bacteria strictly obey Szybalski's transcription direction rule and politely purine-load RNAs with both adenine and guanine. Genome Res 2000; 10:228-36. [PMID: 10673280 PMCID: PMC310832 DOI: 10.1101/gr.10.2.228] [Citation(s) in RCA: 83] [Impact Index Per Article: 3.5] [Received: 08/23/1999] [Accepted: 12/16/1999] [Indexed: 11/24/2022]
Abstract
When transcription is to the right of the promoter, the "top," mRNA-synonymous strand of DNA tends to be purine-rich. When transcription is to the left of the promoter, the top, mRNA-template strand tends to be pyrimidine-rich. This transcription-direction rule suggests that there has been an evolutionary selection pressure for the purine-loading of RNAs. The politeness hypothesis states that purine-loading prevents distracting RNA-RNA interactions and excessive formation of double-stranded RNA, which might trigger various intracellular alarms. Because RNA-RNA interactions have a distinct entropy-driven component, the pressure for the evolution of purine-loading might be greater in organisms living at high temperatures. In support of this, we find that Chargaff differences (a measure of purine-loading) are greater in thermophiles than in nonthermophiles and extend to both purine bases. In thermophiles the pressure to purine-load affects codon choice, indicating that some features of their amino acid composition (e.g., high levels of glutamic acid) might reflect purine-loading pressure (i.e., constraints on mRNA) rather than direct constraints on protein structure and function.
Affiliation(s)
- P J Lao
- Department of Biochemistry, Queen's University, Kingston, Ontario, K7L 3N6, Canada
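The abstract above uses Chargaff differences as a measure of purine-loading. As a rough illustration, the sketch below computes one plausible normalization, (A - T)/(A + T) and (G - C)/(G + C), for a single strand. This normalization and the function name `chargaff_differences` are assumptions made here for illustration; the cited paper may scale the differences differently.

```python
from collections import Counter


def chargaff_differences(seq):
    """Return ((A-T)/(A+T), (G-C)/(G+C)) for one DNA strand.

    Assumed normalization, for illustration only. Positive values mean an
    excess of the purine (A or G) over its pairing pyrimidine (T or C).
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    at_diff = (a - t) / (a + t) if (a + t) else 0.0
    gc_diff = (g - c) / (g + c) if (g + c) else 0.0
    return at_diff, gc_diff
```

Applied to an mRNA-synonymous strand, positive values for both differences would correspond to purine-loading with both adenine and guanine, as the title describes for thermophiles.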
|
42
|
Forsdyke DR. Two levels of information in DNA: relationship of Romanes' "intrinsic" variability of the reproductive system, and Bateson's "residue" to the species-dependent component of the base composition, (C+G)%. J Theor Biol 1999; 201:47-61. [PMID: 10534435 DOI: 10.1006/jtbi.1999.1013] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
Abstract
In 1886 Charles Darwin's research associate George Romanes published a paper entitled "Physiological Selection: An Additional Suggestion on the Origin of Species". This was criticized by his Victorian contemporaries and largely ignored by those who followed. However, the recent recognition of two levels of information in DNA suggests that Romanes had solved the major problems with Darwin's theory. It was apparent from the outset that the form of reproductive isolation likely to apply most generally to initial species divergence (hybrid sterility), would depend on differences, not in "primary" information ("genic"), but in "secondary" information ("chromosomal"). This viewpoint, further elaborated by Bateson & Saunders (1902), White (1978), and King (1993), is criticized by the genic school (Coyne & Orr, 1998) because it requires visible differences between chromosomes, and appears not to explain Haldane's rule. However, chromosomal differentiation with respect to the species-dependent component of base composition [(C+G)%; Forsdyke, 1996] appears to resolve these problems. Because it explained so much, it was easy to believe that the genic viewpoint explained everything. Romanes and Bateson thought otherwise. We are only just beginning to recognize what they were trying to tell us.
Affiliation(s)
- D R Forsdyke
- Department of Biochemistry, Queen's University, Kingston, Ontario, K7L 3N6, Canada.
|
43
|
Li W. Statistical properties of open reading frames in complete genome sequences. Computers & Chemistry 1999; 23:283-301. [PMID: 10404621 DOI: 10.1016/s0097-8485(99)00014-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
Some statistical properties of open reading frames in all currently available complete genome sequences are analyzed (17 prokaryotic genomes and 16 chromosome sequences from the yeast genome). The size distribution of open reading frames is characterized by various techniques, such as quantile tables, QQ-plots, rank-size plots (Zipf's plots), and spatial densities. The issue of the influence of CG% on the size distribution is addressed. When yeast chromosomes are compared with archaeal and eubacterial genomes, they tend to have more long open reading frames. There is little or no evidence to reject the null hypothesis that open reading frames on six different reading frames and two strands distribute similarly. A topic of current interest, the base composition asymmetry in open reading frames between the two strands, is studied using regression analysis. The base composition asymmetry at three codon positions is analyzed separately. It was shown in these genome sequences that the first codon position is G- and A-rich (i.e. purine-rich); there is a co-existence of A- and T-rich branches at the second codon position; and the third codon position is weakly T-rich.
Affiliation(s)
- W Li
- Laboratory of Statistical Genetics, Rockefeller University, New York, NY 10021, USA.
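The per-codon-position base composition discussed in the abstract above can be tabulated with a few lines of code. This is a minimal sketch under stated assumptions: the input is a single in-frame ORF given as a DNA string, trailing bases that do not complete a codon are ignored, and the function name `codon_position_composition` is hypothetical. It does not reproduce the regression analysis described in the abstract.

```python
from collections import Counter


def codon_position_composition(orf):
    """Return three dicts of base frequencies, one per codon position.

    Assumes `orf` starts in frame. Bases other than A, C, G, T are
    excluded from the totals.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # drop an incomplete trailing codon
    result = []
    for pos in range(3):
        counts = Counter(orf[pos:usable:3])
        total = sum(counts[base] for base in "ACGT")
        result.append(
            {base: (counts[base] / total if total else 0.0) for base in "ACGT"}
        )
    return result
```

Comparing the purine fractions (A + G) across the three returned dictionaries gives a quick check of whether the first codon position is purine-rich in a given ORF.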
|