1
Bernaola-Galván P, Carpena P, Gómez-Martín C, Oliver JL. Compositional Structure of the Genome: A Review. BIOLOGY 2023; 12:849. [PMID: 37372134 PMCID: PMC10295253 DOI: 10.3390/biology12060849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 06/06/2023] [Accepted: 06/07/2023] [Indexed: 06/29/2023]
Abstract
As the genome carries the historical information of a species' biotic and environmental interactions, analyzing changes in genome structure over time by using powerful statistical physics methods (such as entropic segmentation algorithms, fluctuation analysis in DNA walks, or measures of compositional complexity) provides valuable insights into genome evolution. Nucleotide frequencies tend to vary along the DNA chain, resulting in a hierarchically patchy chromosome structure with heterogeneities at different length scales that range from a few nucleotides to tens of millions of them. Fluctuation analysis reveals that these compositional structures can be classified into three main categories: (1) short-range heterogeneities (below a few kilobase pairs (Kbp)) primarily attributed to the alternation of coding and noncoding regions, interspersed or tandem repeats densities, etc.; (2) isochores, spanning tens to hundreds of tens of Kbp; and (3) superstructures, reaching sizes of tens of megabase pairs (Mbp) or even larger. The obtained isochore and superstructure coordinates in the first complete T2T human sequence are now shared in a public database. In this way, interested researchers can use T2T isochore data, as well as the annotations for different genome elements, to check a specific hypothesis about genome structure. Similarly to other levels of biological organization, a hierarchical compositional structure is prevalent in the genome. Once the compositional structure of a genome is identified, various measures can be derived to quantify the heterogeneity of such structure. The distribution of segment G+C content has recently been proposed as a new genome signature that proves to be useful for comparing complete genomes. Another meaningful measure is the sequence compositional complexity (SCC), which has been used for genome structure comparisons. Lastly, we review the recent genome comparisons in species of the ancient phylum Cyanobacteria, conducted by phylogenetic regression of SCC against time, which have revealed positive trends towards higher genome complexity. These findings provide the first evidence for a driven progressive evolution of genome compositional structure.
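The entropic segmentation mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`shannon_entropy`, `js_divergence`, `best_cut`) are hypothetical, the sketch assumes the length-weighted Jensen-Shannon divergence between the two sides of a cut as the criterion, and it locates only a single cut point, without the significance test or the recursive splitting that a full segmentation procedure would need.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in seq."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def js_divergence(seq, cut):
    """Length-weighted Jensen-Shannon divergence between seq[:cut] and seq[cut:]."""
    n = len(seq)
    left, right = seq[:cut], seq[cut:]
    return (
        shannon_entropy(seq)
        - (len(left) / n) * shannon_entropy(left)
        - (len(right) / n) * shannon_entropy(right)
    )


def best_cut(seq):
    """Return (cut, divergence) for the interior cut with the largest divergence."""
    if len(seq) < 2:
        raise ValueError("need at least two symbols to place a cut")
    return max(
        ((i, js_divergence(seq, i)) for i in range(1, len(seq))),
        key=lambda pair: pair[1],
    )
```

For the toy sequence `"A" * 10 + "G" * 10`, the best cut falls at position 10, where both halves are perfectly homogeneous. The brute-force scan is quadratic in sequence length, so it is suitable only for short illustrative sequences.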
Affiliation(s)
- Pedro Bernaola-Galván
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Pedro Carpena
  - Department of Applied Physics II and Institute Carlos I for Theoretical and Computational Physics, University of Málaga, 29071 Málaga, Spain
- Cristina Gómez-Martín
  - Department of Pathology, Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
- Jose L. Oliver
  - Department of Genetics, Faculty of Sciences, 18071 and Laboratory of Bioinformatics, Institute of Biotechnology, Center of Biomedical Research, University of Granada, 18100 Granada, Spain
2
Kolařík M, Wei IC, Hsieh SY, Piepenbring M, Kirschner R. Nucleotide composition bias of rDNA sequences as a source of phylogenetic artifacts in Basidiomycota—a case of a new lineage of a uredinicolous Ramularia-like anamorph with affinities to Ustilaginomycotina. Mycol Prog 2021. [DOI: 10.1007/s11557-021-01749-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
3
Simón D, Cristina J, Musto H. Nucleotide Composition and Codon Usage Across Viruses and Their Respective Hosts. Front Microbiol 2021; 12:646300. [PMID: 34262534 PMCID: PMC8274242 DOI: 10.3389/fmicb.2021.646300] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 12/26/2020] [Accepted: 06/04/2021] [Indexed: 11/13/2022]
Abstract
The genetic material of the three domains of life (Bacteria, Archaea, and Eukaryota) is always double-stranded DNA, and their GC content (molar content of guanine plus cytosine) varies between ≈ 13% and ≈ 75%. Nucleotide composition is the simplest way of characterizing genomes. Despite this simplicity, it has several implications. Indeed, it is the main factor that determines, among other features, dinucleotide frequencies, repeated short DNA sequences, and codon and amino acid usage. Which forces drive this strong variation is still a matter of controversy. For rather obvious reasons, most of the studies concerning this huge variation and its consequences have been done in free-living organisms. However, no recent comprehensive study of all known viruses has been done (that is, concerning all available sequences). Viruses, by far the most abundant biological entities on Earth, are the causative agents of many diseases. An overview of these entities is important also because their genetic material is not always double-stranded DNA: indeed, certain viruses have as genetic material single-stranded DNA, double-stranded RNA, single-stranded RNA, and/or retro-transcribing. Therefore, one may wonder if what we have learned about the evolution of GC content and its implications in prokaryotes and eukaryotes also applies to viruses. In this contribution, we attempt to describe compositional properties of ∼ 10,000 viral species: base composition (globally and according to Baltimore classification), correlations among non-coding regions and the three codon positions, and the relationship of the nucleotide frequencies and codon usage of viruses with the same feature of their hosts. This allowed us to determine how the base composition of phages strongly correlates with the value of their respective hosts, while that of eukaryotic viruses does not (with fungi and protists as exceptions). Finally, we discuss some of these results concerning codon usage: reinforcing previous results, we found that phages and hosts exhibit moderate to high correlations, while for eukaryotes and their viruses the correlations are weak or do not exist.
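As a hedged illustration of the GC content measure defined in this abstract (guanine plus cytosine as a fraction of all bases), the following Python sketch computes it for a single sequence. The function name `gc_content` is hypothetical and this is not the authors' pipeline; the sketch assumes that ambiguous symbols such as `N` should be excluded from the denominator and that `U` is counted so RNA genomes can be handled.

```python
def gc_content(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T, U) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    # Assumption: ambiguity codes (N, R, Y, ...) are ignored entirely.
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total
```

Applied to a virus genome and to its host genome, two such values give one point in the kind of virus-versus-host comparison the abstract describes.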
Affiliation(s)
- Diego Simón
  - Laboratorio de Genómica Evolutiva, Departamento de Biología Celular y Molecular, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay
  - Laboratorio de Virología Molecular, Centro de Investigaciones Nucleares, Facultad de Ciencias, Universidad de la Republica, Montevideo, Uruguay
  - Laboratorio de Evolución Experimental de Virus, Institut Pasteur de Montevideo, Montevideo, Uruguay
- Juan Cristina
  - Laboratorio de Virología Molecular, Centro de Investigaciones Nucleares, Facultad de Ciencias, Universidad de la Republica, Montevideo, Uruguay
- Héctor Musto
  - Laboratorio de Genómica Evolutiva, Departamento de Biología Celular y Molecular, Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay
4
Berná L, Rodriguez M, Chiribao ML, Parodi-Talice A, Pita S, Rijo G, Alvarez-Valin F, Robello C. Expanding an expanded genome: long-read sequencing of Trypanosoma cruzi. Microb Genom 2018; 4. [PMID: 29708484 PMCID: PMC5994713 DOI: 10.1099/mgen.0.000177] [Citation(s) in RCA: 76] [Impact Index Per Article: 10.9] [Indexed: 01/18/2023]
Abstract
Although the genome of Trypanosoma cruzi, the causative agent of Chagas disease, was first made available in 2005, with additional strains reported later, the intrinsic genome complexity of this parasite (the abundance of repetitive sequences and genes organized in tandem) has traditionally hindered high-quality genome assembly and annotation. This also limits diverse types of analyses that require high degrees of precision. Long reads generated by third-generation sequencing technologies are particularly suitable to address the challenges associated with T. cruzi’s genome since they permit direct determination of the full sequence of large clusters of repetitive sequences without collapsing them. This, in turn, not only allows accurate estimation of gene copy numbers but also circumvents assembly fragmentation. Here, we present the analysis of the genome sequences of two T. cruzi clones: the hybrid TCC (TcVI) and the non-hybrid Dm28c (TcI), determined by PacBio Single Molecule Real-Time (SMRT) technology. The improved assemblies obtained here permitted us to accurately estimate gene copy numbers and the abundance and distribution of repetitive sequences (including satellites and retroelements). We found that the genome of T. cruzi is composed of a ‘core compartment’ and a ‘disruptive compartment’ which exhibit opposite GC content and gene composition. Novel tandem and dispersed repetitive sequences were identified, including some located inside coding sequences. Additionally, homologous chromosomes were separately assembled, allowing us to retrieve haplotypes as separate contigs instead of a unique mosaic sequence. Finally, manual annotation of surface multigene families, mucins and trans-sialidases now allows a better overview of these complex groups of genes.
Affiliation(s)
- Luisa Berná
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
- Matias Rodriguez
  - Sección Biomatemática - Unidad de Genómica Evolutiva, Facultad de Ciencias-UDELAR, Montevideo, Uruguay
- María Laura Chiribao
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Departamento de Bioquímica, Facultad de Medicina-UDELAR, Montevideo, Uruguay
- Adriana Parodi-Talice
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Sección Genética, Facultad de Ciencias-UDELAR, Montevideo, Uruguay
- Sebastián Pita
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Sección Genética, Facultad de Ciencias-UDELAR, Montevideo, Uruguay
- Gastón Rijo
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
- Fernando Alvarez-Valin
  - Sección Biomatemática - Unidad de Genómica Evolutiva, Facultad de Ciencias-UDELAR, Montevideo, Uruguay
- Carlos Robello
  - Laboratory of Host Pathogen Interactions-UBM, Institut Pasteur de Montevideo, Montevideo, Uruguay
  - Departamento de Bioquímica, Facultad de Medicina-UDELAR, Montevideo, Uruguay
5
Costantini M, Musto H. The Isochores as a Fundamental Level of Genome Structure and Organization: A General Overview. J Mol Evol 2017; 84:93-103. [PMID: 28243687 DOI: 10.1007/s00239-017-9785-9] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 10/21/2016] [Accepted: 02/15/2017] [Indexed: 11/30/2022]
Abstract
The recent availability of a number of fully sequenced genomes (including marine organisms) has made it possible to map the isochores very precisely from DNA sequences, confirming the results obtained before genome sequencing by ultracentrifugation in CsCl. In fact, the analytical profile of human DNA showed that the vertebrate genome is a mosaic of isochores, typically megabase-size DNA segments that belong to a small number of families characterized by different GC levels. In this review, we will concentrate on some general genome features regarding the compositional organization of different organisms and their evolution, ranging from vertebrates and invertebrates to unicellular organisms. Since isochores are tightly linked to biological properties such as gene density, replication timing, and recombination, the new level of detail provided by the isochore map helped the understanding of genome structure, function, and evolution. All the findings reported here confirm the idea that the isochores can be considered as a "fundamental level of genome structure and organization." We stress that we do not discuss in this review the origin of isochores, which is still a matter of controversy, but we focus on well established structural and physiological aspects.
Affiliation(s)
- Maria Costantini
  - Department of Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121, Napoli, Italy
- Héctor Musto
  - Laboratorio de Organización y Evolución del Genoma, Unidad de Genómica Evolutiva, Facultad de Ciencias, 11400, Montevideo, Uruguay
6
Testa AC, Oliver RP, Hane JK. OcculterCut: A Comprehensive Survey of AT-Rich Regions in Fungal Genomes. Genome Biol Evol 2016; 8:2044-64. [PMID: 27289099 PMCID: PMC4943192 DOI: 10.1093/gbe/evw121] [Citation(s) in RCA: 83] [Impact Index Per Article: 9.2] [Accepted: 05/14/2016] [Indexed: 12/03/2022]
Abstract
We present a novel method to measure the local GC-content bias in genomes and a survey of published fungal species. The method, enacted as "OcculterCut" (https://sourceforge.net/projects/occultercut, last accessed April 30, 2016), identified species containing distinct AT-rich regions. In most fungal taxa, AT-rich regions are a signature of repeat-induced point mutation (RIP), which targets repetitive DNA and decreases GC-content through the conversion of cytosine to thymine bases. RIP has in turn been identified as a driver of fungal genome evolution, as RIP mutations can also occur in single-copy genes neighboring repeat-rich regions. Over time RIP perpetuates "two speeds" of gene evolution in the GC-equilibrated and AT-rich regions of fungal genomes. In this study, genomes showing evidence of this process are found to be common, particularly among the Pezizomycotina. Further analysis highlighted differences in amino acid composition and putative functions of genes from these regions, supporting the hypothesis that these regions play an important role in fungal evolution. OcculterCut can also be used to identify genes undergoing RIP-assisted diversifying selection, such as small, secreted effector proteins that mediate host-microbe disease interactions.
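The idea of measuring local GC content along a genome can be sketched in Python as follows. This is not OcculterCut's algorithm, which the abstract does not describe in detail; it is a simple fixed-window scan, and the function names (`windowed_gc`, `at_rich_windows`), the default window size, and the `max_gc` threshold are all assumptions chosen for illustration.

```python
def windowed_gc(seq, window=1000):
    """Return (start, gc_fraction) for consecutive non-overlapping windows.

    A trailing stretch shorter than one window is ignored.
    """
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, gc / window))
    return results


def at_rich_windows(seq, window=1000, max_gc=0.35):
    """Start positions of windows whose GC fraction is below max_gc (assumed cutoff)."""
    return [start for start, gc in windowed_gc(seq, window) if gc < max_gc]
```

A fixed cutoff is the crudest possible choice; it is used here only to show how AT-rich stretches would stand out in a windowed profile.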
Affiliation(s)
- Alison C Testa
  - Department of Environment & Agriculture, Centre for Crop and Disease Management, Curtin University, Perth, Australia
- Richard P Oliver
  - Department of Environment & Agriculture, Centre for Crop and Disease Management, Curtin University, Perth, Australia
- James K Hane
  - Department of Environment & Agriculture, Centre for Crop and Disease Management, Curtin University, Perth, Australia
  - Curtin Institute for Computation, Curtin University, Perth, Australia
7
Abstract
How the same DNA sequences can function in the three-dimensional architecture of the interphase nucleus, fold into the very compact structure of metaphase chromosomes and go precisely back to the original interphase architecture in the following cell cycle remains an unresolved question to this day. The strategy used to address this issue was to analyze the correlations between chromosome architecture and the compositional patterns of DNA sequences spanning a size range from a few hundred to a few thousand kilobases. This is a critical range that encompasses isochores, interphase chromatin domains and boundaries, and chromosomal bands. The solution rests on the following key points: 1) the transition from the looped domains and sub-domains of interphase chromatin to the 30-nm fiber loops of early prophase chromosomes goes through the unfolding into an extended chromatin structure (probably a 10-nm "beads-on-a-string" structure); 2) the architectural proteins of interphase chromatin, such as CTCF and cohesin sub-units, are retained in mitosis and are part of the discontinuous protein scaffold of mitotic chromosomes; 3) the conservation of the link between architectural proteins and their binding sites on DNA through the cell cycle explains the "mitotic memory" of interphase architecture and the reversibility of the interphase to mitosis process. The results presented here also lead to a general conclusion which concerns the existence of correlations between the isochore organization of the genome and the architecture of chromosomes from interphase to metaphase.
Affiliation(s)
- Giorgio Bernardi
  - Science Department, Roma Tre University, Marconi, Rome, Italy
  - Stazione Zoologica Anton Dohrn, Villa Comunale, Naples, Italy
8
Costantini M. An overview on genome organization of marine organisms. Mar Genomics 2015; 24 Pt 1:3-9. [PMID: 25899406 DOI: 10.1016/j.margen.2015.03.015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2015] [Revised: 03/17/2015] [Accepted: 03/17/2015] [Indexed: 11/16/2022]
Abstract
In this review we will concentrate on some general genome features of marine organisms and their evolution, ranging from vertebrates and invertebrates to unicellular organisms. Before genome sequencing, ultracentrifugation in CsCl led to high resolution of mammalian DNA (without looking at the sequence). The analytical profile of human DNA showed that the vertebrate genome is a mosaic of isochores, typically megabase-size DNA segments that belong to a small number of families characterized by different GC levels. The recent availability of a number of fully sequenced genomes allowed the isochores to be mapped very precisely from DNA sequences. Since isochores are tightly linked to biological properties such as gene density, replication timing and recombination, the new level of detail provided by the isochore map helped the understanding of genome structure, function and evolution. This led to the current level of knowledge and to further insights.
Affiliation(s)
- Maria Costantini
  - Department of Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, Villa Comunale, 80121 Naples, Italy
9
Elhaik E, Graur D. A comparative study and a phylogenetic exploration of the compositional architectures of mammalian nuclear genomes. PLoS Comput Biol 2014; 10:e1003925. [PMID: 25375262 PMCID: PMC4222635 DOI: 10.1371/journal.pcbi.1003925] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 04/12/2014] [Accepted: 09/18/2014] [Indexed: 11/18/2022]
Abstract
For the past four decades the compositional organization of the mammalian genome posed a formidable challenge to molecular evolutionists attempting to explain it from an evolutionary perspective. Unfortunately, most of the explanations adhered to the "isochore theory," which has long been rebutted. Recently, an alternative compositional domain model was proposed depicting the human and cow genomes as composed mostly of short compositionally homogeneous and nonhomogeneous domains and a few long ones. We test the validity of this model through a rigorous sequence-based analysis of eleven completely sequenced mammalian and avian genomes. Seven attributes of compositional domains are used in the analyses: (1) the number of compositional domains, (2) compositional domain-length distribution, (3) density of compositional domains, (4) genome coverage by the different domain types, (5) degree of fit to a power-law distribution, (6) compositional domain GC content, and (7) the joint distribution of GC content and length of the different domain types. We discuss the evolution of these attributes in light of two competing phylogenetic hypotheses that differ from each other in the validity of clade Euarchontoglires. If valid, the murid genome compositional organization would be a derived state and exhibit a high similarity to that of other mammals. If invalid, the murid genome compositional organization would be closer to an ancestral state. We demonstrate that the compositional organization of the murid genome differs from those of primates and laurasiatherians, a phenomenon previously termed the "murid shift," and in many ways resembles the genome of opossum. We find no support for the "isochore theory." Instead, our findings depict the mammalian genome as a tapestry of mostly short homogeneous and nonhomogeneous domains and a few long ones, thus providing strong evidence in favor of the compositional domain model, and they seem to invalidate clade Euarchontoglires.
Affiliation(s)
- Eran Elhaik
  - Department of Animal and Plant Sciences, University of Sheffield, Sheffield, United Kingdom
- Dan Graur
  - Department of Biology & Biochemistry, University of Houston, Houston, Texas, United States of America
10
Khrustalev VV, Barkovsky EV, Khrustaleva TA, Lelevich SV. Intragenic isochores (intrachores) in the platelet phosphofructokinase gene of Passeriform birds. Gene 2014; 546:16-24. [PMID: 24861647 DOI: 10.1016/j.gene.2014.05.045] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 02/18/2014] [Revised: 05/09/2014] [Accepted: 05/21/2014] [Indexed: 10/25/2022]
Abstract
Total GC-content in the platelet phosphofructokinase gene of Zebra Finch (Taeniopygia guttata) is low (37.53±0.51%), while there are short areas (about 300 nucleotides in length) with increased GC-content overlapping its exon 4 and exon 17. GC-content in third codon positions (3GC) of those two exons is equal to 88.42 and 80.00%, respectively, while overall 3GC of the coding region is equal to 49.9%. A similar distribution of GC-content has been found in platelet phosphofructokinase genes of other birds from the order Passeriformes. According to the results of phylogenetic analysis, formation of those areas with high G+C started from 91.4 to 47.1 million years ago, since there are no such peaks of GC-content in homologous genes of other birds and reptiles. There are clusters of transcription factor binding sites in those areas with higher GC-content, as well as microRNA precursors conserved in Zebra Finch and Flycatcher genes. According to our hypothesis those intragenic isochores (intrachores) may be consequences of autonomous microRNA precursor transcription at certain period(s) of embryogenesis and gametogenesis, when the platelet phosphofructokinase gene itself is not expressed. Transcription-associated mutational pressure existing during those periods may cause the increase in rates of AT to GC mutations in those genes which are transcribed.
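The 3GC measure used in this abstract (GC-content at third codon positions) can be sketched in Python as below. The function name `gc3` is hypothetical and this is not the authors' code; the sketch assumes the input is a coding sequence that starts in frame, and it silently ignores a trailing partial codon.

```python
def gc3(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon, if any
    thirds = cds[2:usable:3]          # bases at positions 3, 6, 9, ...
    if not thirds:
        raise ValueError("sequence is shorter than one codon")
    return (thirds.count("G") + thirds.count("C")) / len(thirds)
```

For the toy sequence `ATGGCCAAA`, the third positions are G, C and A, so two of the three are G or C.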
Affiliation(s)
- Sergey Vladimirovich Lelevich
  - Department of Clinical Laboratory Diagnostics, Allergology and Immunology, Grodno State Medical University, Gorkogo 80, Grodno, Belarus