1
Bohlin J, Pettersson JHO. Compression rates of microbial genomes are associated with genome size and base composition. Genomics Inform 2024; 22:16. [PMID: 39390533 PMCID: PMC11468749 DOI: 10.1186/s44342-024-00018-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 09/10/2024] [Indexed: 10/12/2024] Open
Abstract
BACKGROUND To what degree a string of symbols can be compressed reveals important details about its complexity. For instance, strings that are not compressible are random and carry a low information potential while the opposite is true for highly compressible strings. We explore to what extent microbial genomes are amenable to compression as they vary considerably both with respect to size and base composition. For instance, microbial genome sizes vary from less than 100,000 base pairs in symbionts to more than 10 million in soil-dwellers. Genomic base composition, often summarized as genomic AT or GC content due to the similar frequencies of adenine and thymine on one hand and cytosine and guanine on the other, also varies substantially; the most extreme microbes can have genomes with AT content below 25% or above 85%. Base composition determines the frequency of DNA words, consisting of multiple nucleotides or oligonucleotides, and may therefore also influence compressibility. Using 4,713 RefSeq genomes, we examined the association between compressibility, using both a DNA-based (MBGC) and a general-purpose (ZPAQ) compression algorithm, and genome size, AT content as well as genomic oligonucleotide usage variance (OUV) using generalized additive models. RESULTS We find that genome size (p < 0.001) and OUV (p < 0.001) are both strongly associated with genome redundancy for both types of file compressor. The DNA-based MBGC compressor improved compression by approximately 3% on average relative to ZPAQ. Moreover, MBGC detected a significant (p < 0.001) compression ratio difference between AT-poor and AT-rich genomes which was not detected with ZPAQ. CONCLUSION As lack of compressibility is equivalent to randomness, our findings suggest that smaller and AT-rich genomes may have accumulated more random mutations on average than larger and AT-poor genomes which, in turn, were significantly more redundant. Moreover, we find that OUV is a strong proxy for genome compressibility in microbial genomes. The ZPAQ compressor was found to agree with the MBGC compressor, albeit with a poorer performance, except for the compressibility of AT-rich and AT-poor/GC-rich genomes.
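The two quantities this abstract relates, AT content and compression ratio, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: `zlib` from the standard library stands in for the MBGC and ZPAQ compressors named above, and the function names `at_content` and `compression_ratio` are hypothetical.

```python
import zlib


def at_content(seq: str) -> float:
    """Fraction of A and T among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("A") + seq.count("T")) / acgt


def compression_ratio(seq: str, level: int = 9) -> float:
    """Compressed size divided by raw size; lower means more redundant.

    Assumption: zlib (DEFLATE) is only a stand-in for a general-purpose
    compressor; it is neither ZPAQ nor MBGC.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("empty sequence")
    return len(zlib.compress(raw, level)) / len(raw)
```

A highly repetitive sequence such as `"ACGT" * 1000` gives a ratio close to zero, whereas a sequence with little internal repetition stays much closer to the roughly 2 bits per base that four equiprobable symbols require.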
Affiliation(s)
- Jon Bohlin
- Norwegian Institute of Public Health, Domain for Infection Control, Section for Modeling and Bioinformatics, Oslo, Norway.
- John H-O Pettersson
- Zoonosis Science Center, Clinical Microbiology, Department of Medical Sciences, University of Uppsala, 751 85, Uppsala, Sweden
- Clinical Microbiology and Hospital Hygiene, Uppsala University Hospital, 751 85, Uppsala, Sweden
- Department of Microbiology and Immunology, Peter Doherty Institute for Infection and Immunity, University of Melbourne, Melbourne, VIC, Australia
2
de la Fuente R, Díaz-Villanueva W, Arnau V, Moya A. Genomic Signature in Evolutionary Biology: A Review. Biology 2023; 12:322. [PMID: 36829597 PMCID: PMC9953303 DOI: 10.3390/biology12020322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2022] [Revised: 02/11/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
Organisms are unique physical entities in which information is stored and continuously processed. The digital nature of DNA sequences enables the construction of a dynamic information reservoir. However, the distinction between the hardware and software components in the information flow is crucial to identify the mechanisms generating specific genomic signatures. In this work, we perform a bibliometric analysis to identify the different purposes of looking for particular patterns in DNA sequences associated with a given phenotype. This study has enabled us to make a conceptual breakdown of the genomic signature and differentiate the leading applications. On the one hand, it refers to gene expression profiling associated with a biological function, which may be shared across taxa. This signature is the focus of study in precision medicine. On the other hand, it also refers to characteristic patterns in species-specific DNA sequences. This interpretation plays a key role in comparative genomics, identifying evolutionary relationships. Looking at the relevant studies in our bibliographic database, we highlight the main factors causing heterogeneities in genome composition and how they can be quantified. All these findings lead us to reformulate some questions relevant to evolutionary biology.
Affiliation(s)
- Rebeca de la Fuente
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Wladimiro Díaz-Villanueva
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Vicente Arnau
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Andrés Moya
- Institute of Integrative Systems Biology (I2Sysbio), University of Valencia and Spanish Research Council (CSIC), 46980 Valencia, Spain
- Foundation for the Promotion of Sanitary and Biomedical Research of the Valencian Community (FISABIO), 46020 Valencia, Spain
- CIBER in Epidemiology and Public Health (CIBEResp), 28029 Madrid, Spain
3
Bohlin J. A simple stochastic model describing the evolution of genomic GC content in asexually reproducing organisms. Sci Rep 2022; 12:18569. [PMID: 36329129 PMCID: PMC9631610 DOI: 10.1038/s41598-022-21709-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Accepted: 09/30/2022] [Indexed: 11/06/2022] Open
Abstract
A genome's nucleotide composition can usually be summarized with (G)uanine + (C)ytosine (GC) or (A)denine + (T)hymine (AT) frequencies as GC% = 100% - AT%. Genomic AT/GC content has been linked to environment and selective processes in asexually reproducing organisms. A model is presented relating the evolution of genomic GC content over time to AT → GC and GC → AT mutation rates. By employing Itô calculus it is shown that if mutation rates are subject to random perturbations that can vary over time, several implications follow. In particular, an extra Brownian motion term appears influencing genomic nucleotide variability; the greater the random perturbations, the greater the genomic nucleotide variability. This can have several interpretations depending on the context. For instance, reducing the influence of the random perturbations on the AT/GC mutation rates and thus genomic nucleotide variability, to limit fitness-decreasing and deleterious mutations, will likely suggest channeling of resources. On the other hand, increased genomic nucleotide diversity may be beneficial in variable environments. In asexually reproducing organisms, the Brownian motion term can be considered to be inversely reflective of the selective pressures an organism is subjected to at the molecular level. The presented model is a generalization of a previous model, limited to microbial symbionts, to all asexually reproducing, non-recombining organisms. Last, a connection between the presented model and the classical Luria-Delbrück mutation model is presented in an Itô calculus setting.
Affiliation(s)
- Jon Bohlin
- Division of Infection Control, Department of Methods Development and Analysis, Norwegian Institute of Public Health, Oslo, Norway; Centre for Fertility and Health, Norwegian Institute of Public Health, P.O. Box 4404, Lovisenberggata 8, 0403 Oslo, Norway
4
Dlamini GS, Muller SJ, Meraba RL, Young RA, Mashiyane J, Chiwewe T, Mapiye DS. Classification of COVID-19 and Other Pathogenic Sequences: A Dinucleotide Frequency and Machine Learning Approach. IEEE Access 2020; 8:195263-195273. [PMID: 34976561 PMCID: PMC8675546 DOI: 10.1109/access.2020.3031387] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 10/02/2020] [Accepted: 10/04/2020] [Indexed: 05/08/2023]
Abstract
The world is grappling with the COVID-19 pandemic caused by the 2019 novel SARS-CoV-2. To better understand this novel virus and its relationship with other pathogens, new methods for analyzing the genome are required. In this study, intrinsic dinucleotide genomic signatures were analyzed for whole genome sequence data of eight pathogenic species, including SARS-CoV-2. The genome sequences were transformed into dinucleotide relative frequencies and classified using the extreme gradient boosting (XGBoost) model. The classification models were trained to a) distinguish between the sequences of all eight species and b) distinguish between sequences of SARS-CoV-2 that originate from different geographic regions. Our method attained 100% in all performance metrics and for all tasks in the eight-species classification problem. Moreover, the models achieved 67% balanced accuracy for the task of classifying the SARS-CoV-2 sequences into the six continental regions and achieved 86% balanced accuracy for the task of classifying SARS-CoV-2 samples as either originating from Asia or not. Analysis of the dinucleotide genomic profiles of the eight species revealed a similarity between the SARS-CoV-2 and MERS-CoV viral sequences. Further analysis of SARS-CoV-2 viral sequences from the six continents revealed that samples from Oceania had the highest frequency of TT dinucleotides as well as the lowest CG frequency compared to the other continents. The dinucleotide signatures of AC, AG, CA, CT, GA, GT, TC, and TG were well conserved across most genomes, while the frequencies of other dinucleotide signatures varied considerably. Altogether, the results from this study demonstrate the utility of dinucleotide relative frequencies for discriminating and identifying similar species.
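The transformation of a sequence into a 16-element dinucleotide frequency vector can be sketched as follows. This is only an illustration under assumptions: the abstract does not state the exact normalization, so overlapping dinucleotide counts divided by their total are assumed, and the function name `dinucleotide_frequencies` is hypothetical.

```python
from collections import Counter
from itertools import product

# The 16 dinucleotides in a fixed order, so every genome maps to the same vector layout.
DINUCLEOTIDES = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_frequencies(seq: str) -> dict[str, float]:
    """Relative frequency of each overlapping dinucleotide.

    Assumption: pairs containing ambiguous bases (e.g. N) are skipped
    and do not contribute to the total.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts[d] for d in DINUCLEOTIDES)
    if total == 0:
        raise ValueError("sequence contains no unambiguous dinucleotides")
    return {d: counts[d] / total for d in DINUCLEOTIDES}
```

The resulting values, taken in the order of `DINUCLEOTIDES`, form a fixed-length feature vector of the kind a gradient-boosted classifier could be trained on.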
5
Barros-Carvalho GA, Van Sluys MA, Lopes FM. An Efficient Approach to Explore and Discriminate Anomalous Regions in Bacterial Genomes Based on Maximum Entropy. J Comput Biol 2017; 24:1125-1133. [PMID: 28570142 DOI: 10.1089/cmb.2017.0042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022] Open
Abstract
Recently, there has been an increase in the number of whole bacterial genomes sequenced, mainly due to advances in next-generation sequencing technologies. In the face of this, there is a need to provide new analytical alternatives that can keep pace with this advance. Given our current knowledge about the genomic plasticity of bacteria and that those genomic regions can uncover important features about this microorganism, our goal was to develop a fast methodology based on maximum entropy (ME) to guide the researcher to regions that could be prioritized during the analysis. This methodology was compared with other available methods. In addition, ME was applied to eight different bacterial genera. The methodology consists of two main steps: processing the nucleotide sequence and ME calculation. We applied ME to Xanthomonas axonopodis pv. citri 306 (XAC) and Xanthomonas campestris pv. campestris ATCC 33913 (XCC), both of which have their anomalous regions well documented. We then compared our results against those from Alien Hunter, HGT-DB, Islander, IslandPath, and SIGI-HMM. ME was shown to be superior in terms of efficiency and analysis duration. In addition, ME needs only the genome sequence in FASTA format as input. The proposed strategy based on ME is able to help in bacterial genome exploration. This is a simple and fast strategy for individual genomes in comparison with other available methods, without relying on previous annotation and alignments. This methodology can also be a new option in the early stages of analysis of newly sequenced bacterial genomes.
Affiliation(s)
- Gesiele Almeida Barros-Carvalho
- Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil; GaTE Lab, Department of Botany, Institute of Bioscience, University of São Paulo, São Paulo, Brazil
- Marie-Anne Van Sluys
- GaTE Lab, Department of Botany, Institute of Bioscience, University of São Paulo, São Paulo, Brazil
6
Wu Q, Liu T, Zhu L, Huang H, Jiang L. Insights from the complete genome sequence of Clostridium tyrobutyricum provide a platform for biotechnological and industrial applications. J Ind Microbiol Biotechnol 2017; 44:1245-1260. [PMID: 28536840 DOI: 10.1007/s10295-017-1956-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 12/28/2016] [Accepted: 05/18/2017] [Indexed: 11/26/2022]
Abstract
Genetic research enables the evolution of novel biochemical reactions for the production of valuable chemicals from environmentally-friendly raw materials. However, the choice of appropriate microorganisms to support these reactions, which must have strong robustness and be capable of a significant product output, is a major difficulty. In the present study, the complete genome of the Clostridium tyrobutyricum strain CCTCC W428, a hydrogen- and butyric acid-producing bacterium with increased oxidative tolerance was analyzed. A total length of 3,011,209 bp of the C. tyrobutyricum genome with a GC content of 31.04% was assembled, and 3038 genes were discovered. Furthermore, a comparative clustering of proteins from C. tyrobutyricum CCTCC W428, C. acetobutylicum ATCC 824, and C. butyricum KNU-L09 was conducted. The results of genomic analysis indicate that butyric acid is produced by CCTCC W428 from butyryl-CoA through acetate reassimilation via CoA transferase, instead of the well-established phosphotransbutyrylase-butyrate kinase pathway. In addition, we identified ten proteins putatively involved in hydrogen production and 21 proteins associated with CRISPR systems, together with 358 ORFs related to ABC transporters and transcriptional regulators. Enzymes, such as oxidoreductases, HNH endonucleases, and catalase, were also found in this species. The genome sequence illustrates that C. tyrobutyricum has several desirable traits, and is expected to be suitable as a platform for the high-level production of bulk chemicals as well as bioenergy.
Affiliation(s)
- Qian Wu
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Nanjing, 210019, People's Republic of China
- Jiangsu National Synergetic Innovation Center for Advanced Materials, Nanjing Tech University, Nanjing, 210019, People's Republic of China
- Tingting Liu
- College of Biotechnology and Pharmaceutical Engineering, Nanjing Tech University, Nanjing, 210019, People's Republic of China
- Jiangsu National Synergetic Innovation Center for Advanced Materials, Nanjing Tech University, Nanjing, 210019, People's Republic of China
- Liying Zhu
- College of Chemical and Molecular Engineering, Nanjing Tech University, Nanjing, 210019, People's Republic of China
- He Huang
- College of Pharmaceutical Sciences, Nanjing Tech University, Nanjing, 210009, People's Republic of China
- Ling Jiang
- Jiangsu National Synergetic Innovation Center for Advanced Materials, Nanjing Tech University, Nanjing, 210019, People's Republic of China.
- College of Food Science and Light Industry, Nanjing Tech University, Nanjing, 210009, People's Republic of China.
7
Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L. The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 2017; 18:151. [PMID: 28187704 PMCID: PMC5303225 DOI: 10.1186/s12864-017-3543-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 5.6] [Received: 12/09/2016] [Accepted: 02/02/2017] [Indexed: 12/02/2022] Open
Abstract
Background The core genome consists of genes shared by the vast majority of a species and is therefore assumed to have been subjected to substantially stronger purifying selection than the more mobile elements of the genome, also known as the accessory genome. Here we examine intragenic base composition differences in core genomes and corresponding accessory genomes in 36 species, represented by the genomes of 731 bacterial strains, to assess the impact of selective forces on base composition in microbes. We also explore, in turn, how these results compare with findings for whole genome intragenic regions. Results We found that GC content in coding regions is significantly higher in core genomes than accessory genomes and whole genomes. Likewise, GC content variation within coding regions was significantly lower in core genomes than in accessory genomes and whole genomes. Relative entropy in coding regions, measured as the difference between observed and expected trinucleotide frequencies estimated from mononucleotide frequencies, was significantly higher in the core genomes than in accessory and whole genomes. Relative entropy was positively associated with coding region GC content within the accessory genomes, but not within the corresponding coding regions of core or whole genomes. Conclusion The higher intragenic GC content and relative entropy, as well as the lower GC content variation, observed in the core genomes are most likely associated with selective constraints. It is unclear whether the positive association between GC content and relative entropy in the more mobile accessory genomes constitutes signatures of selection or selectively neutral processes.
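The relative entropy measure described in the Results can be sketched as a Kullback-Leibler divergence between observed trinucleotide frequencies and those expected from mononucleotide frequencies alone. This is a hedged illustration, not the authors' code: the use of overlapping trinucleotides, the base-2 logarithm, and the function name `trinucleotide_relative_entropy` are all assumptions.

```python
import math
from collections import Counter

_BASES = set("ACGT")


def trinucleotide_relative_entropy(seq: str) -> float:
    """KL divergence (in bits) of observed vs. expected trinucleotide frequencies.

    Expected frequencies are products of the mononucleotide frequencies,
    i.e. what an order-0 model of the same base composition would give.
    """
    seq = seq.upper()
    mono = Counter(base for base in seq if base in _BASES)
    n_mono = sum(mono.values())
    # Overlapping trinucleotides; words with ambiguous bases are skipped.
    tri = Counter(
        seq[i:i + 3]
        for i in range(len(seq) - 2)
        if set(seq[i:i + 3]) <= _BASES
    )
    n_tri = sum(tri.values())
    if n_tri == 0:
        raise ValueError("sequence contains no unambiguous trinucleotides")
    entropy = 0.0
    for word, count in tri.items():
        observed = count / n_tri
        expected = math.prod(mono[base] / n_mono for base in word)
        entropy += observed * math.log2(observed / expected)
    return entropy
```

A value near zero means the trinucleotide usage is what base composition alone predicts; larger values indicate more structure beyond base composition.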
Affiliation(s)
- Jon Bohlin
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway.
- Vegard Eldholm
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- John H O Pettersson
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Ola Brynildsrud
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Lars Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, 1430, Ås, Norway
8
Campbell-Sills H, El Khoury M, Favier M, Romano A, Biasioli F, Spano G, Sherman DJ, Bouchez O, Coton E, Coton M, Okada S, Tanaka N, Dols-Lafargue M, Lucas PM. Phylogenomic Analysis of Oenococcus oeni Reveals Specific Domestication of Strains to Cider and Wines. Genome Biol Evol 2015; 7:1506-18. [PMID: 25977455 PMCID: PMC4494047 DOI: 10.1093/gbe/evv084] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Indexed: 11/14/2022] Open
Abstract
Oenococcus oeni is a species of lactic acid bacteria encountered particularly in wine, where it carries out malolactic fermentation. Molecular typing methods have previously revealed that the species is made up of several genetic groups of strains, some being specific to certain types of wines, ciders or regions. Here, we describe 36 recently released O. oeni genomes and the phylogenomic analysis of these 36 plus 14 previously reported genomes. We also report three genome sequences of the sister species Oenococcus kitaharae that were used for phylogenomic reconstructions. Phylogenomic and population structure analyses revealed that the 50 O. oeni genomes delineate two major groups of 12 and 37 strains, respectively, named A and B, plus a putative group C, consisting of a single strain. A study on the orthologs and single nucleotide polymorphism contents of the genetic groups revealed that the domestication of some strains to products such as cider, wine, or champagne is reflected at the genetic level. While group A strains proved to be predominant in wine and to form subgroups adapted to specific types of wine such as champagne, group B strains were found in wine and cider. The strain from putative group C was isolated from cider and was genetically closer to group B strains. The results suggest that ancestral O. oeni strains were adapted to low-ethanol environments such as overripe fruits, and that they were domesticated to cider and wine, with group A strains being naturally selected in a process of further domestication to specific wines such as champagne.
Affiliation(s)
- Hugo Campbell-Sills
- Univ. Bordeaux, ISVV, EA 4577 Œnologie, Villenave d'Ornon, France; Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
- Marion Favier
- BioLaffort, Research Subsidiary of the Laffort group, Bordeaux, France
- Andrea Romano
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
- Franco Biasioli
- Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
- Giuseppe Spano
- Department of Agriculture, Food and Environment Sciences, University of Foggia, Foggia, Italy
- David J Sherman
- INRIA, Univ. Bordeaux, Project team MAGNOME, Talence, France; CNRS, Univ. Bordeaux, UMR 5800 LaBRI, Talence, France
- Olivier Bouchez
- INRA, UMR444, laboratoire de Génétique Cellulaire, Castanet-Tolosan, France; GeT-PlaGe, Genotoul, INRA Auzeville, Castanet-Tolosan, France
- Emmanuel Coton
- Université de Brest, EA 3882, Laboratoire Universitaire de Biodiversité et Ecologie Microbienne, ESIAB, Technopôle Brest-Iroise, Plouzané, France
- Monika Coton
- Université de Brest, EA 3882, Laboratoire Universitaire de Biodiversité et Ecologie Microbienne, ESIAB, Technopôle Brest-Iroise, Plouzané, France
- Sanae Okada
- NODAI Culture Collection Center, Tokyo University of Agriculture, Japan
- Naoto Tanaka
- NODAI Culture Collection Center, Tokyo University of Agriculture, Japan
- Marguerite Dols-Lafargue
- Univ. Bordeaux, ISVV, EA 4577 Œnologie, Villenave d'Ornon, France; Bordeaux INP, ISVV, EA 4577 Œnologie, Villenave d'Ornon, France
- Patrick M Lucas
- Univ. Bordeaux, ISVV, EA 4577 Œnologie, Villenave d'Ornon, France
9
Thompson CC, Chimetto L, Edwards RA, Swings J, Stackebrandt E, Thompson FL. Microbial genomic taxonomy. BMC Genomics 2013; 14:913. [PMID: 24365132 PMCID: PMC3879651 DOI: 10.1186/1471-2164-14-913] [Citation(s) in RCA: 248] [Impact Index Per Article: 20.7] [Received: 09/06/2013] [Accepted: 12/18/2013] [Indexed: 01/23/2023] Open
Abstract
A need for a genomic species definition is emerging from several independent studies worldwide. In this commentary paper, we discuss recent studies on the genomic taxonomy of diverse microbial groups and a unified species definition based on genomics. Accordingly, strains from the same microbial species share >95% Average Amino Acid Identity (AAI) and Average Nucleotide Identity (ANI), >95% identity based on multiple alignment genes, <10 in Karlin genomic signature, and >70% in silico Genome-to-Genome Hybridization similarity (GGDH). Species of the same genus will form monophyletic groups on the basis of 16S rRNA gene sequences, Multilocus Sequence Analysis (MLSA) and supertree analysis. In addition to the established requirements for species descriptions, we propose that new taxa descriptions should also include at least a draft genome sequence of the type strain in order to obtain a clear outlook on the genomic landscape of the novel microbe. The application of the new genomic species definition put forward here will allow researchers to use genome sequences to define simultaneously coherent phenotypic and genomic groups.
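The Karlin genomic signature criterion can be illustrated with a short sketch of the dinucleotide relative abundance distance. Several points are assumptions rather than statements from the abstract: that the "<10" threshold refers to the mean absolute difference in dinucleotide odds ratios scaled by 1000, that frequencies are pooled over both strands, and the function names `dinucleotide_odds` and `karlin_distance`, which are hypothetical.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_WORDS = ["".join(pair) for pair in product("ACGT", repeat=2)]


def dinucleotide_odds(seq: str) -> dict[str, float]:
    """Odds ratio f(XY) / (f(X) * f(Y)) with counts pooled over both strands."""
    forward = seq.upper()
    reverse = forward.translate(_COMPLEMENT)[::-1]  # reverse complement
    mono, di = Counter(), Counter()
    for strand in (forward, reverse):
        mono.update(base for base in strand if base in "ACGT")
        di.update(strand[i:i + 2] for i in range(len(strand) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di[w] for w in _WORDS)
    if n_di == 0 or any(mono[base] == 0 for base in "ACGT"):
        raise ValueError("sequence must contain all four bases")
    return {
        w: (di[w] / n_di) / ((mono[w[0]] / n_mono) * (mono[w[1]] / n_mono))
        for w in _WORDS
    }


def karlin_distance(seq_a: str, seq_b: str) -> float:
    """Mean absolute odds-ratio difference over 16 dinucleotides, times 1000 (assumed scaling)."""
    odds_a, odds_b = dinucleotide_odds(seq_a), dinucleotide_odds(seq_b)
    return 1000 * sum(abs(odds_a[w] - odds_b[w]) for w in _WORDS) / 16
```

Under these assumptions, two genomes from the same species would be expected to give a `karlin_distance` below 10, while identical sequences give exactly 0.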
Affiliation(s)
- Cristiane C Thompson
- Institute of Biology, Federal University of Rio de Janeiro (UFRJ), Rio de Janeiro, Brazil.
10
Bohlin J, Brynildsrud O, Vesth T, Skjerve E, Ussery DW. Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS One 2013; 8:e69878. [PMID: 23922837 PMCID: PMC3724673 DOI: 10.1371/journal.pone.0069878] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.9] [Received: 02/19/2013] [Accepted: 06/14/2013] [Indexed: 11/18/2022] Open
Abstract
INTRODUCTION Genomic base composition ranges from less than 25% AT to more than 85% AT in prokaryotes. Since only a small fraction of prokaryotic genomes is not protein coding even a minor change in genomic base composition will induce profound protein changes. We examined how amino acid and codon frequencies were distributed in over 2000 microbial genomes and how these distributions were affected by base compositional changes. In addition, we wanted to know how genome-wide amino acid usage was biased in the different genomes and how changes to base composition and mutations affected this bias. To carry this out, we used a Generalized Additive Mixed-effects Model (GAMM) to explore non-linear associations and strong data dependences in closely related microbes; principal component analysis (PCA) was used to examine genomic amino acid- and codon frequencies, while the concept of relative entropy was used to analyze genomic mutation rates. RESULTS We found that genomic amino acid frequencies carried a stronger phylogenetic signal than codon frequencies, but that this signal was weak compared to that of genomic %AT. Further, in contrast to codon usage bias (CUB), amino acid usage bias (AAUB) was differently distributed in AT- and GC-rich genomes in the sense that AT-rich genomes did not prefer specific amino acids over others to the same extent as GC-rich genomes. AAUB was also associated with relative entropy; genomes with low AAUB contained more random mutations as a consequence of relaxed purifying selection than genomes with higher AAUB. CONCLUSION Genomic base composition has a substantial effect on both amino acid- and codon frequencies in bacterial genomes. While phylogeny influenced amino acid usage more in GC-rich genomes, AT-content was driving amino acid usage in AT-rich genomes. We found the GAMM model to be an excellent tool to analyze the genomic data used in this study.
Affiliation(s)
- Jon Bohlin
- Centre for Epidemiology and Biostatistics, Department of Food Safety and Infection Biology, Norwegian School of Veterinary Science, Oslo, Norway.
11
Logares R, Haverkamp TH, Kumar S, Lanzén A, Nederbragt AJ, Quince C, Kauserud H. Environmental microbiology through the lens of high-throughput DNA sequencing: Synopsis of current platforms and bioinformatics approaches. J Microbiol Methods 2012; 91:106-13. [DOI: 10.1016/j.mimet.2012.07.017] [Citation(s) in RCA: 69] [Impact Index Per Article: 5.3] [Received: 06/10/2012] [Revised: 07/19/2012] [Accepted: 07/23/2012] [Indexed: 10/28/2022]
12
Bohlin J, van Passel MWJ, Snipen L, Kristoffersen AB, Ussery D, Hardy SP. Relative entropy differences in bacterial chromosomes, plasmids, phages and genomic islands. BMC Genomics 2012; 13:66. [PMID: 22325062 PMCID: PMC3305612 DOI: 10.1186/1471-2164-13-66] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 11/15/2011] [Accepted: 02/10/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We sought to assess whether the concept of relative entropy (information capacity), could aid our understanding of the process of horizontal gene transfer in microbes. We analyzed the differences in information capacity between prokaryotic chromosomes, genomic islands (GI), phages, and plasmids. Relative entropy was estimated using the Kullback-Leibler measure. RESULTS Relative entropy was highest in bacterial chromosomes and had the sequence chromosomes > GI > phage > plasmid. There was an association between relative entropy and AT content in chromosomes, phages, plasmids and GIs with the strongest association being in phages. Relative entropy was also found to be lower in the obligate intracellular Mycobacterium leprae than in the related M. tuberculosis when measured on a shared set of highly conserved genes. CONCLUSIONS We argue that relative entropy differences reflect how plasmids, phages and GIs interact with microbial host chromosomes and that all these biological entities are, or have been, subjected to different selective pressures. The rate at which amelioration of horizontally acquired DNA occurs within the chromosome is likely to account for the small differences between chromosomes and stably incorporated GIs compared to the transient or independent replicons such as phages and plasmids.
Affiliation(s)
- Jon Bohlin
- Norwegian School of Veterinary Science, EpiCentre, Department of Food Safety and Infection biology, Ullevålsveien 72, Oslo, Norway.
13
Roos TE, van Passel MWJ. A quantitative account of genomic island acquisitions in prokaryotes. BMC Genomics 2011; 12:427. [PMID: 21864345 PMCID: PMC3176501 DOI: 10.1186/1471-2164-12-427] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 06/12/2011] [Accepted: 08/24/2011] [Indexed: 12/15/2022] Open
Abstract
Background Microbial genomes do not merely evolve through the slow accumulation of mutations, but also, and often more dramatically, by taking up new DNA in a process called horizontal gene transfer. These innovation leaps in the acquisition of new traits can take place via the introgression of single genes, but also through the acquisition of large gene clusters, which are termed Genomic Islands. Since only a small proportion of all the DNA diversity has been sequenced, it can be hard to find the appropriate donors for acquired genes via sequence alignments from databases. In contrast, relative oligonucleotide frequencies represent a remarkably stable genomic signature in prokaryotes, which facilitates compositional comparisons as an alignment-free alternative for phylogenetic relatedness. In this project, we test whether Genomic Islands identified in individual bacterial genomes have a similar genomic signature, in terms of relative dinucleotide frequencies, and can therefore be expected to originate from a common donor species. Results When multiple Genomic Islands are present within a single genome, we find that up to 28% of these are compositionally very similar to each other, indicative of frequent recurring acquisitions from the same donor to the same acceptor. Conclusions This represents the first quantitative assessment of common directional transfer events in prokaryotic evolutionary history. We suggest that many of the resident Genomic Islands per prokaryotic genome originated from the same source, which may have implications with respect to their regulatory interactions, and for the elucidation of the common origins of these acquired gene clusters.
Affiliation(s)
- Tom E Roos
- Genomics Coordination Center, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
|
14
|
Lightfield J, Fram NR, Ely B. Across bacterial phyla, distantly-related genomes with similar genomic GC content have similar patterns of amino acid usage. PLoS One 2011; 6:e17677. [PMID: 21423704 PMCID: PMC3053387 DOI: 10.1371/journal.pone.0017677] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.1] [Received: 11/26/2010] [Accepted: 02/07/2011] [Indexed: 11/24/2022] Open
Abstract
The GC content of bacterial genomes ranges from 16% to 75% and wide ranges of genomic GC content are observed within many bacterial phyla, including both Gram negative and Gram positive phyla. Thus, divergent genomic GC content has evolved repeatedly in widely separated bacterial taxa. Since genomic GC content influences codon usage, we examined codon usage patterns and predicted protein amino acid content as a function of genomic GC content within eight different phyla or classes of bacteria. We found that similar patterns of codon usage and protein amino acid content have evolved independently in all eight groups of bacteria. For example, in each group, use of amino acids encoded by GC-rich codons increased by approximately 1% for each 10% increase in genomic GC content, while the use of amino acids encoded by AT-rich codons decreased by a similar amount. This consistency within every phylum and class studied led us to conclude that GC content appears to be the primary determinant of the codon and amino acid usage patterns observed in bacterial genomes. These results also indicate that selection for translational efficiency of highly expressed genes is constrained by the genomic parameters associated with the GC content of the host genome.
Affiliation(s)
- John Lightfield
- Department of Biological Sciences, University of South Carolina, Columbia, South Carolina, United States of America
- Noah R. Fram
- Department of Biological Sciences, University of South Carolina, Columbia, South Carolina, United States of America
- Bert Ely
- Department of Biological Sciences, University of South Carolina, Columbia, South Carolina, United States of America
|
15
|
Elser JJ, Acquisti C, Kumar S. Stoichiogenomics: the evolutionary ecology of macromolecular elemental composition. Trends Ecol Evol 2010; 26:38-44. [PMID: 21093095 DOI: 10.1016/j.tree.2010.10.006] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.0] [Received: 05/18/2010] [Revised: 10/21/2010] [Accepted: 10/22/2010] [Indexed: 11/18/2022]
Abstract
The new field of 'stoichiogenomics' integrates evolution, ecology and bioinformatics to reveal surprising patterns of the differential usage of key elements [e.g. nitrogen (N)] in proteins and nucleic acids. Because the canonical amino acids as well as nucleotides differ in element counts, natural selection owing to limited element supplies might bias monomer usage to reduce element costs. For example, proteins that respond to N limitation in microbes use a lower proportion of N-rich amino acids, whereas proteome- and transcriptome-wide element contents differ significantly for plants as compared with animals, probably because of the differential severity of element limitations. In this review, we show that with these findings, new directions for future investigations are emerging, particularly via the increasing availability of diverse metagenomic and metatranscriptomic data sets.
Affiliation(s)
- James J Elser
- School of Life Sciences, Arizona State University, Tempe, AZ 85287-4501, USA
|
16
|
Tse H, Cai JJ, Tsoi HW, Lam EP, Yuen KY. Natural selection retains overrepresented out-of-frame stop codons against frameshift peptides in prokaryotes. BMC Genomics 2010; 11:491. [PMID: 20828396 PMCID: PMC2996987 DOI: 10.1186/1471-2164-11-491] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Received: 03/13/2010] [Accepted: 09/09/2010] [Indexed: 12/03/2022] Open
Abstract
Background Out-of-frame stop codons (OSCs) occur naturally in coding sequences of all organisms, providing a mechanism of early termination of translation in incorrect reading frame so that the metabolic cost associated with frameshift events can be reduced. Given such a functional significance, we expect statistically overrepresented OSCs in coding sequences as a result of a widespread selection. Accordingly, we examined available prokaryotic genomes to look for evidence of this selection. Results The complete genome sequences of 990 prokaryotes were obtained from NCBI GenBank. We found that low G+C content coding sequences contain significantly more OSCs and G+C content at specific codon positions were the principal determinants of OSC usage bias in the different reading frames. To investigate if there is overrepresentation of OSCs, we modeled the trinucleotide and hexanucleotide biases of the coding sequences using Markov models, and calculated the expected OSC frequencies for each organism using a Monte Carlo approach. More than 93% of 342 phylogenetically representative prokaryotic genomes contain excess OSCs. Interestingly the degree of OSC overrepresentation correlates positively with G+C content, which may represent a compensatory mechanism for the negative correlation of OSC frequency with G+C content. We extended the analysis using additional compositional bias models and showed that lower-order bias like codon usage and dipeptide bias could not explain the OSC overrepresentation. The degree of OSC overrepresentation was found to correlate negatively with the optimal growth temperature of the organism after correcting for the G+C% and AT skew of the coding sequence. Conclusions The present study uses approaches with statistical rigor to show that OSC overrepresentation is a widespread phenomenon among prokaryotes. 
Our results support the hypothesis that OSCs carry functional significance and have been selected in the course of genome evolution to act against unintended frameshift occurrences. Some results also hint that OSC overrepresentation is a compensatory mechanism that makes up for the decrease in OSCs in high G+C organisms, thus revealing the interplay between two different determinants of OSC frequency.
Affiliation(s)
- Herman Tse
- Carol Yu Centre for Infection, Department of Microbiology, The University of Hong Kong, Hong Kong, China
|
17
|
Bohlin J, Snipen L, Cloeckaert A, Lagesen K, Ussery D, Kristoffersen AB, Godfroid J. Genomic comparisons of Brucella spp. and closely related bacteria using base compositional and proteome based methods. BMC Evol Biol 2010; 10:249. [PMID: 20707916 PMCID: PMC2928237 DOI: 10.1186/1471-2148-10-249] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Received: 03/16/2010] [Accepted: 08/13/2010] [Indexed: 11/30/2022] Open
Abstract
Background Classification of bacteria within the genus Brucella has been difficult due in part to considerable genomic homogeneity between the different species and biovars, in spite of clear differences in phenotypes. Therefore, many different methods have been used to assess Brucella taxonomy. In the current work, we examine 32 sequenced genomes from genus Brucella representing the six classical species, as well as more recently described species, using bioinformatical methods. Comparisons were made at the level of genomic DNA using oligonucleotide based methods (Markov chain based genomic signatures, genomic codon and amino acid frequencies based comparisons) and proteomes (all-against-all BLAST protein comparisons and pan-genomic analyses). Results We found that the oligonucleotide based methods gave different results compared to that of the proteome based methods. Differences were also found between the oligonucleotide based methods used. Whilst the Markov chain based genomic signatures grouped the different species in genus Brucella according to host preference, the codon and amino acid frequencies based methods reflected small differences between the Brucella species. Only minor differences could be detected between all genera included in this study using the codon and amino acid frequencies based methods. Proteome comparisons were found to be in strong accordance with current Brucella taxonomy indicating a remarkable association between gene gain or loss on one hand and mutations in marker genes on the other. The proteome based methods found greater similarity between Brucella species and Ochrobactrum species than between species within genus Agrobacterium compared to each other. In other words, proteome comparisons of species within genus Agrobacterium were found to be more diverse than proteome comparisons between species in genus Brucella and genus Ochrobactrum. 
Pan-genomic analyses indicated that uptake of DNA from outside genus Brucella appears to be limited. Conclusions While both the proteome based methods and the Markov chain based genomic signatures were able to reflect environmental diversity between the different species and strains of genus Brucella, the genomic codon and amino acid frequencies based comparisons were not found adequate for such comparisons. The proteome comparison based phylogenies of the species in genus Brucella showed a surprising consistency with current Brucella taxonomy.
Affiliation(s)
- Jon Bohlin
- Norwegian School of Veterinary Science, Department of Food Safety and Infection Biology, Epicenter, Ullevålsveien 72, PO Box 8146 Dep, NO-0033 Oslo, Norway.
|