1
|
Aytan-Aktug D, Grigorjev V, Szarvas J, Clausen PTLC, Munk P, Nguyen M, Davis JJ, Aarestrup FM, Lund O. SourceFinder: a Machine-Learning-Based Tool for Identification of Chromosomal, Plasmid, and Bacteriophage Sequences from Assemblies. Microbiol Spectr 2022; 10:e0264122. [PMID: 36377945 PMCID: PMC9769690 DOI: 10.1128/spectrum.02641-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2022] [Accepted: 11/01/2022] [Indexed: 11/16/2022] Open
Abstract
High-throughput genome sequencing technologies enable the investigation of complex genetic interactions, including the horizontal gene transfer of plasmids and bacteriophages. However, identifying these elements from assembled reads remains challenging due to genome sequence plasticity and the difficulty in assembling complete sequences. In this study, we developed a classifier, using random forest, to identify whether sequences originated from bacterial chromosomes, plasmids, or bacteriophages. The classifier was trained on a diverse collection of 23,211 chromosomal, plasmid, and bacteriophage sequences from hundreds of bacterial species. In order to adapt the classifier to incomplete sequences, each complete sequence was subsampled into 5,000 nucleotide fragments and further subdivided into k-mers. This three-class classifier succeeded in identifying chromosomes, plasmids, and bacteriophages using k-mer distributions of complete and partial genome sequences, including simulated metagenomic scaffolds with minimum performance of 0.939 area under the receiver operating characteristic curve (AUC). This classifier, implemented as SourceFinder, has been made available as an online web service to help the community with predicting the chromosomal, plasmid, and bacteriophage sources of assembled bacterial sequence data (https://cge.food.dtu.dk/services/SourceFinder/). IMPORTANCE Extra-chromosomal genes encoding antimicrobial resistance, metal resistance, and virulence provide selective advantages for bacterial survival under stress conditions and pose serious threats to human and animal health. These accessory genes can impact the composition of microbiomes by providing selective advantages to their hosts. Accurately identifying extra-chromosomal elements in genome sequence data are critical for understanding gene dissemination trajectories and taking preventative measures. Therefore, in this study, we developed a random forest classifier for identifying the source of bacterial chromosomal, plasmid, and bacteriophage sequences.
Collapse
Affiliation(s)
- Derya Aytan-Aktug
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Vladislav Grigorjev
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Judit Szarvas
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | | | - Patrick Munk
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Marcus Nguyen
- Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA
- Data Science and Learning Division, Argonne National Laboratory, Argonne, Illinois, USA
- Northwestern Argonne Institute for Science and Engineering, Evanston, Illinois, USA
| | - James J. Davis
- Consortium for Advanced Science and Engineering, University of Chicago, Chicago, Illinois, USA
- Data Science and Learning Division, Argonne National Laboratory, Argonne, Illinois, USA
- Northwestern Argonne Institute for Science and Engineering, Evanston, Illinois, USA
| | - Frank M. Aarestrup
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Ole Lund
- National Food Institute, Technical University of Denmark, Kongens Lyngby, Denmark
| |
Collapse
|
2
|
Evidence of genomic information and structural restrictions of HIV-1 PR and RT gene regions from individuals experiencing antiretroviral virologic failure. INFECTION GENETICS AND EVOLUTION 2019; 78:104134. [PMID: 31837484 DOI: 10.1016/j.meegid.2019.104134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/30/2019] [Revised: 11/28/2019] [Accepted: 12/04/2019] [Indexed: 10/25/2022]
Abstract
OBJECTIVES This study analyzed Protease-PR and Reverse Transcriptase-RT HIV-1 genomic information entropy metrics among patients under antiretroviral virologic failure, according to the numbers of virologic failures or resistance mutations. METHODS For this purpose, we used genomic sequences from PR and RT of HIV-1 from a cohort of chronic patients followed up at São Paulo Hospital. RESULTS Informational entropy proportionally increases with the number of antiretroviral virologic failures in PR and RT (p < .001). Affected regions of PR were related to catalytic and structural functions, such as Fulcrum (K20) Flap (M46) and Cantilever (A71). In RT, this occurred at Fingers (E44) and Palm (K219). Informational entropy increases according to the number of resistance mutations in PR and RT (p < .001). Higher PR entropy was proportional to the resistance mutation numbers in Fulcrum (L10), Active site (L24) Flap (M46), Cantilever (L63) and near Interface (L90). In RT, they related to regions responsible for protein stability such as Fingers (T39) and Palm (L100). CONCLUSIONS The antiretroviral selective pressure affects HIV genomic informational entropy at the PR and RT regions, leading to the emergence of more unstable virions. Mapping the three-dimensional structure in these HIV-1 proteins is relevant to designing new antiretroviral targeting resistant strains.
Collapse
|
3
|
Huang GD, Liu XM, Huang TL, Xia LC. The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer. Synth Syst Biotechnol 2019; 4:150-156. [PMID: 31508512 PMCID: PMC6723412 DOI: 10.1016/j.synbio.2019.08.001] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2019] [Revised: 07/14/2019] [Accepted: 08/05/2019] [Indexed: 12/21/2022] Open
Abstract
Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have been demonstrated in high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics TsumS and Tsum*, which subsample metagenome contigs by their representative regions, and summarize the regional D2S and D2* metrics by their upper bounds. We systematically studied the aggregative statistics’ power at different k-mer size using simulations. Our analysis showed that, in general, the power of TsumS and Tsum* increases with sequencing coverage, and reaches a maximum power >80% at k = 6, with 5% Type-I error and the coverage ratio >0.2x. The statistical power of TsumS and Tsum* was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
Collapse
Affiliation(s)
- Guan-Da Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Xue-Mei Liu
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Tian-Lai Huang
- School of Physics and Optoelectronics, South China University of Technology, Guangzhou, 510640, China
| | - Li-C Xia
- Department of Medicine, Stanford University School of Medicine, Stanford, CA, 94305, USA
| |
Collapse
|
4
|
Krawczyk PS, Lipinski L, Dziembowski A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res 2019; 46:e35. [PMID: 29346586 PMCID: PMC5887522 DOI: 10.1093/nar/gkx1321] [Citation(s) in RCA: 296] [Impact Index Per Article: 59.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Accepted: 12/28/2017] [Indexed: 12/14/2022] Open
Abstract
Plasmids are mobile genetics elements that play an important role in the environmental adaptation of microorganisms. Although plasmids are usually analyzed in cultured microorganisms, there is a need for methods that allow for the analysis of pools of plasmids (plasmidomes) in environmental samples. To that end, several molecular biology and bioinformatics methods have been developed; however, they are limited to environments with low diversity and cannot recover large plasmids. Here, we present PlasFlow, a novel tool based on genomic signatures that employs a neural network approach for identification of bacterial plasmid sequences in environmental samples. PlasFlow can recover plasmid sequences from assembled metagenomes without any prior knowledge of the taxonomical or functional composition of samples with an accuracy up to 96%. It can also recover sequences of both circular and linear plasmids and can perform initial taxonomical classification of sequences. Compared to other currently available tools, PlasFlow demonstrated significantly better performance on test datasets. Analysis of two samples from heavy metal-contaminated microbial mats revealed that plasmids may constitute an important fraction of their metagenomes and carry genes involved in heavy-metal homeostasis, proving the pivotal role of plasmids in microorganism adaptation to environmental conditions.
Collapse
Affiliation(s)
- Pawel S Krawczyk
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland.,Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Leszek Lipinski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Andrzej Dziembowski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland.,Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
| |
Collapse
|
5
|
Bohlin J, Pettersson JHO. Evolution of Genomic Base Composition: From Single Cell Microbes to Multicellular Animals. Comput Struct Biotechnol J 2019; 17:362-370. [PMID: 30949307 PMCID: PMC6429543 DOI: 10.1016/j.csbj.2019.03.001] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2018] [Revised: 02/28/2019] [Accepted: 03/01/2019] [Indexed: 01/07/2023] Open
Abstract
Whole genome sequencing (WGS) of thousands of microbial genomes has provided considerable insight into evolutionary mechanisms in the microbial world. While substantially fewer eukaryotic genomes are available for analyses the number is rapidly increasing. This mini-review summarizes broadly evolutionary dynamics of base composition in the different domains of life from the perspective of prokaryotes. Common and different evolutionary mechanisms influencing genomic base composition in eukaryotes and prokaryotes are discussed. The conclusion from the data currently available suggests that while there are similarities there are also striking differences in how genomic base composition has evolved within prokaryotes and eukaryotes. For instance, homologous recombination appears to increase GC content locally in eukaryotes due to a non-selective process termed GC-biased gene conversion (gBGC). For prokaryotes on the other hand, increase in genomic GC content seems to be driven by the environment and selection. We find that similar phenomena observed for some organisms in each respective domain may be caused by very different mechanisms: while gBGC and recombination rates appear to explain the negative correlation between GC3 (GC content based on the third codon nucleotides) and genome size in some eukaryotes uptake of AT rich DNA sequences is the main reason for a similar negative correlation observed in prokaryotes. We provide further examples that indicate that base composition in prokaryotes and eukaryotes have evolved under very different constraints.
Collapse
Affiliation(s)
- Jon Bohlin
- Norwegian Institute of Public Health, Division of Infection Control and Environmental Health, Department of Infectious Disease Epidemiology and Modelling, Lovisenberggata 8, 0456 Oslo, Norway.,Centre for Fertility and Health, Norwegian Institute of Public Health, PO-Box 222 Skøyen, N-0213 Oslo, Norway.,Norwegian University of Life Sciences, Faculty of Veterinary Sciences, Production Animal Clinical Sciences, Ullevålsveien 72, 0454 Oslo, Norway
| | - John H-O Pettersson
- Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School the University of Sydney, New South Wales 2006, Australia.,Zoonosis Science Center, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden.,Public Health Agency of Sweden, Nobels vg 18, SE-171 82 Solna, Sweden
| |
Collapse
|
6
|
Sharma V, Mobeen F, Prakash T. Exploration of Survival Traits, Probiotic Determinants, Host Interactions, and Functional Evolution of Bifidobacterial Genomes Using Comparative Genomics. Genes (Basel) 2018; 9:genes9100477. [PMID: 30275399 PMCID: PMC6210967 DOI: 10.3390/genes9100477] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2018] [Accepted: 09/10/2018] [Indexed: 12/15/2022] Open
Abstract
Members of the genus Bifidobacterium are found in a wide-range of habitats and are used as important probiotics. Thus, exploration of their functional traits at the genus level is of utmost significance. Besides, this genus has been demonstrated to exhibit an open pan-genome based on the limited number of genomes used in earlier studies. However, the number of genomes is a crucial factor for pan-genome calculations. We have analyzed the pan-genome of a comparatively larger dataset of 215 members of the genus Bifidobacterium belonging to different habitats, which revealed an open nature. The pan-genome for the 56 probiotic and human-gut strains of this genus, was also found to be open. The accessory- and unique-components of this pan-genome were found to be under the operation of Darwinian selection pressure. Further, their genome-size variation was predicted to be attributed to the abundance of certain functions carried by genomic islands, which are facilitated by insertion elements and prophages. In silico functional and host-microbe interaction analyses of their core-genome revealed significant genomic factors for niche-specific adaptations and probiotic traits. The core survival traits include stress tolerance, biofilm formation, nutrient transport, and Sec-secretion system, whereas the core probiotic traits are imparted by the factors involved in carbohydrate- and protein-metabolism and host-immunomodulations.
Collapse
Affiliation(s)
- Vikas Sharma
- School of Basic Sciences, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175005, India.
| | - Fauzul Mobeen
- School of Basic Sciences, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175005, India.
| | - Tulika Prakash
- School of Basic Sciences, Indian Institute of Technology Mandi, Kamand, Mandi, Himachal Pradesh 175005, India.
| |
Collapse
|
7
|
Akhter S, Aziz RK, Kashef MT, Ibrahim ES, Bailey B, Edwards RA. Kullback Leibler divergence in complete bacterial and phage genomes. PeerJ 2017; 5:e4026. [PMID: 29204318 PMCID: PMC5712468 DOI: 10.7717/peerj.4026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2017] [Accepted: 10/22/2017] [Indexed: 12/11/2022] Open
Abstract
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
Collapse
Affiliation(s)
- Sajia Akhter
- Computational Science Research Center, San Diego State University, San Diego, CA, USA
| | - Ramy K Aziz
- Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt.,Department of Computer Science, San Diego State University, San Diego, CA, United States of America
| | - Mona T Kashef
- Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
| | - Eslam S Ibrahim
- Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
| | - Barbara Bailey
- Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
| | - Robert A Edwards
- Computational Science Research Center, San Diego State University, San Diego, CA, USA.,Department of Computer Science, San Diego State University, San Diego, CA, United States of America.,Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA.,Department of Biology, San Diego State University, San Diego, CA, USA
| |
Collapse
|
8
|
Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L. The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 2017; 18:151. [PMID: 28187704 PMCID: PMC5303225 DOI: 10.1186/s12864-017-3543-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2016] [Accepted: 02/02/2017] [Indexed: 12/02/2022] Open
Abstract
Background The core genome consists of genes shared by the vast majority of a species and is therefore assumed to have been subjected to substantially stronger purifying selection than the more mobile elements of the genome, also known as the accessory genome. Here we examine intragenic base composition differences in core genomes and corresponding accessory genomes in 36 species, represented by the genomes of 731 bacterial strains, to assess the impact of selective forces on base composition in microbes. We also explore, in turn, how these results compare with findings for whole genome intragenic regions. Results We found that GC content in coding regions is significantly higher in core genomes than accessory genomes and whole genomes. Likewise, GC content variation within coding regions was significantly lower in core genomes than in accessory genomes and whole genomes. Relative entropy in coding regions, measured as the difference between observed and expected trinucleotide frequencies estimated from mononucleotide frequencies, was significantly higher in the core genomes than in accessory and whole genomes. Relative entropy was positively associated with coding region GC content within the accessory genomes, but not within the corresponding coding regions of core or whole genomes. Conclusion The higher intragenic GC content and relative entropy, as well as the lower GC content variation, observed in the core genomes is most likely associated with selective constraints. It is unclear whether the positive association between GC content and relative entropy in the more mobile accessory genomes constitutes signatures of selection or selective neutral processes. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3543-7) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jon Bohlin
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway.
| | - Vegard Eldholm
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - John H O Pettersson
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - Ola Brynildsrud
- Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
| | - Lars Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, 1430, Ås, Norway
| |
Collapse
|
9
|
Li Z, Cao H, Cui Y, Zhang Y. Extracting DNA words based on the sequence features: non-uniform distribution and integrity. Theor Biol Med Model 2016; 13:2. [PMID: 26811154 PMCID: PMC4727310 DOI: 10.1186/s12976-016-0028-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2015] [Accepted: 01/14/2016] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND DNA sequence can be viewed as an unknown language with words as its functional units. Given that most sequence alignment algorithms such as the motif discovery algorithms depend on the quality of background information about sequences, it is necessary to develop an ab initio algorithm for extracting the "words" based only on the DNA sequences. METHODS We considered that non-uniform distribution and integrity were two important features of a word, based on which we developed an ab initio algorithm to extract "DNA words" that have potential functional meaning. A Kolmogorov-Smirnov test was used for consistency test of uniform distribution of DNA sequences, and the integrity was judged by the sequence and position alignment. Two random base sequences were adopted as negative control, and an English book was used as positive control to verify our algorithm. We applied our algorithm to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli to show the utility of the methods. RESULTS The results provide strong evidences that the algorithm is a promising tool for ab initio building a DNA dictionary. CONCLUSIONS Our method provides a fast way for large scale screening of important DNA elements and offers potential insights into the understanding of a genome.
Collapse
Affiliation(s)
- Zhi Li
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
| | - Hongyan Cao
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
| | - Yuehua Cui
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
- Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, USA.
| | - Yanbo Zhang
- Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, 030001, China.
| |
Collapse
|
10
|
Bohlin J. Genome expansion in bacteria: the curios case of Chlamydia trachomatis. BMC Res Notes 2015; 8:512. [PMID: 26423146 PMCID: PMC4589037 DOI: 10.1186/s13104-015-1464-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2015] [Accepted: 09/21/2015] [Indexed: 11/23/2022] Open
Abstract
Background Recent findings indicated that a correlation between genomic % AT and genome size within strains of microbial species was predominantly associated with the uptake of foreign DNA. One species however, Chlamydia trachomatis, defied any explanation. In the present study 79 fully sequenced C. trachomatis genomes, representing ocular- (nine strains), urogenital- (36 strains) and lymphogranuloma venereum strains (LGV, 22 strains), in three pathogroups, in addition to 12 laboratory isolates, were scrutinized with the intent of elucidating the positive correlation between genomic AT content and genome size. Results The average size difference between the strains of each pathogroup was largely explained by the incorporation of genetic fragments. These fragments were slightly more AT rich than their corresponding host genomes, but not enough to justify the difference in AT content between the strains of the smaller genomes lacking the fragments. In addition, a genetic region predominantly found in the ocular strains, which had the largest genomes, was on average more GC rich than the host genomes of the urogenital strains (58.64 % AT vs. 58.69 % AT), which had the second largest genomes, implying that the foreign genetic regions cannot alone explain the association between genome size and AT content in C. trachomatis. 23,492 SNPs were identified for all 79 genomes, and although the SNPs were on average slightly GC rich (~47 % AT), a significant association was found between genome-wide SNP AT content, for each pathogroup, and genome size (p < 0.001, R2 = 0.86) in the C. trachomatis strains. Conclusions The correlation between genome size and AT content, with respect to the C. trachomatis pathogroups, was explained by the incorporation of genetic fragments unique to the ocular and/or urogenital strains into the LGV- and urogential strains in addition to the genome-wide SNP AT content differences between the three pathogroups. Electronic supplementary material The online version of this article (doi:10.1186/s13104-015-1464-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jon Bohlin
- Department of Bacteriology and Immunology, Norwegian Institute of Public Health, Lovisenberggata 6, P.O. Box 4404, 0403, Oslo, Norway.
| |
Collapse
|
11
|
van Zyl LJ, Sunda F, Taylor MP, Cowan DA, Trindade MI. Identification and characterization of a novel Geobacillus thermoglucosidasius bacteriophage, GVE3. Arch Virol 2015; 160:2269-82. [PMID: 26123922 DOI: 10.1007/s00705-015-2497-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2015] [Accepted: 06/12/2015] [Indexed: 11/25/2022]
Abstract
The study of extremophilic phages may reveal new phage families as well as different mechanisms of infection, propagation and lysis to those found in phages from temperate environments. We describe a novel siphovirus, GVE3, which infects the thermophile Geobacillus thermoglucosidasius. The genome size is 141,298 bp (G+C 29.6%), making it the largest Geobacillus spp-infecting phage known. GVE3 appears to be most closely related to the recently described Bacillus anthracis phage vB_BanS_Tsamsa, rather than Geobacillus-infecting phages described thus far. Tetranucleotide usage deviation analysis supports this relationship, showing that the GVE3 genome sequence correlates best with B. anthracis and Bacillus cereus genome sequences, rather than Geobacillus spp genome sequences.
Collapse
Affiliation(s)
- Leonardo Joaquim van Zyl
- Institute for Microbial Biotechnology and Metagenomics (IMBM), University of the Western Cape, Robert Sobukwe Road, Bellville, Cape Town, South Africa,
| | | | | | | | | |
Collapse
|
12
|
Bohlin J, Brynildsrud OB, Sekse C, Snipen L. An evolutionary analysis of genome expansion and pathogenicity in Escherichia coli. BMC Genomics 2014; 15:882. [PMID: 25297974 PMCID: PMC4200225 DOI: 10.1186/1471-2164-15-882] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2014] [Accepted: 09/29/2014] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND There are several studies describing loss of genes through reductive evolution in microbes, but how selective forces are associated with genome expansion due to horizontal gene transfer (HGT) has not received similar attention. The aim of this study was therefore to examine how selective pressures influence genome expansion in 53 fully sequenced and assembled Escherichia coli strains. We also explored potential connections between genome expansion and the attainment of virulence factors. This was performed using estimations of several genomic parameters such as AT content, genomic drift (measured using relative entropy), genome size and estimated HGT size, which were subsequently compared to analogous parameters computed from the core genome consisting of 1729 genes common to the 53 E. coli strains. Moreover, we analyzed how selective pressures (quantified using relative entropy and dN/dS), acting on the E. coli core genome, influenced lineage and phylogroup formation. RESULTS Hierarchical clustering of dS and dN estimations from the E. coli core genome resulted in phylogenetic trees with topologies in agreement with known E. coli taxonomy and phylogroups. High values of dS, compared to dN, indicate that the E. coli core genome has been subjected to substantial purifying selection over time; significantly more than the non-core part of the genome (p<0.001). This is further supported by a linear association between strain-wise dS and dN values (β = 26.94 ± 0.44, R2~0.98, p<0.001). The non-core part of the genome was also significantly more AT-rich (p<0.001) than the core genome and E. coli genome size correlated with estimated HGT size (p<0.001). In addition, genome size (p<0.001), AT content (p<0.001) as well as estimated HGT size (p<0.005) were all associated with the presence of virulence factors, suggesting that pathogenicity traits in E. coli are largely attained through HGT. No associations were found between selective pressures operating on the E. coli core genome, as estimated using relative entropy, and genome size (p~0.98). CONCLUSIONS On a larger time frame, genome expansion in E. coli, which is significantly associated with the acquisition of virulence factors, appears to be independent of selective forces operating on the core genome.
Collapse
Affiliation(s)
- Jon Bohlin
- Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P,O, Box 4404, Oslo 0403, Norway.
| | | | | | | |
Collapse
|
13
|
Bohlin J, Sekse C, Skjerve E, Brynildsrud O. Positive correlations between genomic %AT and genome size within strains of bacterial species. ENVIRONMENTAL MICROBIOLOGY REPORTS 2014; 6:278-286. [PMID: 24983532 DOI: 10.1111/1758-2229.12145] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2013] [Accepted: 12/23/2013] [Indexed: 06/03/2023]
Abstract
Genomic %AT has been found to correlate negatively with genome size in microbes. While microbes with large genomes are often GC rich and free living, AT-rich bacteria tend to be host associated with smaller genomes. With over 2000 fully sequenced and assembled microbial genomes available, we explored the relationship among genomic %AT, genome size, relative entropy (a measure associated with genetic drift) and fraction of genome islands (GIs) in microbial species with the genomes of more than 10 strains available. A negative correlation with genome size was found in six out of 12 phylogenetic groups and subphyla and a positive correlation in only two. At the species level, we found a trend of positive correlations between genomic %AT and genome size in eight out of 20 species, while only four showed a negative correlation. Estimated chromosomal fractions of GIs were found to correlate positively with genome size in the strains of 14 out of 18 species and genomic %AT in the strains of seven species (two correlated negatively). Although GIs explain most of the observed positive correlations between genomic %AT and size, Chlamydia trachomatis seem to be an exception; therefore, these findings needs to be further explored.
Collapse
Affiliation(s)
- Jon Bohlin
- Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes Gate 6, P.O. Box 4404, 0403, Oslo, Norway; Epi-Centre, Department of Food-Safety and Infection Biology, Norwegian School of Veterinary Science, Ullevålsveien 72, P.O. Box 8146 Dep, NO-0033, Oslo, Norway
| | | | | | | |
Collapse
|
14
|
Bohlin J, Brynildsrud O, Vesth T, Skjerve E, Ussery DW. Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS One 2013; 8:e69878. [PMID: 23922837 PMCID: PMC3724673 DOI: 10.1371/journal.pone.0069878] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2013] [Accepted: 06/14/2013] [Indexed: 11/18/2022] Open
Abstract
INTRODUCTION Genomic base composition ranges from less than 25% AT to more than 85% AT in prokaryotes. Since only a small fraction of prokaryotic genomes is not protein coding even a minor change in genomic base composition will induce profound protein changes. We examined how amino acid and codon frequencies were distributed in over 2000 microbial genomes and how these distributions were affected by base compositional changes. In addition, we wanted to know how genome-wide amino acid usage was biased in the different genomes and how changes to base composition and mutations affected this bias. To carry this out, we used a Generalized Additive Mixed-effects Model (GAMM) to explore non-linear associations and strong data dependences in closely related microbes; principal component analysis (PCA) was used to examine genomic amino acid- and codon frequencies, while the concept of relative entropy was used to analyze genomic mutation rates. RESULTS We found that genomic amino acid frequencies carried a stronger phylogenetic signal than codon frequencies, but that this signal was weak compared to that of genomic %AT. Further, in contrast to codon usage bias (CUB), amino acid usage bias (AAUB) was differently distributed in AT- and GC-rich genomes in the sense that AT-rich genomes did not prefer specific amino acids over others to the same extent as GC-rich genomes. AAUB was also associated with relative entropy; genomes with low AAUB contained more random mutations as a consequence of relaxed purifying selection than genomes with higher AAUB. CONCLUSION Genomic base composition has a substantial effect on both amino acid- and codon frequencies in bacterial genomes. While phylogeny influenced amino acid usage more in GC-rich genomes, AT-content was driving amino acid usage in AT-rich genomes. We found the GAMM model to be an excellent tool to analyze the genomic data used in this study.
Collapse
Affiliation(s)
- Jon Bohlin
- Centre for Epidemiology and Biostatistics, Department of Food Safety and Infection Biology, Norwegian School of Veterinary Science, Oslo, Norway.
| | | | | | | | | |
Collapse
|