1
Bruneaux M, Kronholm I, Ashrafi R, Ketola T. Roles of adenine methylation and genetic mutations in adaptation to different temperatures in Serratia marcescens. Epigenetics 2021; 17:861-881. [PMID: 34519613 DOI: 10.1080/15592294.2021.1966215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Epigenetic modifications can contribute to adaptation, but the relative contributions of genetic and epigenetic variation are unknown. Previous studies on the role of epigenetic changes in adaptation in eukaryotes have nearly exclusively focused on cytosine methylation (m5C), while prokaryotes exhibit a richer system of methyltransferases targeting adenines (m6A) or cytosines (m4C, m5C). DNA methylation in prokaryotes has many roles, but its potential role in adaptation still needs further investigation. We collected phenotypic, genetic, and epigenetic data using single molecule real-time sequencing of clones of the bacterium Serratia marcescens that had undergone experimental evolution in contrasting temperatures to investigate the relationship between environment and genetic, epigenetic, and phenotypic changes. The genomic distribution of GATC motifs, which were the main target for m6A methylation, and of variable m6A epiloci pointed to a potential link between m6A methylation and regulation of gene expression in S. marcescens. Evolved strains, while genetically homogeneous, exhibited many polymorphic m6A epiloci. There was no strong support for a genetic control of methylation changes in our experiment, and no clear evidence of parallel environmentally induced or environmentally selected methylation changes at specific epiloci was found. Both genetic and epigenetic variants were associated with some phenotypic traits. Overall, our results suggest that both genetic and adenine methylation changes have the potential to contribute to phenotypic adaptation in S. marcescens, but that any environmentally induced epigenetic change occurring in our experiment would probably have been quite labile.
Affiliation(s)
- Matthieu Bruneaux: Department of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, Finland
- Ilkka Kronholm: Department of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, Finland
- Roghaieh Ashrafi: Department of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, Finland
- Tarmo Ketola: Department of Biological and Environmental Science, University of Jyväskylä, Jyväskylä, Finland
2
Duan B, Ding P, Navarre WW, Liu J, Xia B. Xenogeneic Silencing and Bacterial Genome Evolution: Mechanisms for DNA Recognition Imply Multifaceted Roles of Xenogeneic Silencers. Mol Biol Evol 2021; 38:4135-4148. [PMID: 34003286 PMCID: PMC8476142 DOI: 10.1093/molbev/msab136] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/11/2021] [Revised: 04/08/2021] [Indexed: 12/14/2022]
Abstract
Horizontal gene transfer (HGT) is a major driving force for bacterial evolution. To avoid the deleterious effects due to the unregulated expression of newly acquired foreign genes, bacteria have evolved specific proteins named xenogeneic silencers to recognize foreign DNA sequences and suppress their transcription. As there is considerable diversity in genomic base compositions among bacteria, how xenogeneic silencers distinguish self- from nonself DNA in different bacteria remains poorly understood. This review summarizes the progress in studying the DNA binding preferences and the underlying molecular mechanisms of known xenogeneic silencer families, represented by H-NS of Escherichia coli, Lsr2 of Mycobacterium, MvaT of Pseudomonas, and Rok of Bacillus. Comparative analyses of the published data indicate that the differences in DNA recognition mechanisms enable these xenogeneic silencers to have clear characteristics in DNA sequence preferences, which are further correlated with different host genomic features. These correlations provide insights into the mechanisms of how these xenogeneic silencers selectively target foreign DNA in different genomic backgrounds. Furthermore, it is revealed that the genomic AT contents of bacterial species with the same xenogeneic silencer family proteins are distributed in a limited range and are generally lower than those of species without any known xenogeneic silencers in the same phylum/class/genus, indicating that xenogeneic silencers have multifaceted roles in bacterial genome evolution. In addition to regulating horizontal gene transfer, xenogeneic silencers also act as a selective force against the GC to AT mutational bias found in bacterial genomes and help maintain host genomic AT content at relatively low levels.
Affiliation(s)
- Bo Duan: Beijing Nuclear Magnetic Resonance Center, College of Chemistry and Molecular Engineering, and School of Life Sciences, Peking University, Beijing, 100871, China
- Pengfei Ding: Beijing Nuclear Magnetic Resonance Center, College of Chemistry and Molecular Engineering, and School of Life Sciences, Peking University, Beijing, 100871, China
- William Wiley Navarre: Department of Molecular Genetics, University of Toronto, Toronto, Ontario, M5G 1M1, Canada
- Jun Liu: Department of Molecular Genetics, University of Toronto, Toronto, Ontario, M5G 1M1, Canada
- Bin Xia: Beijing Nuclear Magnetic Resonance Center, College of Chemistry and Molecular Engineering, and School of Life Sciences, Peking University, Beijing, 100871, China
3
Bize A, Midoux C, Mariadassou M, Schbath S, Forterre P, Da Cunha V. Exploring short k-mer profiles in cells and mobile elements from Archaea highlights the major influence of both the ecological niche and evolutionary history. BMC Genomics 2021; 22:186. [PMID: 33726663 PMCID: PMC7962313 DOI: 10.1186/s12864-021-07471-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2020] [Accepted: 02/24/2021] [Indexed: 12/16/2022]
Abstract
BACKGROUND K-mer-based methods have greatly advanced in recent years, largely driven by the realization of their biological significance and by the advent of next-generation sequencing. Their speed and their independence from the annotation process are major advantages. Their utility in the study of the mobilome has recently emerged and they seem a priori adapted to the patchy gene distribution and the lack of universal marker genes of viruses and plasmids. To provide a framework for the interpretation of results from k-mer based methods applied to archaea or their mobilome, we analyzed the 5-mer DNA profiles of close to 600 archaeal cells, viruses and plasmids. Archaea is one of the three domains of life. Archaea seem enriched in extremophiles and are associated with a high diversity of viral and plasmid families, many of which are specific to this domain. We explored the dataset structure by multivariate and statistical analyses, seeking to identify the underlying factors. RESULTS For cells, the 5-mer profiles were inconsistent with the phylogeny of archaea. At a finer taxonomic level, the influence of the taxonomy and the environmental constraints on 5-mer profiles was very strong. These two factors were interdependent to a significant extent, and the respective weights of their contributions varied according to the clade. A convergent adaptation was observed for the class Halobacteria, for which a strong 5-mer signature was identified. For mobile elements, coevolution with the host had a clear influence on their 5-mer profile. This enabled us to identify one previously known and one new case of recent host transfer based on the atypical composition of the mobile elements involved. Beyond the effect of coevolution, extrachromosomal elements strikingly retain the specific imprint of their own viral or plasmid taxonomic family in their 5-mer profile. 
CONCLUSION This specific imprint confirms that the evolution of extrachromosomal elements is driven by multiple parameters and is not restricted to host adaptation. In addition, we detected only recent host transfer events, suggesting the fast evolution of short k-mer profiles. This calls for caution when using k-mers for host prediction, metagenomic binning or phylogenetic reconstruction.
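The notion of a 5-mer profile used in this abstract can be illustrated with a minimal Python sketch. It counts overlapping 5-mers on one strand and normalizes them to relative frequencies. The function name `kmer_profile` and the toy sequence are illustrative assumptions; the authors' actual choices (strand handling, normalization, filtering) are not stated in the abstract and may differ.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq: str, k: int = 5) -> dict[str, float]:
    """Relative frequency of every possible k-mer over the ACGT alphabet.

    Hypothetical helper for illustration only. Windows containing any
    other symbol (for example N) are skipped.
    """
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    # Every one of the 4**k k-mers gets an entry, so profiles from
    # different sequences are directly comparable vectors.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Toy sequence (made up), including one ambiguous base.
profile = kmer_profile("ACGTACGTACGTNACGTA")
```

Profiles built this way have a fixed length (1024 entries for k = 5), which is what makes multivariate comparisons across cells, viruses and plasmids possible.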
Affiliation(s)
- Ariane Bize: Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France
- Cédric Midoux: Université Paris-Saclay, INRAE, PROSE, F-92761, Antony, France; Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France; Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Mahendra Mariadassou: Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France; Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Sophie Schbath: Université Paris-Saclay, INRAE, MaIAGE, F-78350, Jouy-en-Josas, France; Université Paris-Saclay, INRAE, BioinfOmics, MIGALE bioinformatics facility, F-78350, Jouy-en-Josas, France
- Patrick Forterre: Institut Pasteur, Unité de Virologie des Archées, Département de Microbiologie, 25 Rue du Docteur Roux, 75015, Paris, France; Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
- Violette Da Cunha: Université Paris-Saclay, CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), 91198, Gif-sur-Yvette, France
4
Azlan A, Obeidat SM, Theva Das K, Yunus MA, Azzam G. Genome-wide identification of Aedes albopictus long noncoding RNAs and their association with dengue and Zika virus infection. PLoS Negl Trop Dis 2021; 15:e0008351. [PMID: 33481791 PMCID: PMC7872224 DOI: 10.1371/journal.pntd.0008351] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/04/2020] [Revised: 02/09/2021] [Accepted: 11/20/2020] [Indexed: 12/14/2022]
Abstract
The Asian tiger mosquito, Aedes albopictus (Ae. albopictus), is an important vector that transmits arboviruses such as dengue (DENV), Zika (ZIKV) and Chikungunya virus (CHIKV). Long noncoding RNAs (lncRNAs) are known to regulate various biological processes. Knowledge of Ae. albopictus lncRNAs and their functional role in virus-host interactions is still limited. Here, we identified and characterized the lncRNAs in the genome of an arbovirus vector, Ae. albopictus, and evaluated their potential involvement in DENV and ZIKV infection. We used 148 public datasets, and identified a total of 10,867 novel lncRNA transcripts, of which 5,809, 4,139, and 919 were intergenic, intronic and antisense respectively. The Ae. albopictus lncRNAs shared many characteristics with other species such as short length, low GC content, and low sequence conservation. RNA-sequencing of Ae. albopictus cells infected with DENV and ZIKV showed that the expression of lncRNAs was altered upon virus infection. Target prediction analysis revealed that Ae. albopictus lncRNAs may regulate the expression of genes involved in immunity and other metabolic and cellular processes. To verify the role of lncRNAs in virus infection, we generated mutations in lncRNA loci using CRISPR-Cas9, and discovered that two lncRNA loci mutations, namely XLOC_029733 (novel lncRNA transcript id: lncRNA_27639.2) and LOC115270134 (known lncRNA transcript id: XR_003899061.1) resulted in enhancement of DENV and ZIKV replication. The results presented here provide an important foundation for future studies of lncRNAs and their relationship with virus infection in Ae. albopictus.
Ae. albopictus is an important vector of arboviruses such as dengue and Zika viruses. Studies on virus-host interaction at gene expression and molecular level are crucial, especially in devising methods to inhibit virus replication in Aedes mosquitoes. Previous reports have shown that, besides protein-coding genes, noncoding RNAs such as lncRNAs are also involved in virus-host interaction. In this study, we report a comprehensive catalog of novel lncRNA transcripts in the genome of Ae. albopictus. We also show that the expression of lncRNAs was altered upon infection with dengue and Zika. Additionally, depletion of certain lncRNAs resulted in increased replication of dengue and Zika, suggesting a potential association of lncRNAs with virus infection. Results of this study provide a new avenue to the investigation of mosquito-virus interactions, especially in the aspect of noncoding genes.
Affiliation(s)
- Azali Azlan: School of Biological Sciences, Universiti Sains Malaysia, Penang, Malaysia
- Sattam M. Obeidat: School of Biological Sciences, Universiti Sains Malaysia, Penang, Malaysia
- Kumitaa Theva Das: Infectomics Cluster, Advanced Medical & Dental Institute, Universiti Sains Malaysia, Bertam, Kepala Batas, Pulau Pinang, Malaysia
- Muhammad Amir Yunus: Infectomics Cluster, Advanced Medical & Dental Institute, Universiti Sains Malaysia, Bertam, Kepala Batas, Pulau Pinang, Malaysia
- Ghows Azzam: School of Biological Sciences, Universiti Sains Malaysia, Penang, Malaysia
5
Revisiting the Relationships Between Genomic G + C Content, RNA Secondary Structures, and Optimal Growth Temperature. J Mol Evol 2020; 89:165-171. [PMID: 33216148 DOI: 10.1007/s00239-020-09974-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 09/16/2020] [Accepted: 11/09/2020] [Indexed: 10/23/2022]
Abstract
Over twenty years ago Galtier and Lobry published a manuscript entitled "Relationships between Genomic G + C Content, RNA Secondary Structure, and Optimal Growth Temperature" in the Journal of Molecular Evolution that showcased the lack of a relationship between genomic G + C content and optimal growth temperature (OGT) in a set of about 200 prokaryotes. Galtier and Lobry also assessed the relationship between RNA secondary structures (rRNA stems, tRNAs) and OGT, and in this case a clear relationship emerged. Increasing structured RNA G + C content (particularly in regions that are double-stranded) correlates with increased OGT. Both of these fundamental relationships have withstood the test of many additional sequences and spawned a variety of different applications that include prediction of OGT from rRNA sequence and computational ncRNA identification approaches. In this work, I present the motivation behind Galtier and Lobry's original paper and the larger questions addressed by the work, how these questions have evolved over the last two decades, and the impact of Galtier and Lobry's manuscript in fields beyond these questions.
6
Zhou Y, Zhang W, Wu H, Huang K, Jin J. A high-resolution genomic composition-based method with the ability to distinguish similar bacterial organisms. BMC Genomics 2019; 20:754. [PMID: 31638897 PMCID: PMC6805505 DOI: 10.1186/s12864-019-6119-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/26/2019] [Accepted: 09/20/2019] [Indexed: 12/03/2022]
Abstract
Background Genomic composition has been found to be species specific and is used to differentiate bacterial species. To date, almost no published composition-based approaches are able to distinguish between most closely related organisms, including intra-genus species and intra-species strains. Thus, it is necessary to develop a novel approach to address this problem. Results Here, we initially determine that the “tetranucleotide-derived z-value Pearson correlation coefficient” (TETRA) approach is representative of other published statistical methods. Then, we devise a novel method called “Tetranucleotide-derived Z-value Manhattan Distance” (TZMD) and compare it with the TETRA approach. Our results show that TZMD reflects the maximal genome difference, while TETRA does not in most conditions, demonstrating in theory that TZMD provides improved resolution. Additionally, our analysis of real data shows that TZMD improves species differentiation and clearly differentiates similar organisms, including similar species belonging to the same genospecies, subspecies and intraspecific strains, most of which cannot be distinguished by TETRA. Furthermore, TZMD is able to determine clonal strains with the TZMD = 0 criterion, which intrinsically encompasses identical composition, high average nucleotide identity and high percentage of shared genomes. Conclusions Our extensive assessment demonstrates that TZMD has high resolution. This study is the first to propose a composition-based method for differentiating bacteria at the strain level and to demonstrate that composition is also strain specific. TZMD is a powerful tool and the first easy-to-use approach for differentiating clonal and non-clonal strains. Therefore, as the first composition-based algorithm for strain typing, TZMD will facilitate bacterial studies in the future.
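The contrast between a correlation-based score (TETRA) and a distance-based score (TZMD) can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes the tetranucleotide z-value vectors have already been computed, it uses made-up four-element vectors in place of real 256-element ones, and it assumes the Manhattan distance is the plain sum of absolute differences with no further normalization.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def manhattan(x: list[float], y: list[float]) -> float:
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy z-value vectors (hypothetical values, for illustration only).
z_a = [1.0, -2.0, 0.5, 3.0]
z_b = [2.0 * v for v in z_a]  # same pattern, twice the magnitude

r = pearson(z_a, z_b)    # correlation ignores the change in magnitude
d = manhattan(z_a, z_b)  # distance registers it
```

In this toy case the two vectors are perfectly correlated yet separated by a non-zero distance, which is one intuitive reason a distance measure might resolve differences that a correlation coefficient cannot. A distance of exactly zero requires identical vectors, consistent with the TZMD = 0 criterion for clonal strains mentioned in the abstract.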
Affiliation(s)
- Yizhuang Zhou: Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Peking-Tsinghua Center for Life Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, People's Republic of China
- Wenting Zhang: Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Huixian Wu: China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Kai Huang: Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
- Junfei Jin: Laboratory of Hepatobiliary and Pancreatic Surgery, The Affiliated Hospital of Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; China-USA Lipids in Health and Disease Research Center, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China; Guangxi Key Laboratory of Molecular Medicine in Liver Injury and Repair, Guilin Medical University, Guilin, Guangxi, 541001, People's Republic of China
7
Gophna U. The unbearable ease of expression-how avoidance of spurious transcription can shape G+C content in bacterial genomes. FEMS Microbiol Lett 2019; 365:5181332. [PMID: 30423131 DOI: 10.1093/femsle/fny267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 09/13/2018] [Accepted: 11/11/2018] [Indexed: 12/28/2022]
Affiliation(s)
- Uri Gophna: Molecular Cell Biology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Levanon Street, Tel Aviv 69000, Israel
8
Azlan A, Obeidat SM, Yunus MA, Azzam G. Systematic identification and characterization of Aedes aegypti long noncoding RNAs (lncRNAs). Sci Rep 2019; 9:12147. [PMID: 31434910 PMCID: PMC6704130 DOI: 10.1038/s41598-019-47506-9] [Citation(s) in RCA: 32] [Impact Index Per Article: 6.4] [Received: 01/13/2019] [Accepted: 07/18/2019] [Indexed: 12/14/2022]
Abstract
Long noncoding RNAs (lncRNAs) play diverse roles in biological processes. Aedes aegypti (Ae. aegypti), a blood-sucking mosquito, is the principal vector responsible for replication and transmission of arboviruses including dengue, Zika, and Chikungunya virus. Systematic identification and developmental characterisation of Ae. aegypti lncRNAs are still limited. We performed genome-wide identification of lncRNAs, followed by developmental profiling of lncRNAs in Ae. aegypti. We identified a total of 4,689 novel lncRNA transcripts, of which 2,064, 2,076, and 549 were intergenic, intronic, and antisense respectively. Ae. aegypti lncRNAs share many characteristics with those of other species, including low expression, low GC content, short length, and low conservation. In addition, the expression of Ae. aegypti lncRNAs tends to be correlated with that of neighbouring and antisense protein-coding genes. A subset of lncRNAs shows evidence of maternal inheritance, suggesting a potential role of lncRNAs in early-stage embryos. Additionally, lncRNAs show a higher tendency to be expressed in a developmentally and temporally specific manner. The results from this study provide a foundation for future investigation of the function of Ae. aegypti lncRNAs.
Affiliation(s)
- Azali Azlan: School of Biological Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia
- Sattam M Obeidat: School of Biological Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia
- Muhammad Amir Yunus: Infectomics Cluster, Advanced Medical & Dental Institute, Universiti Sains Malaysia, Bertam, 13200, Kepala Batas, Penang, Malaysia
- Ghows Azzam: School of Biological Sciences, Universiti Sains Malaysia, 11800, Penang, Malaysia
9
Krawczyk PS, Lipinski L, Dziembowski A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res 2019; 46:e35. [PMID: 29346586 PMCID: PMC5887522 DOI: 10.1093/nar/gkx1321] [Citation(s) in RCA: 276] [Impact Index Per Article: 55.2] [Received: 06/12/2017] [Accepted: 12/28/2017] [Indexed: 12/14/2022]
Abstract
Plasmids are mobile genetic elements that play an important role in the environmental adaptation of microorganisms. Although plasmids are usually analyzed in cultured microorganisms, there is a need for methods that allow for the analysis of pools of plasmids (plasmidomes) in environmental samples. To that end, several molecular biology and bioinformatics methods have been developed; however, they are limited to environments with low diversity and cannot recover large plasmids. Here, we present PlasFlow, a novel tool based on genomic signatures that employs a neural network approach for identification of bacterial plasmid sequences in environmental samples. PlasFlow can recover plasmid sequences from assembled metagenomes without any prior knowledge of the taxonomical or functional composition of samples with an accuracy up to 96%. It can also recover sequences of both circular and linear plasmids and can perform initial taxonomical classification of sequences. Compared to other currently available tools, PlasFlow demonstrated significantly better performance on test datasets. Analysis of two samples from heavy metal-contaminated microbial mats revealed that plasmids may constitute an important fraction of their metagenomes and carry genes involved in heavy-metal homeostasis, proving the pivotal role of plasmids in microorganism adaptation to environmental conditions.
Affiliation(s)
- Pawel S Krawczyk: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland; Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
- Leszek Lipinski: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland
- Andrzej Dziembowski: Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland; Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
10
Bohlin J, Pettersson JHO. Evolution of Genomic Base Composition: From Single Cell Microbes to Multicellular Animals. Comput Struct Biotechnol J 2019; 17:362-370. [PMID: 30949307 PMCID: PMC6429543 DOI: 10.1016/j.csbj.2019.03.001] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 09/28/2018] [Revised: 02/28/2019] [Accepted: 03/01/2019] [Indexed: 01/07/2023]
Abstract
Whole genome sequencing (WGS) of thousands of microbial genomes has provided considerable insight into evolutionary mechanisms in the microbial world. While substantially fewer eukaryotic genomes are available for analyses, the number is rapidly increasing. This mini-review broadly summarizes the evolutionary dynamics of base composition in the different domains of life from the perspective of prokaryotes. Common and different evolutionary mechanisms influencing genomic base composition in eukaryotes and prokaryotes are discussed. The data currently available suggest that, while there are similarities, there are also striking differences in how genomic base composition has evolved within prokaryotes and eukaryotes. For instance, homologous recombination appears to increase GC content locally in eukaryotes due to a non-selective process termed GC-biased gene conversion (gBGC). For prokaryotes, on the other hand, increase in genomic GC content seems to be driven by the environment and selection. We find that similar phenomena observed for some organisms in each respective domain may be caused by very different mechanisms: while gBGC and recombination rates appear to explain the negative correlation between GC3 (GC content based on the third codon nucleotides) and genome size in some eukaryotes, uptake of AT rich DNA sequences is the main reason for a similar negative correlation observed in prokaryotes. We provide further examples that indicate that base composition in prokaryotes and eukaryotes has evolved under very different constraints.
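GC3, defined in the abstract as GC content based on the third codon nucleotides, can be computed as in the minimal Python sketch below. The function name `gc3` and the toy coding sequence are illustrative assumptions. The sketch also assumes the input is already in frame and counts the stop codon like any other codon, which may differ from how a given study handles it.

```python
def gc3(cds: str) -> float:
    """Fraction of G or C at third codon positions of an in-frame coding sequence.

    Hypothetical helper for illustration; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any incomplete final codon
    third = cds[2:usable:3]           # every third base, starting at position 3
    if not third:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


# Toy coding sequence (made up): ATG GCC AAA TAA
example = gc3("ATGGCCAAATAA")
```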
Affiliation(s)
- Jon Bohlin: Norwegian Institute of Public Health, Division of Infection Control and Environmental Health, Department of Infectious Disease Epidemiology and Modelling, Lovisenberggata 8, 0456 Oslo, Norway; Centre for Fertility and Health, Norwegian Institute of Public Health, PO-Box 222 Skøyen, N-0213 Oslo, Norway; Norwegian University of Life Sciences, Faculty of Veterinary Sciences, Production Animal Clinical Sciences, Ullevålsveien 72, 0454 Oslo, Norway
- John H-O Pettersson: Marie Bashir Institute for Infectious Diseases and Biosecurity, Charles Perkins Centre, School of Life and Environmental Sciences and Sydney Medical School the University of Sydney, New South Wales 2006, Australia; Zoonosis Science Center, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden; Public Health Agency of Sweden, Nobels väg 18, SE-171 82 Solna, Sweden
11
Almpanis A, Swain M, Gatherer D, McEwan N. Correlation between bacterial G+C content, genome size and the G+C content of associated plasmids and bacteriophages. Microb Genom 2018; 4:e000168. [PMID: 29633935 PMCID: PMC5989581 DOI: 10.1099/mgen.0.000168] [Citation(s) in RCA: 63] [Impact Index Per Article: 10.5] [Received: 11/27/2017] [Accepted: 03/06/2018] [Indexed: 02/06/2023]
Abstract
Based on complete bacterial genome sequence data, we demonstrate a correlation between bacterial chromosome length and the G+C content of the genome, with longer genomes having higher G+C contents. The correlation value decreases at shorter genome sizes, where there is a wider spread of G+C values. However, although significant (P<0.001), the correlation value (Pearson R=0.58) suggests that other factors also have a significant influence. A similar pattern was seen for plasmids; longer plasmids had higher G+C values, although the large number of shorter plasmids had a wide spread of G+C values. There was also a significant (P<0.0001) correlation between the G+C content of plasmids and the G+C content of their bacterial host. Conversely, the G+C content of bacteriophages tended to reduce with larger genome sizes, and although there was a correlation between host genome G+C content and that of the bacteriophage, it was not as strong as that seen between plasmids and their hosts.
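The kind of analysis summarized in this abstract, relating sequence length to G+C content through a Pearson correlation, can be sketched as follows. The helper names and the three toy sequences are made up for illustration and have nothing to do with the study's data; real analyses would use complete genome sequences and a proper significance test.

```python
import math


def gc_content(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy "genomes" (hypothetical), chosen so that longer means more G+C.
toy_seqs = ["ATATATGC", "ATGCGCATGCAT", "GCGCGCATGCGCGCGC"]
lengths = [float(len(s)) for s in toy_seqs]
gc = [gc_content(s) for s in toy_seqs]
r = pearson_r(lengths, gc)
```

An R well below 1, such as the 0.58 reported in the abstract, indicates that length explains only part of the variation in G+C content.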
Affiliation(s)
- Apostolos Almpanis: Aberystwyth University, Aberystwyth, UK; Newcastle University, Newcastle-upon-Tyne, UK
- Neil McEwan: Aberystwyth University, Aberystwyth, UK; School of Pharmacy and Life Sciences, Robert Gordon University, Aberdeen, UK
12
Akhter S, Aziz RK, Kashef MT, Ibrahim ES, Bailey B, Edwards RA. Kullback Leibler divergence in complete bacterial and phage genomes. PeerJ 2017; 5:e4026. [PMID: 29204318 PMCID: PMC5712468 DOI: 10.7717/peerj.4026] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/30/2017] [Accepted: 10/22/2017] [Indexed: 12/11/2022]
Abstract
The amino acid content of the proteins encoded by a genome may predict the coding potential of that genome and may reflect lifestyle restrictions of the organism. Here, we calculated the Kullback–Leibler divergence from the mean amino acid content as a metric to compare the amino acid composition for a large set of bacterial and phage genome sequences. Using these data, we demonstrate that (i) there is a significant difference between amino acid utilization in different phylogenetic groups of bacteria and phages; (ii) many of the bacteria with the most skewed amino acid utilization profiles, or the bacteria that host phages with the most skewed profiles, are endosymbionts or parasites; (iii) the skews in the distribution are not restricted to certain metabolic processes but are common across all bacterial genomic subsystems; (iv) amino acid utilization profiles strongly correlate with GC content in bacterial genomes but very weakly correlate with the G+C percent in phage genomes. These findings might be exploited to distinguish coding from non-coding sequences in large data sets, such as metagenomic sequence libraries, to help in prioritizing subsequent analyses.
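The metric described in this abstract, the Kullback–Leibler divergence of a genome's amino acid composition from the mean composition, can be sketched in Python as below. Several details are assumptions made for illustration: the direction of the divergence (genome relative to mean), the base-2 logarithm, the unweighted mean across genomes, and the toy protein sequences, none of which are specified in the abstract.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequencies(proteins: list[str]) -> dict[str, float]:
    """Relative frequency of the 20 standard amino acids across all proteins."""
    counts = Counter(aa for p in proteins for aa in p.upper() if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    """D_KL(p || q) in bits; terms with p = 0 contribute nothing."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


# Toy proteomes (hypothetical sequences, for illustration only).
genomes = {"toy_a": ["MKKLLA", "MAAAK"], "toy_b": ["MGGSW", "MLLLV"]}
freqs = {name: aa_frequencies(prots) for name, prots in genomes.items()}

# Unweighted mean composition; because it includes every genome, it is
# non-zero wherever any single genome is non-zero, so the log is defined.
mean = {aa: sum(f[aa] for f in freqs.values()) / len(freqs) for aa in AMINO_ACIDS}
divergences = {name: kl_divergence(f, mean) for name, f in freqs.items()}
```

Larger values indicate a more skewed amino acid usage relative to the average, which is the sense in which the abstract speaks of skewed utilization profiles.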
Affiliation(s)
- Sajia Akhter: Computational Science Research Center, San Diego State University, San Diego, CA, USA
- Ramy K Aziz: Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt; Department of Computer Science, San Diego State University, San Diego, CA, United States of America
- Mona T Kashef: Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Eslam S Ibrahim: Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, Egypt
- Barbara Bailey: Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA
- Robert A Edwards: Computational Science Research Center, San Diego State University, San Diego, CA, USA; Department of Computer Science, San Diego State University, San Diego, CA, United States of America; Department of Mathematics & Statistics, San Diego State University, San Diego, CA, USA; Department of Biology, San Diego State University, San Diego, CA, USA
13
|
Systematic Identification and Molecular Characteristics of Long Noncoding RNAs in Pig Tissues. BioMed Research International 2017; 2017:6152582. [PMID: 29062838 PMCID: PMC5618743 DOI: 10.1155/2017/6152582] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 04/19/2017] [Revised: 07/26/2017] [Accepted: 08/08/2017] [Indexed: 12/15/2022]
Abstract
Long noncoding RNAs (lncRNAs) are non-protein-coding RNAs that are involved in a variety of biological processes. The pig is an important farm animal and an ideal biomedical model. In this study, we performed a genome-wide scan for lncRNAs in multiple tissue types from pigs. A total of 118 million paired-end 90 nt clean reads were obtained via strand-specific RNA sequencing, 80.4% of which were aligned to the pig reference genome. We developed a stringent bioinformatics pipeline to identify 2,139 high-quality multiexonic lncRNAs. The characteristic analysis revealed that the novel lncRNAs showed relatively shorter transcript length, fewer exons, and lower expression levels in comparison with protein-coding genes (PCGs). The guanine-cytosine (GC) content of the protein-coding exons and introns was significantly higher than that of the lncRNAs. Moreover, the single nucleotide polymorphism (SNP) density of lncRNAs was significantly higher than that of PCGs. Conservation analysis revealed that most lncRNAs were evolutionarily conserved among pigs, humans, and mice, such as CUFF.253988.1, which shares homology with human long noncoding RNA MALAT1. The findings of our study significantly increase the number of known lncRNAs in pigs.
|
14
|
Bohlin J, Eldholm V, Pettersson JHO, Brynildsrud O, Snipen L. The nucleotide composition of microbial genomes indicates differential patterns of selection on core and accessory genomes. BMC Genomics 2017; 18:151. [PMID: 28187704 PMCID: PMC5303225 DOI: 10.1186/s12864-017-3543-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Received: 12/09/2016] [Accepted: 02/02/2017] [Indexed: 12/02/2022] Open
Abstract
Background The core genome consists of genes shared by the vast majority of a species and is therefore assumed to have been subjected to substantially stronger purifying selection than the more mobile elements of the genome, also known as the accessory genome. Here we examine intragenic base composition differences in core genomes and corresponding accessory genomes in 36 species, represented by the genomes of 731 bacterial strains, to assess the impact of selective forces on base composition in microbes. We also explore, in turn, how these results compare with findings for whole genome intragenic regions. Results We found that GC content in coding regions is significantly higher in core genomes than accessory genomes and whole genomes. Likewise, GC content variation within coding regions was significantly lower in core genomes than in accessory genomes and whole genomes. Relative entropy in coding regions, measured as the difference between observed and expected trinucleotide frequencies estimated from mononucleotide frequencies, was significantly higher in the core genomes than in accessory and whole genomes. Relative entropy was positively associated with coding region GC content within the accessory genomes, but not within the corresponding coding regions of core or whole genomes. Conclusion The higher intragenic GC content and relative entropy, as well as the lower GC content variation, observed in the core genomes is most likely associated with selective constraints. It is unclear whether the positive association between GC content and relative entropy in the more mobile accessory genomes constitutes signatures of selection or selective neutral processes. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3543-7) contains supplementary material, which is available to authorized users.
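The relative entropy measure mentioned in this abstract can be sketched as follows. The sketch assumes a Kullback-Leibler form in which observed trinucleotide frequencies are compared with the product of the corresponding mononucleotide frequencies (a zero-order model). The exact estimator used in the paper is not given here, so the function name and details are illustrative assumptions.

```python
import math
from collections import Counter


def relative_entropy(seq, k=3):
    """KL divergence (bits) between observed k-mer frequencies and those
    expected from mononucleotide frequencies alone.

    Assumption: overlapping k-mers, windows with non-ACGT symbols skipped.
    """
    seq = seq.upper()
    mono = Counter(c for c in seq if c in "ACGT")
    n_mono = sum(mono.values())

    kmers = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(c in "ACGT" for c in word):
            kmers[word] += 1
    n_kmers = sum(kmers.values())
    if n_kmers == 0:
        return 0.0

    total = 0.0
    for word, count in kmers.items():
        observed = count / n_kmers
        # Expected frequency if bases were independent draws.
        expected = math.prod(mono[c] / n_mono for c in word)
        total += observed * math.log2(observed / expected)
    return total
```

Under this formulation a sequence with no structure beyond its base composition scores close to zero, while strongly constrained sequences score higher.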
Affiliation(s)
- Jon Bohlin
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Vegard Eldholm
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- John H O Pettersson
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Ola Brynildsrud
  - Infectious Disease Control and Environmental Health, Norwegian Institute of Public Health, Lovisenberggata 8, P.O. Box 4404, 0403, Oslo, Norway
- Lars Snipen
  - Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, 1430, Ås, Norway
|
15
|
Díez-Vives C, Moitinho-Silva L, Nielsen S, Reynolds D, Thomas T. Expression of eukaryotic-like protein in the microbiome of sponges. Mol Ecol 2017; 26:1432-1451. [PMID: 28036141 DOI: 10.1111/mec.14003] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.7] [Received: 08/18/2016] [Revised: 12/08/2016] [Accepted: 12/09/2016] [Indexed: 01/04/2023]
Abstract
Eukaryotic-like proteins (ELPs) are classes of proteins that are found in prokaryotes, but have a likely evolutionary origin in eukaryotes. ELPs have been postulated to mediate host-microbiome interactions. Recent work has discovered that prokaryotic symbionts of sponges contain abundant and diverse genes for ELPs, which could modulate interactions with their filter-feeding and phagocytic host. However, the extent to which these ELP genes are actually used and expressed by the symbionts is poorly understood. Here, we use metatranscriptomics to investigate ELP expression in the microbiomes of three different sponges (Cymbastella concentrica, Scopalina sp. and Tedania anhelens). We developed a workflow with optimized rRNA removal and in silico subtraction of host sequences to obtain a reliable symbiont metatranscriptome. This showed that between 1.3% and 2.3% of all symbiont transcripts contain genes for ELPs. Two classes of ELPs (cadherin and tetratricopeptide repeats) were abundantly expressed in the C. concentrica and Scopalina sp. microbiomes, while ankyrin repeat ELPs were predominant in the T. anhelens metatranscriptome. Comparison with transcripts that do not encode ELPs indicated a constitutive expression of ELPs across a range of bacterial and archaeal symbionts. Expressed ELPs also contained domains involved in protein secretion and/or were co-expressed with proteins involved in extracellular transport. This suggests these ELPs are likely exported, which could allow for direct interaction with the sponge. Our study shows that ELP genes in sponge symbionts represent actively expressed functions that could mediate molecular interaction between symbiosis partners.
Affiliation(s)
- C Díez-Vives
  - Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW, Australia
- L Moitinho-Silva
  - Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW, Australia
- S Nielsen
  - Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW, Australia
- D Reynolds
  - Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW, Australia
- T Thomas
  - Centre for Marine Bio-Innovation, The University of New South Wales, Sydney, NSW, Australia
|
16
|
Samchenko AA, Kiselev SS, Kabanov AV, Kondratjev MS, Komarov VM. On the nature of the domination of oligomeric (dA:dT)n tracts in the structure of eukaryotic genomes. Biophysics (Nagoya-shi) 2016. [DOI: 10.1134/s0006350916060233] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/23/2022] Open
|
17
|
Mehmood T, Bohlin J, Snipen L. A Partial Least Squares Based Procedure for Upstream Sequence Classification in Prokaryotes. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2015; 12:560-567. [PMID: 26357267 DOI: 10.1109/tcbb.2014.2366146] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
The upstream region of coding genes is important for several reasons, for instance locating transcription factor binding sites and start site initiation in genomic DNA. Motivated by a recently conducted study, where a multivariate approach was successfully applied to coding sequence modeling, we have introduced a partial least squares (PLS) based procedure for the classification of true upstream prokaryotic sequences from background upstream sequences. The upstream sequences of conserved coding genes across genomes were considered in the analysis, where conserved coding genes were found by using the pan-genomics concept for each considered prokaryotic species. PLS uses a position specific scoring matrix (PSSM) to study the characteristics of the upstream region. Results obtained by the PLS based method were compared with the Gini importance of random forest (RF) and with support vector machine (SVM), which is a much used method for sequence classification. The upstream sequence classification performance was evaluated by using cross validation, and the suggested approach identifies the prokaryotic upstream region significantly better than RF (p-value < 0.01) and SVM (p-value < 0.01). Further, the proposed method also produced results that concurred with known biological characteristics of the upstream region.
|
18
|
Broin PÓ, Smith TJ, Golden AA. Alignment-free clustering of transcription factor binding motifs using a genetic-k-medoids approach. BMC Bioinformatics 2015; 16:22. [PMID: 25627106 PMCID: PMC4384390 DOI: 10.1186/s12859-015-0450-2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 08/04/2014] [Accepted: 01/02/2015] [Indexed: 11/10/2022] Open
Abstract
Background Familial binding profiles (FBPs) represent the average binding specificity for a group of structurally related DNA-binding proteins. The construction of such profiles allows the classification of novel motifs based on similarity to known families, can help to reduce redundancy in motif databases and de novo prediction algorithms, and can provide valuable insights into the evolution of binding sites. Many current approaches to automated motif clustering rely on progressive tree-based techniques, and can suffer from so-called frozen sub-alignments, where motifs which are clustered early on in the process remain ‘locked’ in place despite the potential for better placement at a later stage. In order to avoid this scenario, we have developed a genetic-k-medoids approach which allows motifs to move freely between clusters at any point in the clustering process. Results We demonstrate the performance of our algorithm, GMACS, on multiple benchmark motif datasets, comparing results obtained with current leading approaches. The first dataset includes 355 position weight matrices from the TRANSFAC database and indicates that the k-mer frequency vector approach used in GMACS outperforms other motif comparison techniques. We then cluster a set of 79 motifs from the JASPAR database previously used in several motif clustering studies and demonstrate that GMACS can produce a higher number of structurally homogeneous clusters than other methods without the need for a large number of singletons. Finally, we show the robustness of our algorithm to noise on multiple synthetic datasets consisting of known motifs convolved with varying degrees of noise. Conclusions Our proposed algorithm is generally applicable to any DNA or protein motifs, can produce highly stable and biologically meaningful clusters, and, by avoiding the problem of frozen sub-alignments, can provide improved results when compared with existing techniques on benchmark datasets. 
Affiliation(s)
- Pilib Ó Broin
  - Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
  - National Centre for Biomedical Engineering Science, National University of Ireland, University Road, Galway, Ireland
- Terry J Smith
  - Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
- Aaron Aj Golden
  - Department of Genetics, Albert Einstein College of Medicine, 1301 Morris Park Avenue, Bronx, New York, 10461, USA
  - Department of Mathematical Sciences, Yeshiva University, New York, 10033, NY, USA
|
19
|
Bohlin J, Brynildsrud OB, Sekse C, Snipen L. An evolutionary analysis of genome expansion and pathogenicity in Escherichia coli. BMC Genomics 2014; 15:882. [PMID: 25297974 PMCID: PMC4200225 DOI: 10.1186/1471-2164-15-882] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 03/24/2014] [Accepted: 09/29/2014] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND There are several studies describing loss of genes through reductive evolution in microbes, but how selective forces are associated with genome expansion due to horizontal gene transfer (HGT) has not received similar attention. The aim of this study was therefore to examine how selective pressures influence genome expansion in 53 fully sequenced and assembled Escherichia coli strains. We also explored potential connections between genome expansion and the attainment of virulence factors. This was performed using estimations of several genomic parameters such as AT content, genomic drift (measured using relative entropy), genome size and estimated HGT size, which were subsequently compared to analogous parameters computed from the core genome consisting of 1729 genes common to the 53 E. coli strains. Moreover, we analyzed how selective pressures (quantified using relative entropy and dN/dS), acting on the E. coli core genome, influenced lineage and phylogroup formation. RESULTS Hierarchical clustering of dS and dN estimations from the E. coli core genome resulted in phylogenetic trees with topologies in agreement with known E. coli taxonomy and phylogroups. High values of dS, compared to dN, indicate that the E. coli core genome has been subjected to substantial purifying selection over time; significantly more than the non-core part of the genome (p<0.001). This is further supported by a linear association between strain-wise dS and dN values (β = 26.94 ± 0.44, R2~0.98, p<0.001). The non-core part of the genome was also significantly more AT-rich (p<0.001) than the core genome and E. coli genome size correlated with estimated HGT size (p<0.001). In addition, genome size (p<0.001), AT content (p<0.001) as well as estimated HGT size (p<0.005) were all associated with the presence of virulence factors, suggesting that pathogenicity traits in E. coli are largely attained through HGT. No associations were found between selective pressures operating on the E. coli core genome, as estimated using relative entropy, and genome size (p~0.98). CONCLUSIONS On a larger time frame, genome expansion in E. coli, which is significantly associated with the acquisition of virulence factors, appears to be independent of selective forces operating on the core genome.
Affiliation(s)
- Jon Bohlin
  - Division of Epidemiology, Norwegian Institute of Public Health, Marcus Thranes gate 6, P.O. Box 4404, Oslo 0403, Norway
|
20
|
Fullmer MS, Soucy SM, Swithers KS, Makkay AM, Wheeler R, Ventosa A, Gogarten JP, Papke RT. Population and genomic analysis of the genus Halorubrum. Front Microbiol 2014; 5:140. [PMID: 24782836 PMCID: PMC3990103 DOI: 10.3389/fmicb.2014.00140] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 01/03/2014] [Accepted: 03/18/2014] [Indexed: 11/13/2022] Open
Abstract
The Halobacteria are known to engage in frequent gene transfer and homologous recombination. For stably diverged lineages to persist some checks on the rate of between lineage recombination must exist. We surveyed a group of isolates from the Aran-Bidgol endorheic lake in Iran and sequenced a selection of them. Multilocus Sequence Analysis (MLSA) and Average Nucleotide Identity (ANI) revealed multiple clusters (phylogroups) of organisms present in the lake. Patterns of intein and Clustered Regularly Interspaced Short Palindromic Repeats (CRISPRs) presence/absence and their sequence similarity, GC usage along with the ANI and the identities of the genes used in the MLSA revealed that two of these clusters share an exchange bias toward others in their phylogroup while showing reduced rates of exchange with other organisms in the environment. However, a third cluster, composed in part of named species from other areas of central Asia, displayed many indications of variability in exchange partners, from within the lake as well as outside the lake. We conclude that barriers to gene exchange exist between the two purely Aran-Bidgol phylogroups, and that the third cluster with members from other regions is not a single population and likely reflects an amalgamation of several populations.
Affiliation(s)
- Matthew S. Fullmer
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Shannon M. Soucy
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Kristen S. Swithers
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
  - Department of Cell Biology, Yale School of Medicine, Yale University, New Haven, CT, USA
- Andrea M. Makkay
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Ryan Wheeler
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- Antonio Ventosa
  - Department of Microbiology and Parasitology, University of Seville, Seville, Spain
- J. Peter Gogarten
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
- R. Thane Papke
  - Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
|
21
|
Bohlin J, Brynildsrud O, Vesth T, Skjerve E, Ussery DW. Amino acid usage is asymmetrically biased in AT- and GC-rich microbial genomes. PLoS One 2013; 8:e69878. [PMID: 23922837 PMCID: PMC3724673 DOI: 10.1371/journal.pone.0069878] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 02/19/2013] [Accepted: 06/14/2013] [Indexed: 11/18/2022] Open
Abstract
INTRODUCTION Genomic base composition ranges from less than 25% AT to more than 85% AT in prokaryotes. Since only a small fraction of prokaryotic genomes is not protein coding even a minor change in genomic base composition will induce profound protein changes. We examined how amino acid and codon frequencies were distributed in over 2000 microbial genomes and how these distributions were affected by base compositional changes. In addition, we wanted to know how genome-wide amino acid usage was biased in the different genomes and how changes to base composition and mutations affected this bias. To carry this out, we used a Generalized Additive Mixed-effects Model (GAMM) to explore non-linear associations and strong data dependences in closely related microbes; principal component analysis (PCA) was used to examine genomic amino acid- and codon frequencies, while the concept of relative entropy was used to analyze genomic mutation rates. RESULTS We found that genomic amino acid frequencies carried a stronger phylogenetic signal than codon frequencies, but that this signal was weak compared to that of genomic %AT. Further, in contrast to codon usage bias (CUB), amino acid usage bias (AAUB) was differently distributed in AT- and GC-rich genomes in the sense that AT-rich genomes did not prefer specific amino acids over others to the same extent as GC-rich genomes. AAUB was also associated with relative entropy; genomes with low AAUB contained more random mutations as a consequence of relaxed purifying selection than genomes with higher AAUB. CONCLUSION Genomic base composition has a substantial effect on both amino acid- and codon frequencies in bacterial genomes. While phylogeny influenced amino acid usage more in GC-rich genomes, AT-content was driving amino acid usage in AT-rich genomes. We found the GAMM model to be an excellent tool to analyze the genomic data used in this study.
Affiliation(s)
- Jon Bohlin
  - Centre for Epidemiology and Biostatistics, Department of Food Safety and Infection Biology, Norwegian School of Veterinary Science, Oslo, Norway
|
22
|
Dutta C, Paul S. Microbial lifestyle and genome signatures. Curr Genomics 2012; 13:153-62. [PMID: 23024607 PMCID: PMC3308326 DOI: 10.2174/138920212799860698] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.6] [Received: 08/18/2011] [Revised: 09/13/2011] [Accepted: 09/28/2011] [Indexed: 12/29/2022] Open
Abstract
Microbes are known for their unique ability to adapt to varying lifestyle and environment, even to the extreme or adverse ones. The genomic architecture of a microbe may bear the signatures not only of its phylogenetic position, but also of the kind of lifestyle to which it is adapted. The present review aims to provide an account of the specific genome signatures observed in microbes acclimatized to distinct lifestyles or ecological niches. Niche-specific signatures identified at different levels of microbial genome organization like base composition, GC-skew, purine-pyrimidine ratio, dinucleotide abundance, codon bias, oligonucleotide composition etc. have been discussed. Among the specific cases highlighted in the review are the phenomena of genome shrinkage in obligatory host-restricted microbes, genome expansion in strictly intra-amoebal pathogens, strand-specific codon usage in intracellular species, acquisition of genome islands in pathogenic or symbiotic organisms, discriminatory genomic traits of marine microbes with distinct trophic strategies, and conspicuous sequence features of certain extremophiles like those adapted to high temperature or high salinity.
Affiliation(s)
- Chitra Dutta
  - Structural Biology & Bioinformatics Division, CSIR-Indian Institute of Chemical Biology, 4, Raja S. C. Mullick Road, Kolkata 700032, India
|
23
|
Bohlin J, van Passel MWJ, Snipen L, Kristoffersen AB, Ussery D, Hardy SP. Relative entropy differences in bacterial chromosomes, plasmids, phages and genomic islands. BMC Genomics 2012; 13:66. [PMID: 22325062 PMCID: PMC3305612 DOI: 10.1186/1471-2164-13-66] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Received: 11/15/2011] [Accepted: 02/10/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND We sought to assess whether the concept of relative entropy (information capacity), could aid our understanding of the process of horizontal gene transfer in microbes. We analyzed the differences in information capacity between prokaryotic chromosomes, genomic islands (GI), phages, and plasmids. Relative entropy was estimated using the Kullback-Leibler measure. RESULTS Relative entropy was highest in bacterial chromosomes and had the sequence chromosomes > GI > phage > plasmid. There was an association between relative entropy and AT content in chromosomes, phages, plasmids and GIs with the strongest association being in phages. Relative entropy was also found to be lower in the obligate intracellular Mycobacterium leprae than in the related M. tuberculosis when measured on a shared set of highly conserved genes. CONCLUSIONS We argue that relative entropy differences reflect how plasmids, phages and GIs interact with microbial host chromosomes and that all these biological entities are, or have been, subjected to different selective pressures. The rate at which amelioration of horizontally acquired DNA occurs within the chromosome is likely to account for the small differences between chromosomes and stably incorporated GIs compared to the transient or independent replicons such as phages and plasmids.
Affiliation(s)
- Jon Bohlin
  - Norwegian School of Veterinary Science, EpiCentre, Department of Food Safety and Infection Biology, Ullevålsveien 72, Oslo, Norway
|
24
|
Lamelas A, Gosalbes MJ, Manzano-Marín A, Peretó J, Moya A, Latorre A. Serratia symbiotica from the aphid Cinara cedri: a missing link from facultative to obligate insect endosymbiont. PLoS Genet 2011; 7:e1002357. [PMID: 22102823 PMCID: PMC3213167 DOI: 10.1371/journal.pgen.1002357] [Citation(s) in RCA: 141] [Impact Index Per Article: 10.8] [Received: 05/18/2011] [Accepted: 09/10/2011] [Indexed: 02/07/2023] Open
Abstract
The genome sequencing of Buchnera aphidicola BCc from the aphid Cinara cedri, which is the smallest known Buchnera genome, revealed that this bacterium had lost its symbiotic role, as it was not able to synthesize tryptophan and riboflavin. Moreover, the biosynthesis of tryptophan is shared with the endosymbiont Serratia symbiotica SCc, which coexists with B. aphidicola in this aphid. The whole-genome sequencing of S. symbiotica SCc reveals an endosymbiont in a stage of genome reduction that is closer to an obligate endosymbiont, such as B. aphidicola from Acyrthosiphon pisum, than to another S. symbiotica, which is a facultative endosymbiont in this aphid, and presents much less gene decay. The comparison between both S. symbiotica enables us to propose an evolutionary scenario of the transition from facultative to obligate endosymbiont. Metabolic inferences of B. aphidicola BCc and S. symbiotica SCc reveal that most of the functions carried out by B. aphidicola in A. pisum are now either conserved in B. aphidicola BCc or taken over by S. symbiotica. In addition, there are several cases of metabolic complementation giving functional stability to the whole consortium and evolutionary preservation of the actors involved.
Affiliation(s)
- Araceli Lamelas
  - Institut Cavanilles de Biodiversitat i Biologia Evolutiva, Universitat de València, Valencia, Spain
|
25
|
Zheng H, Wu H. Short prokaryotic DNA fragment binning using a hierarchical classifier based on linear discriminant analysis and principal component analysis. J Bioinform Comput Biol 2011; 8:995-1011. [PMID: 21121023 DOI: 10.1142/s0219720010005051] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 05/10/2010] [Accepted: 07/31/2010] [Indexed: 11/18/2022]
Abstract
Metagenomics is an emerging field in which the power of genomic analysis is applied to an entire microbial community, bypassing the need to isolate and culture individual microbial species. Assembling of metagenomic DNA fragments is very much like the overlap-layout-consensus procedure for assembling isolated genomes, but is augmented by an additional binning step to differentiate scaffolds, contigs and unassembled reads into various taxonomic groups. In this paper, we employed n-mer oligonucleotide frequencies as the features and developed a hierarchical classifier (PCAHIER) for binning short (≤ 1,000 bps) metagenomic fragments. The principal component analysis was used to reduce the high dimensionality of the feature space. The hierarchical classifier consists of four layers of local classifiers that are implemented based on the linear discriminant analysis. These local classifiers are responsible for binning prokaryotic DNA fragments into superkingdoms, of the same superkingdom into phyla, of the same phylum into genera, and of the same genus into species, respectively. We evaluated the performance of the PCAHIER by using our own simulated data sets as well as the widely used simHC synthetic metagenome data set from the IMG/M system. The effectiveness of the PCAHIER was demonstrated through comparisons against a non-hierarchical classifier, and two existing binning algorithms (TETRA and Phylopythia).
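The feature extraction step named in this abstract, n-mer oligonucleotide frequencies, can be sketched as below. This covers only the feature vector for one fragment; the PCA and layered linear discriminant classifiers of PCAHIER are not reproduced. The function name, the default n=4, and the handling of ambiguous bases are illustrative assumptions.

```python
from itertools import product


def kmer_frequency_vector(fragment, n=4):
    """Relative frequencies of all 4**n DNA n-mers in one fragment.

    Assumptions: overlapping windows, single strand only, and windows
    containing symbols other than A, C, G, T are ignored.
    """
    # Fixed lexicographic ordering so vectors from different fragments align.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=n))}
    counts = [0] * len(index)
    total = 0
    fragment = fragment.upper()
    for i in range(len(fragment) - n + 1):
        j = index.get(fragment[i:i + n])
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(index)
    return [c / total for c in counts]
```

Vectors of this kind grow as 4**n, which is why a dimensionality reduction step such as the principal component analysis mentioned in the abstract is typically applied before classification.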
Affiliation(s)
- Hao Zheng
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, 210 Technology Circle, Savannah, GA 31407, USA
|
26
|
Zheng H, Wu H. Gene-centric association analysis for the correlation between the guanine-cytosine content levels and temperature range conditions of prokaryotic species. BMC Bioinformatics 2010; 11 Suppl 11:S7. [PMID: 21172057 PMCID: PMC3024870 DOI: 10.1186/1471-2105-11-s11-s7] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022] Open
Abstract
Background The environment has been playing an instrumental role in shaping and maintaining the morphological, physiological and biochemical diversities of prokaryotes. It has been debatable whether the whole-genome Guanine-Cytosine (GC) content levels of prokaryotic organisms are correlated with their optimal growth temperatures. Since the GC content is variable within a genome, we here focus on the correlation between the genic GC content levels and the temperature range conditions of prokaryotic organisms. Results The GC content levels in the coding regions of four genes were consistently identified as correlated with the temperature range condition when the association analysis was applied to (i) the 722 mesophilic and 93 thermophilic/hyperthermophilic organisms regardless of their phylogeny, oxygen requirement, salinity, or habitat conditions, and (ii) partial lists of organisms when organisms with certain phylogeny, oxygen requirement, salinity or habitat conditions were excluded. These four genes are K01251 (adenosylhomocysteinase), K03724 (DNA repair and recombination proteins), K07588 (LAO/AO transport system kinase), and K09122 (hypothetical protein). To further validate the identified correlation relationships, we examined to what extent the temperature range condition of an organism can be predicted based on the GC content levels in the coding regions of the selected genes. The 84.52% accuracy for the complete genomes, the 84.09% accuracy for the in-progress genomes, and 82.70% accuracy for the metagenomes, especially when being compared to the 50% accuracy rendered by random guessing, suggested that the temperature range condition of a prokaryotic organism can generally be predicted based on the GC content levels of the selected genomic regions. 
Conclusions The results rendered by various statistical tests and prediction tests indicated that the GC content levels of the coding/non-coding regions of certain genes are highly likely to be correlated with the temperature range conditions of prokaryotic organisms. Therefore, it is promising to carry out “reverse ecology” and to complete the ecological characterizations of prokaryotic organisms, i.e., to infer their temperature range conditions based on the GC content levels of certain genomic regions.
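The quantity this abstract relies on, the GC content of a gene's coding region, can be sketched as below. This is only an illustration of the measure, not the authors' pipeline; the function name `gc_content` and the example sequences are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return gc / total


# Hypothetical coding sequences, for illustration only.
genes = {"geneA": "ATGGCGCCGGGC", "geneB": "ATGAAATTTTAA"}
for name, cds in genes.items():
    print(name, round(gc_content(cds), 3))
```

Ambiguous bases such as N are ignored here rather than counted in the denominator; that is a design choice of this sketch, not something the abstract specifies.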
Affiliation(s)
- Hao Zheng
- School of Electrical and Computer Engineering, Georgia Institute of Technology, USA.
|
27
|
Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics 2010; 11:544. [PMID: 21044341 PMCID: PMC3098094 DOI: 10.1186/1471-2105-11-544] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2010] [Accepted: 11/02/2010] [Indexed: 01/16/2023] Open
Abstract
BACKGROUND Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects. RESULTS We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available. CONCLUSIONS SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.
Affiliation(s)
- David R Kelley
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA
- Department of Computer Science, University of Maryland, A.V. Williams Building College Park, MD 20742, USA
- Steven L Salzberg
- Center for Bioinformatics and Computational Biology, Institute for Advanced Computer Studies, College Park, MD 20742, USA
- Department of Computer Science, University of Maryland, A.V. Williams Building College Park, MD 20742, USA
|
28
|
Dutta A, Paul S, Dutta C. GC-rich intra-operonic spacers in prokaryotes: Possible relation to gene order conservation. FEBS Lett 2010; 584:4633-8. [DOI: 10.1016/j.febslet.2010.10.037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2010] [Revised: 10/12/2010] [Accepted: 10/15/2010] [Indexed: 11/28/2022]
|
29
|
Rangannan V, Bansal M. High-quality annotation of promoter regions for 913 bacterial genomes. Bioinformatics 2010; 26:3043-50. [PMID: 20956245 DOI: 10.1093/bioinformatics/btq577] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
Abstract
MOTIVATION The number of bacterial genomes being sequenced is increasing very rapidly and hence, it is crucial to have procedures for rapid and reliable annotation of their functional elements such as promoter regions, which control the expression of each gene or each transcription unit of the genome. The present work addresses this requirement and presents a generic method applicable across organisms. RESULTS Relative stability of the DNA double helical sequences has been used to discriminate promoter regions from non-promoter regions. Based on the difference in stability between neighboring regions, an algorithm has been implemented to predict promoter regions on a large scale over 913 microbial genome sequences. The average free energy values for the promoter regions as well as their downstream regions are found to differ, depending on their GC content. Threshold values to identify promoter regions have been derived using sequences flanking a subset of translation start sites from all microbial genomes and then used to predict promoters over the complete genome sequences. An average recall value of 72% (which indicates the percentage of protein and RNA coding genes with predicted promoter regions assigned to them) and precision of 56% is achieved over the 913 microbial genome dataset. AVAILABILITY The binary executable for 'PromPredict' algorithm (implemented in PERL and supported on Linux and MS Windows) and the predicted promoter data for all 913 microbial genomes are available at http://nucleix.mbu.iisc.ernet.in/prombase/.
|
30
|
Bohlin J, Snipen L, Cloeckaert A, Lagesen K, Ussery D, Kristoffersen AB, Godfroid J. Genomic comparisons of Brucella spp. and closely related bacteria using base compositional and proteome based methods. BMC Evol Biol 2010; 10:249. [PMID: 20707916 PMCID: PMC2928237 DOI: 10.1186/1471-2148-10-249] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2010] [Accepted: 08/13/2010] [Indexed: 11/30/2022] Open
Abstract
Background Classification of bacteria within the genus Brucella has been difficult due in part to considerable genomic homogeneity between the different species and biovars, in spite of clear differences in phenotypes. Therefore, many different methods have been used to assess Brucella taxonomy. In the current work, we examine 32 sequenced genomes from genus Brucella representing the six classical species, as well as more recently described species, using bioinformatical methods. Comparisons were made at the level of genomic DNA using oligonucleotide based methods (Markov chain based genomic signatures, genomic codon and amino acid frequencies based comparisons) and proteomes (all-against-all BLAST protein comparisons and pan-genomic analyses). Results We found that the oligonucleotide based methods gave different results compared to that of the proteome based methods. Differences were also found between the oligonucleotide based methods used. Whilst the Markov chain based genomic signatures grouped the different species in genus Brucella according to host preference, the codon and amino acid frequencies based methods reflected small differences between the Brucella species. Only minor differences could be detected between all genera included in this study using the codon and amino acid frequencies based methods. Proteome comparisons were found to be in strong accordance with current Brucella taxonomy indicating a remarkable association between gene gain or loss on one hand and mutations in marker genes on the other. The proteome based methods found greater similarity between Brucella species and Ochrobactrum species than between species within genus Agrobacterium compared to each other. In other words, proteome comparisons of species within genus Agrobacterium were found to be more diverse than proteome comparisons between species in genus Brucella and genus Ochrobactrum. 
Pan-genomic analyses indicated that uptake of DNA from outside genus Brucella appears to be limited. Conclusions While both the proteome based methods and the Markov chain based genomic signatures were able to reflect environmental diversity between the different species and strains of genus Brucella, the genomic codon and amino acid frequencies based comparisons were not found adequate for such comparisons. The proteome comparison based phylogenies of the species in genus Brucella showed a surprising consistency with current Brucella taxonomy.
Affiliation(s)
- Jon Bohlin
- Norwegian School of Veterinary Science, Department of Food Safety and Infection Biology, Epicenter, Ullevålsveien 72, PO Box 8146 Dep, NO-0033 Oslo, Norway.
|
31
|
Bohlin J, Snipen L, Hardy SP, Kristoffersen AB, Lagesen K, Dønsvik T, Skjerve E, Ussery DW. Analysis of intra-genomic GC content homogeneity within prokaryotes. BMC Genomics 2010; 11:464. [PMID: 20691090 PMCID: PMC3091660 DOI: 10.1186/1471-2164-11-464] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2010] [Accepted: 08/06/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Bacterial genomes possess varying GC content (total guanines (Gs) and cytosines (Cs) per total of the four bases within the genome) but within a given genome, GC content can vary locally along the chromosome, with some regions significantly more or less GC rich than on average. We have examined how the GC content varies within microbial genomes to assess whether this property can be associated with certain biological functions related to the organism's environment and phylogeny. We utilize a new quantity GCVAR, the intra-genomic GC content variability with respect to the average GC content of the total genome. A low GCVAR indicates intra-genomic GC homogeneity and high GCVAR heterogeneity. RESULTS The regression analyses indicated that GCVAR was significantly associated with domain (i.e. archaea or bacteria), phylum, and oxygen requirement. GCVAR was significantly higher among anaerobes than both aerobic and facultative microbes. Although an association has previously been found between mean genomic GC content and oxygen requirement, our analysis suggests that no such association exists when phylogenetic bias is accounted for. A significant association between GCVAR and mean GC content was also found but appears to be non-linear and varies greatly among phyla. CONCLUSIONS Our findings show that GCVAR is linked with oxygen requirement, while mean genomic GC content is not. We therefore suggest that GCVAR should be used as a complement to mean GC content.
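The abstract describes GCVAR only as GC variability relative to the genome-wide average and gives no formula. The sketch below is therefore an assumption-laden stand-in: it uses non-overlapping windows and the mean squared deviation of window GC from the genome mean. The names `gc_fraction` and `gc_variability` and the window size are hypothetical and should not be read as the paper's definition.

```python
def gc_fraction(seq: str) -> float:
    """GC fraction over unambiguous bases; 0.0 if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def gc_variability(genome: str, window: int = 100) -> float:
    """Mean squared deviation of per-window GC from the genome-wide GC.

    Assumption: non-overlapping windows; a trailing partial window is dropped.
    """
    mean_gc = gc_fraction(genome)
    windows = [genome[i:i + window]
               for i in range(0, len(genome) - window + 1, window)]
    if not windows:
        raise ValueError("genome is shorter than one window")
    return sum((gc_fraction(w) - mean_gc) ** 2 for w in windows) / len(windows)


# Toy sequences: one compositionally uniform, one split into GC-rich and AT-rich halves.
homogeneous = "ACGT" * 50
heterogeneous = "G" * 100 + "A" * 100
print(gc_variability(homogeneous), gc_variability(heterogeneous))
```

Both toy sequences have the same mean GC content, yet only the second shows non-zero variability, which is the distinction between mean GC and intra-genomic heterogeneity that the abstract draws.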
Affiliation(s)
- Jon Bohlin
- Norwegian School of Veterinary Science, Department of Food Safety and Infection Biology, Ullevålsveien 72, P.O. Box 8146 Dep, NO-0033 Oslo, Norway.
|
32
|
Davenport C, Ussery DW, Tümmler B. Comparative genomics of green sulfur bacteria. PHOTOSYNTHESIS RESEARCH 2010; 104:137-152. [PMID: 20099081 DOI: 10.1007/s11120-009-9515-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2009] [Accepted: 12/07/2009] [Indexed: 05/28/2023]
Abstract
Eleven completely sequenced Chlorobi genomes were compared in oligonucleotide usage, gene contents, and synteny. The green sulfur bacteria (GSB) are equipped with a core genome that sustains their anoxygenic phototrophic lifestyle by photosynthesis, sulfur oxidation, and CO2 fixation. Whole-genome gene family and single gene sequence comparisons yielded similar phylogenetic trees of the sequenced chromosomes indicating a concerted vertical evolution of large gene sets. Chromosomal synteny of genes is not preserved in the phylum Chlorobi. The accessory genome is characterized by anomalous oligonucleotide usage and endows the strains with individual features for transport, secretion, cell wall, extracellular constituents, and a few elements of the biosynthetic apparatus. Giant genes are a peculiar feature of the genera Chlorobium and Prosthecochloris. The predicted proteins have a huge molecular weight of 10^6, and are probably instrumental for the bacteria to generate their own intimate (micro)environment.
Affiliation(s)
- Colin Davenport
- Klinische Forschergruppe, Klinik für Pädiatrische Pneumologie und Neonatologie, Medizinische Hochschule Hannover, Carl-Neuberg-Strasse 1, Hannover, Germany
|
33
|
Davenport CF, Tümmler B. Abundant oligonucleotides common to most bacteria. PLoS One 2010; 5:e9841. [PMID: 20352124 PMCID: PMC2843746 DOI: 10.1371/journal.pone.0009841] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2009] [Accepted: 03/03/2010] [Indexed: 11/25/2022] Open
Abstract
Background Bacteria show a bias in their genomic oligonucleotide composition far beyond that dictated by G+C content. Patterns of over- and underrepresented oligonucleotides carry a phylogenetic signal and are thus diagnostic for individual species. Patterns of short oligomers have been investigated by multiple groups in large numbers of bacterial genomes. However, global distributions of the most highly overrepresented mid-sized oligomers have not been assessed across all prokaryotes to date. We surveyed overrepresented mid-length oligomers across all prokaryotes and normalised for base composition and embedded oligomers using zero and second order Markov models. Principal Findings Here we report a presumably ancient set of oligomers conserved and overrepresented in nearly all branches of prokaryotic life, including Archaea. These oligomers are either adenine rich homopurines with one to three guanine nucleosides, or homopyrimidines with one to four cytosine nucleosides. They do not show a consistent preference for coding or non-coding regions or aggregate in any coding frame, implying a role in DNA structure and as polypeptide binding sites. Structural parameters indicate these oligonucleotides to be an extreme and rigid form of B-DNA prone to forming triple stranded helices under common physiological conditions. Moreover, the narrow minor grooves of these structures are recognised by DNA binding and nucleoid associated proteins such as HU. Conclusion Homopurine and homopyrimidine oligomers exhibit distinct and unusual structural features and are present at high copy number in nearly all prokaryotic lineages. This fact suggests a non-neutral role of these oligonucleotides for bacterial genome organization that has been maintained throughout evolution.
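Normalising oligomer counts for base composition with a zero-order Markov model amounts to comparing each observed count with the count expected if bases were drawn independently at their genome-wide frequencies. The sketch below shows that ratio only; it does not attempt the second-order model or the correction for embedded oligomers mentioned in the abstract, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count all overlapping k-mers in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def zero_order_ratio(seq: str, k: int) -> dict:
    """Observed/expected ratio per k-mer under independent base frequencies."""
    seq = seq.upper()
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    positions = n - k + 1  # number of overlapping k-mer start positions
    ratios = {}
    for kmer, observed in kmer_counts(seq, k).items():
        if any(ch not in base_freq for ch in kmer):
            continue  # skip k-mers containing ambiguous bases such as N
        p = 1.0
        for ch in kmer:
            p *= base_freq[ch]
        ratios[kmer] = observed / (p * positions)
    return ratios


# Toy sequence, for illustration only.
for kmer, ratio in sorted(zero_order_ratio("ACACACGTTT", 2).items()):
    print(kmer, round(ratio, 2))
```

A ratio well above 1 flags a k-mer as overrepresented relative to base composition alone; on real genomes a significance threshold would also be needed, which this sketch leaves out.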
Affiliation(s)
- Colin F Davenport
- Pediatric Pneumology and Neonatology, Hanover Medical School, Hanover, Lower Saxony, Germany.
|
34
|
Association analysis of the general environmental conditions and prokaryotes' gene distributions in various functional groups. Genomics 2010; 96:27-38. [PMID: 20338234 DOI: 10.1016/j.ygeno.2010.03.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2009] [Revised: 03/14/2010] [Accepted: 03/16/2010] [Indexed: 12/15/2022]
Abstract
The activities of prokaryotes are pivotal in shaping the environment, and at the same time are greatly influenced by the environment. By using the genomic data and environmental descriptions of the complete prokaryotic genomes in NCBI's Microbial Genome Project Database and applying statistical methods, we have identified in a systematic manner those gene groups whose presence/frequency patterns are different for organisms of different environmental conditions. Here environmental conditions are characterized in four dimensions--salinity, oxygen requirement, habitat and temperature, and are based on the controlled vocabularies that NCBI's Microbial Genome Project database uses to specify the organism information; and, gene groups are determined as Clusters of Orthologous Groups (COG) and KEGG Orthology (KO) groups. These identified COG and KO groups are considered as potentially correlated with certain environmental conditions, and are then mapped to the COG general categories and KEGG pathways to determine which part of the functional machinery of prokaryotic cells are correlated with the environments. The observations derived from the analysis of the COG and KO groups that are potentially correlated with the oxygen requirement and habitat conditions are in general consistent with existing studies on properties of organisms living in different conditions of these two environmental factors. To further assess the identified correlation relationships, we have also examined whether the environmental conditions are predictable based on the gene distributions in the selected COG and KO groups. The misclassification rates of the prediction experiments are much smaller than that rendered by random guessing, indicating the existence of the correlation relationships between organisms' environmental conditions and gene distributions in certain functional groups. 
However, the rather moderate misclassification rates (the 25- and 75-percentiles of the misclassification rates of all prediction experiments are 16.79% and 24.06%, respectively) also indicate that the correlation relationships between environmental conditions and gene distributions in certain functional groups are not strong enough for one to decisively define the other.
|
35
|
Perry SC, Beiko RG. Distinguishing microbial genome fragments based on their composition: evolutionary and comparative genomic perspectives. Genome Biol Evol 2010; 2:117-31. [PMID: 20333228 PMCID: PMC2839357 DOI: 10.1093/gbe/evq004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/19/2010] [Indexed: 01/23/2023] Open
Abstract
It is well known that patterns of nucleotide composition vary within and among genomes, although the reasons why these variations exist are not completely understood. Between-genome compositional variation has been exploited to assign environmental shotgun sequences to their most likely originating genomes, whereas within-genome variation has been used to identify recently acquired genetic material such as pathogenicity islands. Recent sequence assignment techniques have achieved high levels of accuracy on artificial data sets, but the relative difficulty of distinguishing lineages with varying degrees of relatedness, and different types of genomic sequence, has not been examined in depth. We investigated the compositional differences in a set of 774 sequenced microbial genomes, finding rapid divergence among closely related genomes, but also convergence of compositional patterns among genomes with similar habitats. Support vector machines were then used to distinguish all pairs of genomes based on genome fragments 500 nucleotides in length. The nearly 300,000 accuracy scores obtained from these trials were used to construct general models of distinguishability versus taxonomic and compositional indices of genomic divergence. Unusual genome pairs were evident from their large residuals relative to the fitted model, and we identified several factors including genome reduction, putative lateral genetic transfer, and habitat convergence that influence the distinguishability of genomes. The positional, compositional, and functional context of a fragment within a genome has a strong influence on its likelihood of correct classification, but in a way that depends on the taxonomic and ecological similarity of the comparator genome.
Affiliation(s)
- Scott C Perry
- Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
|
36
|
Examination of genome homogeneity in prokaryotes using genomic signatures. PLoS One 2009; 4:e8113. [PMID: 19956556 PMCID: PMC2781299 DOI: 10.1371/journal.pone.0008113] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2009] [Accepted: 11/05/2009] [Indexed: 01/17/2023] Open
Abstract
BACKGROUND DNA word frequencies, normalized for genomic AT content, are remarkably stable within prokaryotic genomes and are therefore said to reflect a "genomic signature." The genomic signatures can be used to phylogenetically classify organisms from arbitrary sampled DNA. Genomic signatures can also be used to search for horizontally transferred DNA or DNA regions subjected to special selection forces. Thus, the stability of the genomic signature can be used as a measure of genomic homogeneity. The factors associated with the stability of the genomic signatures are not known, and this motivated us to investigate further. We analyzed the intra-genomic variance of genomic signatures based on AT content normalization (0th-order Markov model) as well as genomic signatures normalized by smaller DNA words (1st- and 2nd-order Markov models) for 636 sequenced prokaryotic genomes. Regression models were fitted, with intra-genomic signature variance as the response variable, to a set of factors representing genomic properties such as genomic AT content, genome size, habitat, phylum, oxygen requirement, optimal growth temperature and oligonucleotide usage variance (OUV, a measure of oligonucleotide usage bias), measured as the variance between genomic tetranucleotide frequencies and Markov chain approximated tetranucleotide frequencies, as predictors. PRINCIPAL FINDINGS Regression analysis revealed that OUV was the most important factor (p<0.001) determining intra-genomic homogeneity as measured using genomic signatures. This means that the less random the oligonucleotide usage is in the sense of higher OUV, the more homogeneous the genome is in terms of the genomic signature. The other factors influencing variance in the genomic signature (p<0.001) were genomic AT content, phylum and oxygen requirement. 
CONCLUSIONS Genomic homogeneity in prokaryotes is intimately linked to genomic GC content, oligonucleotide usage bias (OUV) and aerobiosis, while oligonucleotide usage bias (OUV) is associated with genomic GC content, aerobiosis and habitat.
|
37
|
Bohlin J, Skjerve E, Ussery DW. Analysis of genomic signatures in prokaryotes using multinomial regression and hierarchical clustering. BMC Genomics 2009; 10:487. [PMID: 19845945 PMCID: PMC2770534 DOI: 10.1186/1471-2164-10-487] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2009] [Accepted: 10/21/2009] [Indexed: 11/26/2022] Open
Abstract
Background Recently there has been an explosion in the availability of bacterial genomic sequences, making possible now an analysis of genomic signatures across more than 800 different bacterial chromosomes, from a wide variety of environments. Using genomic signatures, we pair-wise compared 867 different genomic DNA sequences, taken from chromosomes and plasmids more than 100,000 base-pairs in length. Hierarchical clustering was performed on the outcome of the comparisons before a multinomial regression model was fitted. The regression model included the cluster groups as the response variable with AT content, phyla, growth temperature, selective pressure, habitat, sequence size, oxygen requirement and pathogenicity as predictors. Results Many significant factors were associated with the genomic signature, most notably AT content. Phyla was also an important factor, although considerably less so than AT content. Small improvements to the regression model, although significant, were also obtained by factors such as sequence size, habitat, growth temperature, selective pressure measured as oligonucleotide usage variance, and oxygen requirement. Conclusion The statistics obtained using hierarchical clustering and multinomial regression analysis indicate that the genomic signature is shaped by many factors, and this may explain the varying ability to classify prokaryotic organisms below genus level.
Affiliation(s)
- Jon Bohlin
- Norwegian School of Veterinary Science, Oslo, Norway.
|
38
|
Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP, Banfield JF. Community-wide analysis of microbial genome sequence signatures. Genome Biol 2009; 10:R85. [PMID: 19698104 PMCID: PMC2745766 DOI: 10.1186/gb-2009-10-8-r85] [Citation(s) in RCA: 373] [Impact Index Per Article: 24.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2009] [Revised: 07/10/2009] [Accepted: 08/21/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Analyses of DNA sequences from cultivated microorganisms have revealed genome-wide, taxa-specific nucleotide compositional characteristics, referred to as genome signatures. These signatures have far-reaching implications for understanding genome evolution and potential application in classification of metagenomic sequence fragments. However, little is known regarding the distribution of genome signatures in natural microbial communities or the extent to which environmental factors shape them. RESULTS We analyzed metagenomic sequence data from two acidophilic biofilm communities, including composite genomes reconstructed for nine archaea, three bacteria, and numerous associated viruses, as well as thousands of unassigned fragments from strain variants and low-abundance organisms. Genome signatures, in the form of tetranucleotide frequencies analyzed by emergent self-organizing maps, segregated sequences from all known populations sharing < 50 to 60% average amino acid identity and revealed previously unknown genomic clusters corresponding to low-abundance organisms and a putative plasmid. Signatures were pervasive genome-wide. Clusters were resolved because intra-genome differences resulting from translational selection or protein adaptation to the intracellular (pH approximately 5) versus extracellular (pH approximately 1) environment were small relative to inter-genome differences. We found that these genome signatures stem from multiple influences but are primarily manifested through codon composition, which we propose is the result of genome-specific mutational biases. CONCLUSIONS An important conclusion is that shared environmental pressures and interactions among coevolving organisms do not obscure genome signatures in acid mine drainage communities. 
Thus, genome signatures can be used to assign sequence fragments to populations, an essential prerequisite if metagenomics is to provide ecological and biochemical insights into the functioning of microbial communities.
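The genome signature used in this study takes the form of tetranucleotide frequencies. A minimal sketch of turning one sequence fragment into such a 256-dimensional frequency vector is given below; the emergent self-organizing map step is not shown, and the function name and the handling of ambiguous bases are assumptions of this sketch rather than details from the paper.

```python
from itertools import product

# All 256 tetranucleotides in a fixed (lexicographic) order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(fragment: str) -> list:
    """Relative frequency of each tetranucleotide, counted over overlapping windows."""
    fragment = fragment.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    for i in range(len(fragment) - 3):
        word = fragment[i:i + 4]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
            total += 1
    if total == 0:
        raise ValueError("fragment has no unambiguous tetranucleotide")
    return [counts[t] / total for t in TETRAMERS]


# Toy fragment, for illustration only.
vec = tetranucleotide_frequencies("ACGTACGTTTGACA")
print(len(vec), round(sum(vec), 6))
```

Vectors built this way for many fragments could then be passed to a clustering or mapping method of choice; fragments from the same genome are expected to lie close together if the signature is pervasive genome-wide, as the abstract reports.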
Affiliation(s)
- Gregory J Dick
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Current address: Department of Geological Sciences, University of Michigan, 1100 N. University Ave, Ann Arbor, MI 48109-1005, USA
- Anders F Andersson
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Current address: Evolutionary Biology Centre, Department of Limnology, Uppsala University, Norbyv. 18 D, SE-75236, Uppsala, Sweden
- Current address: Department of Bacteriology, Swedish Institute for Infectious Disease Control, Nobels väg 18 SE-17182 Solna, Sweden
- Brett J Baker
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Sheri L Simmons
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Brian C Thomas
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- A Pepper Yelton
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Jillian F Banfield
- Department of Earth and Planetary Science, University of California, 307 McCone Hall, Berkeley, CA 94720, USA
- Department of Environmental Science, Policy, and Management, University of California, Hilgard Hall, Berkeley, CA 94720, USA
|
39
|
Sekse C, Muniesa M, Wasteson Y. Conserved Stx2 phages from Escherichia coli O103:H25 isolated from patients suffering from hemolytic uremic syndrome. Foodborne Pathog Dis 2009; 5:801-10. [PMID: 19014273 DOI: 10.1089/fpd.2008.0130] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND One of the main virulence factors produced by Shiga toxin-producing Escherichia coli is the Shiga toxin (Stx), which is encoded on lambdoid phages (Stx phage). In Norway, an outbreak of hemorrhagic colitis and hemolytic uremic syndrome (HUS) caused by E. coli O103:H25 was reported during the winter of 2006, but stx2-positive isolates were only retrieved from two human samples. METHODS Isolates of E. coli O103:H25 from patients with HUS in Norway, including sporadic cases and the outbreak cases, were investigated for the presence of phages encoding stx2. The induced Stx phages were characterized morphologically and genetically, and the host susceptibility for these phages of various E. coli O103 isolates, including O103:H25 stx2-negative isolates from the outbreak, was tested by a plaque assay. RESULTS The Stx2 phages in this study are very closely related in terms of morphology, sequence identity, and host infectivity. There may be a conserved phage within the E. coli O103:H25 population. CONCLUSIONS It is proposed that the Stx2 phage, present in the environment either as free phage particles or within a limited pool of Stx-producing E. coli O103 strains, have infected or integrated in the stx2-negative E. coli O103:H25 isolates from the Norwegian outbreak.
Affiliation(s)
- Camilla Sekse
- Department of Food Safety and Infection Biology, Norwegian School of Veterinary Science, Oslo, Norway
|
40
|
TACOA: taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics 2009; 10:56. [PMID: 19210774 PMCID: PMC2653487 DOI: 10.1186/1471-2105-10-56] [Citation(s) in RCA: 142] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2008] [Accepted: 02/11/2009] [Indexed: 02/03/2023] Open
Abstract
Background: Metagenomics, the sequencing and analysis of the collective genomes (metagenomes) of microorganisms isolated from an environment, promises direct access to the "unculturable majority". This emerging field has the potential to lay a solid basis for our understanding of the entire living world. However, taxonomic classification, an essential task in the analysis of metagenomic data sets, is still far from being solved. We present a novel strategy to predict the taxonomic origin of environmental genomic fragments. The proposed classifier combines the idea of the k-nearest neighbor with strategies from kernel-based learning.
Results: Our strategy was extensively evaluated using leave-one-out cross-validation on fragments of variable length (800 bp to 50 Kbp) from 373 completely sequenced genomes. TACOA classifies genomic fragments of length 800 bp and 1 Kbp with high accuracy down to the rank of class. For longer fragments (≥ 3 Kbp), accurate predictions are made at even deeper taxonomic ranks (order and genus). Remarkably, TACOA also produces reliable results when the taxonomic origin of a fragment is not represented in the reference set, classifying such fragments into their known broader taxonomic group or simply as "unknown". We compared the classification accuracy of TACOA with that of the latest intrinsic classifier, PhyloPythia, using 63 recently published complete genomes. For fragments of length 800 bp and 1 Kbp, the overall accuracy of TACOA is higher than that obtained by PhyloPythia at all taxonomic ranks. For all fragment lengths, both methods achieved comparably high specificity down to the rank of class, together with low false negative rates.
Conclusion: An accurate multi-class taxonomic classifier was developed for environmental genomic fragments. TACOA can predict with high reliability the taxonomic origin of genomic fragments as short as 800 bp. The proposed method is transparent, fast, and accurate, and the reference set can be easily updated as newly sequenced genomes become available. Moreover, the method proved competitive with the most current classifier, PhyloPythia, and has the advantage that it can be installed locally and its reference set kept up to date.
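The idea of combining a nearest-neighbor vote with a kernel, as described in the abstract above, can be illustrated with a minimal sketch. This is not the TACOA implementation: the tetranucleotide feature vector, the Gaussian kernel, its width, the `min_score` cut-off, and all function names are assumptions chosen only for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 4  # oligonucleotide length (assumption, not taken from the abstract)
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq, k=K):
    """Relative frequencies of all 4**k k-mers in a fragment, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in KMERS) or 1  # avoid division by zero
    return [counts[m] / total for m in KMERS]


def gaussian_kernel(x, y, width=0.05):
    """Similarity in (0, 1]; the kernel choice and width are illustrative."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * width ** 2))


def classify(fragment, references, k_neighbors=3, min_score=1e-6):
    """Kernel-weighted vote among the k most similar reference sequences.

    references: list of (taxon_label, sequence) pairs (hypothetical format).
    Returns "unknown" when no taxon collects at least min_score of weight.
    """
    query = kmer_profile(fragment)
    scored = sorted(
        ((gaussian_kernel(query, kmer_profile(seq)), label)
         for label, seq in references),
        reverse=True,
    )[:k_neighbors]
    votes = Counter()
    for weight, label in scored:
        votes[label] += weight
    if not votes:
        return "unknown"
    label, score = votes.most_common(1)[0]
    return label if score >= min_score else "unknown"
```

Weighting each neighbor by its kernel similarity, rather than counting neighbors equally, is what lets a sketch like this return "unknown" when every reference is far from the query.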
|
41
|
Davenport CF, Wiehlmann L, Reva ON, Tümmler B. Visualization of Pseudomonas genomic structure by abundant 8-14mer oligonucleotides. Environ Microbiol 2009; 11:1092-104. [PMID: 19161433 DOI: 10.1111/j.1462-2920.2008.01839.x] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 12/29/2022]
Abstract
Under- and over-represented mono- to hexanucleotides are signatures of bacterial genomes, but the compositional biases of octa- to tetradecanucleotides have not yet been explored. Thirteen completely sequenced genomes of the Pseudomonas genus were searched for highly overrepresented 8-14mers. Between 59 and 989 overrepresented 8-14mers per genome were found to exceed the applied threshold value. All genomic data sets of the 13 strains showed a consistent pattern, with individual oligomers clustering in either non-coding or coding regions. Non-coding oligonucleotides were typically part of longer repeats. Coding oligonucleotides were evenly distributed in the core genome, preferred one reading frame, and matched the local tetranucleotide usage patterns. Genomic islands were recognized by the depletion of overrepresented oligonucleotides. Several mainly coding 8-14mers occurred in genomes on average every 10 000 bp or less. Such frequently occurring 8-14mers could become useful markers for species identification. With next-generation ultra-high-throughput DNA sequencing, the composition of bacterial metagenomes may in the future be quantified by scanning the primary sequence reads for these 8-14mer markers.
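A search for overrepresented oligonucleotides of the kind described above can be sketched as follows. The abstract does not state how expected counts or the threshold were defined, so this sketch assumes a simple null model (independent bases drawn at the sequence's own base frequencies) and a hypothetical `min_ratio` cut-off on observed/expected counts; it is not the authors' procedure.

```python
from collections import Counter


def overrepresented_kmers(genome, k=8, min_ratio=5.0):
    """Return {kmer: (observed, expected)} for k-mers with observed/expected >= min_ratio.

    Expected counts assume independent bases at the genome's own base
    frequencies (an illustrative null model, not the published one).
    """
    genome = genome.upper()
    n_windows = len(genome) - k + 1
    if n_windows <= 0:
        return {}
    base_freq = {b: genome.count(b) / len(genome) for b in "ACGT"}
    counts = Counter(genome[i:i + k] for i in range(n_windows))
    hits = {}
    for kmer, observed in counts.items():
        if any(b not in base_freq for b in kmer):
            continue  # skip windows containing ambiguous bases such as N
        expected = float(n_windows)
        for b in kmer:
            expected *= base_freq[b]
        if observed / expected >= min_ratio:
            hits[kmer] = (observed, expected)
    return hits
```

For real genomes a higher-order background (for example, one based on shorter oligomer frequencies) would likely be more appropriate, since the independent-base model ignores local composition.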
Affiliation(s)
- Colin F Davenport
- Klinische Forschergruppe, OE 6711, Medizinische Hochschule Hannover, Hanover, Germany.
|