1
Barr JJ, Dutilh BE, Skennerton CT, Fukushima T, Hastie ML, Gorman JJ, Tyson GW, Bond PL. Metagenomic and metaproteomic analyses of Accumulibacter phosphatis-enriched floccular and granular biofilm. Environ Microbiol 2015; 18:273-87. [PMID: 26279094 DOI: 10.1111/1462-2920.13019] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 02/28/2015] [Revised: 06/30/2015] [Accepted: 08/11/2015] [Indexed: 11/30/2022]
Abstract
Biofilms are ubiquitous in nature, forming diverse adherent microbial communities that perform a plethora of functions. Here we operated two laboratory-scale sequencing batch reactors enriched with Candidatus Accumulibacter phosphatis (Accumulibacter) performing enhanced biological phosphorus removal. Reactors formed two distinct biofilms, one floccular biofilm, consisting of small, loose, microbial aggregates, and one granular biofilm, forming larger, dense, spherical aggregates. Using metagenomic and metaproteomic methods, we investigated the proteomic differences between these two biofilm communities, identifying a total of 2022 unique proteins. To understand biofilm differences, we compared protein abundances that were statistically enriched in both biofilm states. Floccular biofilms were enriched with pathogenic secretion systems suggesting a highly competitive microbial community. Comparatively, granular biofilms revealed a high-stress environment with evidence of nutrient starvation, phage predation pressure, and increased extracellular polymeric substance and cell lysis. Granular biofilms were enriched in outer membrane transport proteins to scavenge the extracellular milieu for amino acids and other metabolites, likely released through cell lysis, to supplement metabolic pathways. This study provides the first detailed proteomic comparison between Accumulibacter-enriched floccular and granular biofilm communities, proposes a conceptual model for the granule biofilm, and offers novel insights into granule biofilm formation and stability.
Affiliation(s)
- Jeremy J Barr
- Department of Biology, San Diego State University, San Diego, CA, USA; Advanced Water Management Centre (AWMC), The University of Queensland, Brisbane, Qld, Australia; Environmental Biotechnology Cooperative Research Centre (EBCRC), Sydney, NSW, Australia
- Bas E Dutilh
- Theoretical Biology and Bioinformatics, Utrecht University, Utrecht, The Netherlands; Centre for Molecular and Biomedical Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, Nijmegen, The Netherlands; Department of Marine Biology, Institute of Biology, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
- Connor T Skennerton
- Advanced Water Management Centre (AWMC), The University of Queensland, Brisbane, Qld, Australia; Australian Centre for Ecogenomics, School of Chemistry and Molecular Bioscience, The University of Queensland, Brisbane, Qld, Australia; Division of Geological and Planetary Sciences, California Institute of Technology, Pasadena, CA, USA
- Toshikazu Fukushima
- Advanced Water Management Centre (AWMC), The University of Queensland, Brisbane, Qld, Australia; Division of Environmental Studies, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan
- Marcus L Hastie
- Protein Discovery Centre, Queensland Institute of Medical Research (QIMR) Berghofer Medical Research Institute, Herston, Qld, Australia
- Jeffrey J Gorman
- Protein Discovery Centre, Queensland Institute of Medical Research (QIMR) Berghofer Medical Research Institute, Herston, Qld, Australia
- Gene W Tyson
- Advanced Water Management Centre (AWMC), The University of Queensland, Brisbane, Qld, Australia; Australian Centre for Ecogenomics, School of Chemistry and Molecular Bioscience, The University of Queensland, Brisbane, Qld, Australia
- Philip L Bond
- Advanced Water Management Centre (AWMC), The University of Queensland, Brisbane, Qld, Australia; Environmental Biotechnology Cooperative Research Centre (EBCRC), Sydney, NSW, Australia
2
Wang JD. Comparing virus classification using genomic materials according to different taxonomic levels. J Bioinform Comput Biol 2013; 11:1343003. [PMID: 24372032 DOI: 10.1142/s0219720013430038] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/18/2022]
Abstract
In this paper, three genomic materials (DNA sequences, protein sequences, and regions, i.e. domains) are used to compare methods of virus classification. Virus classes (categories) are divided by taxonomic level into three datasets covering 6 orders, 42 families, and 33 genera. To increase the robustness and comparability of the experimental results, only classes that contain at least 10 instances are selected, and each instance must contain at least one region name. Experimental results show that the approach using region names achieved the best accuracies, reaching 99.9%, 97.3%, and 99.0% for 6 orders, 42 families, and 33 genera, respectively. This paper not only involves exhaustive experiments that compare virus classifications using different genomic materials, but also proposes a novel approach to biological classification based on molecular biology instead of traditional morphology.
Affiliation(s)
- Jing-Doo Wang
- Department of Computer Science and Information Engineering, Asia University, No 500, Lioufeng Road Wufeng, Taichung 41354, Taiwan
3
Dutilh BE, Backus L, Edwards RA, Wels M, Bayjanov JR, van Hijum SAFT. Explaining microbial phenotypes on a genomic scale: GWAS for microbes. Brief Funct Genomics 2013; 12:366-80. [PMID: 23625995 PMCID: PMC3743258 DOI: 10.1093/bfgp/elt008] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Indexed: 12/24/2022]
Abstract
There is an increasing availability of complete or draft genome sequences for microbial organisms. These data form a potentially valuable resource for genotype-phenotype association and gene function prediction, provided that phenotypes are consistently annotated for all the sequenced strains. In this review, we address the requirements for successful gene-trait matching. We outline a basic protocol for microbial functional genomics, including genome assembly, annotation of genotypes (including single nucleotide polymorphisms, orthologous groups and prophages), data pre-processing, genotype-phenotype association, visualization and interpretation of results. The methodologies for association described herein can be applied to other data types, opening up possibilities to analyze transcriptome-phenotype associations, and correlate microbial population structure or activity, as measured by metagenomics, to environmental parameters.
Affiliation(s)
- Bas E Dutilh
- CMBI, NCMLS, Radboud University Medical Centre, Geert Grooteplein 28, 6525 GA Nijmegen, The Netherlands.
4
Gori F, Tringe SG, Folino G, van Hijum SAFT, Op den Camp HJM, Jetten MSM, Marchiori E. Differences in sequencing technologies improve the retrieval of anammox bacterial genome from metagenomes. BMC Genomics 2013; 14:7. [PMID: 23324532 PMCID: PMC3618311 DOI: 10.1186/1471-2164-14-7] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 03/19/2012] [Accepted: 12/13/2012] [Indexed: 11/27/2022]
Abstract
Background: Sequencing technologies have different biases in single-genome sequencing and metagenomic sequencing; these can significantly affect ORF recovery and the population distribution of a metagenome. In this paper we investigate how well different technologies represent information related to an organism of interest in a metagenome, and whether it is beneficial to combine information obtained using different technologies. We comparatively analyze three metagenomic datasets acquired from a sample containing the anammox bacterium Candidatus 'Brocadia fulgida' (B. fulgida). These datasets were obtained using Roche 454 FLX and Sanger sequencing with two different libraries (shotgun and fosmid). Results: In each dataset, the abundance of the reads annotated to B. fulgida was much lower than the abundance expected from available cell count information. This was due to the overrepresentation of GC-richer organisms, as shown by the GC-content distribution of the reads. Nevertheless, by considering the union of B. fulgida reads over the three datasets, the number of B. fulgida ORFs recovered for at least 80% of their length was twice the number recovered by the best technology. Indeed, while taxonomic distributions of reads in the three datasets were similar, the respective sets of B. fulgida ORFs recovered for a large part of their length were highly different, and depth of coverage patterns of 454 and Sanger were dissimilar. Conclusions: Precautions should be taken to prevent the overrepresentation of GC-rich microbes in the datasets. This overrepresentation and the consistency of the taxonomic distributions of reads obtained with different sequencing technologies suggest that, in general, abundance biases might be mainly due to other steps of the sequencing protocols. Results show that biases against organisms of interest could be compensated for by combining different sequencing technologies, owing to the differences in their genome-level sequencing biases, even when the species is present at similar abundances in the metagenomes.
Affiliation(s)
- Fabio Gori
- Radboud University Nijmegen, Institute for Computing and Information Science, Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands.
5
Boleij A, Dutilh BE, Kortman GAM, Roelofs R, Laarakkers CM, Engelke UF, Tjalsma H. Bacterial responses to a simulated colon tumor microenvironment. Mol Cell Proteomics 2012; 11:851-62. [PMID: 22713208 DOI: 10.1074/mcp.m112.019315] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Indexed: 01/20/2023]
Abstract
One of the few bacteria that have been consistently linked to colorectal cancer (CRC) is the opportunistic pathogen Streptococcus gallolyticus. Infections with this bacterium are generally regarded as an indicator for colonic malignancy, while the carriage rate of this bacterium in the healthy large intestine is relatively low. We speculated that the physiological changes accompanying the development of CRC might favor the colonization of this bacterium. To investigate whether colon tumor cells can support the survival of S. gallolyticus, this bacterium was grown in spent medium of malignant colonocytes to simulate the altered metabolic conditions in the CRC microenvironment. These in vitro simulations indicated that S. gallolyticus had a significant growth advantage in these spent media, which was not observed for other intestinal bacteria. Under these conditions, bacterial responses were profiled by proteome analysis and metabolic shifts were analyzed by ¹H-NMR spectroscopy. In silico pathway analysis of the differentially expressed proteins and metabolite analysis indicated that this advantage resulted from the increased utilization of glucose, glucose derivatives, and alanine. Together, these data suggest that tumor cell metabolites facilitate the survival of S. gallolyticus, favoring its local outgrowth and providing a possible explanation for the specific association of S. gallolyticus with colonic malignancy.
Affiliation(s)
- Annemarie Boleij
- Department of Laboratory Medicine/830, Radboud University Medical Centre, 6500 HB Nijmegen, the Netherlands
6
Linear normalised hash function for clustering gene sequences and identifying reference sequences from multiple sequence alignments. Microbial Informatics and Experimentation 2012; 2:2. [PMID: 22587938 PMCID: PMC3351711 DOI: 10.1186/2042-5783-2-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/05/2011] [Accepted: 01/26/2012] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Comparative genomics has put additional demands on the assessment of similarity between sequences and their clustering as a means for classification. However, defining the optimal number of clusters, cluster density and boundaries for sets of potentially related sequences of genes with variable degrees of polymorphism remains a significant challenge. The aim of this study was to develop a method that would identify the cluster centroids and the optimal number of clusters for a given sensitivity level and could work equally well for different sequence datasets. RESULTS: A novel method that combines the linear mapping hash function and multiple sequence alignment (MSA) was developed. This method takes advantage of the sequences in the MSA output already being sorted by similarity, and identifies the optimal number of clusters, cluster cut-offs, and cluster centroids that can represent reference gene vouchers for the different species. The linear mapping hash function can map a distance matrix that is already ordered by similarity to indices, revealing gaps in the values around which the optimal cut-offs of the different clusters can be identified. The method was evaluated using sets of closely related (16S rRNA gene sequences of Nocardia species) and highly variable (VP1 genomic region of Enterovirus 71) sequences and outperformed existing unsupervised machine learning clustering methods and dimensionality reduction methods. This method does not require prior knowledge of the number of clusters or the distance between clusters, handles clusters of different sizes and shapes, and scales linearly with the dataset. CONCLUSIONS: The combination of MSA with the linear mapping hash function is a computationally efficient way of gene sequence clustering and can be a valuable tool for the assessment of similarity, clustering of different microbial genomes, identifying reference sequences, and for the study of evolution of bacteria and viruses.
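The gap idea mentioned in this abstract can be illustrated with a much simplified sketch. It is not the authors' linear mapping hash function: it only assumes that distances to a reference sequence are already sorted in ascending order, and it cuts a new cluster wherever two neighbouring values differ by more than a chosen threshold. The function names `gap_clusters` and `centroid` and the parameter `min_gap` are hypothetical and introduced here purely for illustration.

```python
from typing import List


def gap_clusters(sorted_values: List[float], min_gap: float) -> List[List[float]]:
    """Split an ascending list of distances into clusters at large gaps.

    Simplified illustration only: a cut is placed between two neighbouring
    values whenever their difference exceeds `min_gap` (a hypothetical,
    user-chosen sensitivity threshold).
    """
    if not sorted_values:
        return []
    clusters = [[sorted_values[0]]]
    for prev, cur in zip(sorted_values, sorted_values[1:]):
        if cur - prev > min_gap:
            clusters.append([cur])      # gap found: start a new cluster
        else:
            clusters[-1].append(cur)    # no gap: extend the current cluster
    return clusters


def centroid(cluster: List[float]) -> float:
    """Return the cluster member closest to the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return min(cluster, key=lambda v: abs(v - mean))


if __name__ == "__main__":
    distances = [0.01, 0.02, 0.03, 0.20, 0.22, 0.50]
    for members in gap_clusters(distances, min_gap=0.1):
        print(members, "centroid:", centroid(members))
```

A single pass over sorted values is linear in the number of sequences, which is in the spirit of the scaling claim in the abstract, although the real method operates on MSA-derived distance matrices rather than a flat list.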
7
Helal M, Kong F, Chen SCA, Bain M, Christen R, Sintchenko V. Defining reference sequences for Nocardia species by similarity and clustering analyses of 16S rRNA gene sequence data. PLoS One 2011; 6:e19517. [PMID: 21687706 PMCID: PMC3110597 DOI: 10.1371/journal.pone.0019517] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 10/29/2010] [Accepted: 04/08/2011] [Indexed: 01/08/2023]
Abstract
Background: The intra- and inter-species genetic diversity of bacteria and the absence of ‘reference’, or the most representative, sequences of individual species present a significant challenge for sequence-based identification. The aims of this study were to determine the utility, and compare the performance of several clustering and classification algorithms to identify the species of 364 sequences of 16S rRNA gene with a defined species in GenBank, and 110 sequences of 16S rRNA gene with no defined species, all within the genus Nocardia. Methods: A total of 364 16S rRNA gene sequences of Nocardia species were studied. In addition, 110 16S rRNA gene sequences assigned only to the Nocardia genus level at the time of submission to GenBank were used for machine learning classification experiments. Different clustering algorithms were compared with a novel algorithm or the linear mapping (LM) of the distance matrix. Principal Components Analysis was used for the dimensionality reduction and visualization. Results: The LM algorithm achieved the highest performance and classified the set of 364 16S rRNA sequences into 80 clusters, the majority of which (83.52%) corresponded with the original species. The most representative 16S rRNA sequences for individual Nocardia species have been identified as ‘centroids’ in respective clusters from which the distances to all other sequences were minimized; 110 16S rRNA gene sequences with identifications recorded only at the genus level were classified using machine learning methods. Simple kNN machine learning demonstrated the highest performance and classified Nocardia species sequences with an accuracy of 92.7% and a mean frequency of 0.578. Conclusion: The identification of centroids of 16S rRNA gene sequence clusters using novel distance matrix clustering enables the identification of the most representative sequences for each individual species of Nocardia and allows the quantitation of inter- and intra-species variability.
Affiliation(s)
- Manal Helal
- Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
- Centre for Infectious Diseases and Microbiology, Westmead Hospital, Sydney West Area Health Service, Sydney, New South Wales, Australia
- Fanrong Kong
- Centre for Infectious Diseases and Microbiology, Westmead Hospital, Sydney West Area Health Service, Sydney, New South Wales, Australia
- Sharon C. A. Chen
- Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
- Centre for Infectious Diseases and Microbiology, Westmead Hospital, Sydney West Area Health Service, Sydney, New South Wales, Australia
- Michael Bain
- School of Computer Science and Engineering, University of New South Wales, Sydney, New South Wales, Australia
- Richard Christen
- University of Nice Sophia-Antipolis, and CNRS UMR6543, Parc Valrose, Centre de Biochimie, Nice, France
- Vitali Sintchenko
- Sydney Medical School, The University of Sydney, Sydney, New South Wales, Australia
- Centre for Infectious Diseases and Microbiology, Westmead Hospital, Sydney West Area Health Service, Sydney, New South Wales, Australia
8
Mitra S, Rupek P, Richter DC, Urich T, Gilbert JA, Meyer F, Wilke A, Huson DH. Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics 2011; 12 Suppl 1:S21. [PMID: 21342551 PMCID: PMC3044276 DOI: 10.1186/1471-2105-12-s1-s21] [Citation(s) in RCA: 99] [Impact Index Per Article: 7.6] [Indexed: 01/07/2023]
Abstract
Background: Metagenomics is the study of microbial organisms using sequencing applied directly to environmental samples. Technological advances in next-generation sequencing methods are fueling a rapid increase in the number and scope of metagenome projects. While metagenomics provides information on the gene content, metatranscriptomics aims at understanding gene expression patterns in microbial communities. The initial computational analysis of a metagenome or metatranscriptome addresses three questions: (1) Who is out there? (2) What are they doing? and (3) How do different datasets compare? There is a need for new computational tools to answer these questions. In 2007, the program MEGAN (MEtaGenome ANalyzer) was released, as a standalone interactive tool for analyzing the taxonomic content of a single metagenome dataset. The program has subsequently been extended to support comparative analyses of multiple datasets. Results: The focus of this paper is to report on new features of MEGAN that allow the functional analysis of multiple metagenomes (and metatranscriptomes) based on the SEED hierarchy and KEGG pathways. We have compared our results with the MG-RAST service for different datasets. Conclusions: The MEGAN program now allows the interactive analysis and comparison of the taxonomical and functional content of multiple datasets. As a stand-alone tool, MEGAN provides an alternative to web portals for scientists who have concerns about uploading their unpublished data to a website.
Affiliation(s)
- Suparna Mitra
- Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen, Germany.
9
Molecular signatures for the Crenarchaeota and the Thaumarchaeota. Antonie van Leeuwenhoek 2010; 99:133-57. [PMID: 20711675 DOI: 10.1007/s10482-010-9488-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 06/04/2010] [Accepted: 07/26/2010] [Indexed: 10/19/2022]
Abstract
Crenarchaeotes found in mesophilic marine environments were recently placed into a new phylum of Archaea called the Thaumarchaeota. However, very few molecular characteristics of this new phylum are currently known that can be used to distinguish its members from the Crenarchaeota. In addition, their relationships to deep-branching archaeal lineages are unclear. We report here detailed analyses of protein sequences from Crenarchaeota and Thaumarchaeota that have identified many conserved signature indels (CSIs) and signature proteins (SPs) (i.e., proteins for which all significant blast hits are from these groups) that are specific for these archaeal groups. Of the identified signatures, 6 CSIs and 13 SPs are specific for the Crenarchaeota phylum; 6 CSIs and >250 SPs are uniquely found in various Thaumarchaeota (viz. Cenarchaeum symbiosum, Nitrosopumilus maritimus and a number of uncultured marine crenarchaeotes); and 3 CSIs and ~10 SPs are found in both Thaumarchaeota and Crenarchaeota species. Some of the molecular signatures are also present in Korarchaeum cryptofilum, which forms the independent phylum Korarchaeota. Although some of these molecular signatures suggest a distant shared ancestry between Thaumarchaeota and Crenarchaeota, our identification of large numbers of Thaumarchaeota-specific proteins and their deep branching between the Crenarchaeota and Euryarchaeota phyla in phylogenetic trees shows that they are distinct from both Crenarchaeota and Euryarchaeota in both genetic and phylogenetic terms. These observations support the placement of marine mesophilic archaea into the separate phylum Thaumarchaeota. Additionally, many CSIs and SPs have been found that are specific for different orders within Crenarchaeota (viz. Sulfolobales: 3 CSIs and 169 SPs; Thermoproteales: 5 CSIs and 25 SPs; Desulfurococcales: 4 SPs; and Sulfolobales and Desulfurococcales: 2 CSIs and 18 SPs). The signatures described here provide novel means for distinguishing the Crenarchaeota and the Thaumarchaeota and for the classification of related and novel species in different environments. Functional studies on these signature proteins could lead to discovery of novel biochemical properties that are unique to these groups of archaea.
10
Genome analysis of Moraxella catarrhalis strain BBH18, [corrected] a human respiratory tract pathogen. J Bacteriol 2010; 192:3574-83. [PMID: 20453089 DOI: 10.1128/jb.00121-10] [Citation(s) in RCA: 73] [Impact Index Per Article: 5.2] [Indexed: 11/20/2022]
Abstract
Moraxella catarrhalis is an emerging human-restricted respiratory tract pathogen that is a common cause of childhood otitis media and exacerbations of chronic obstructive pulmonary disease in adults. Here, we report the first completely assembled and annotated genome sequence of an isolate of M. catarrhalis, strain RH4, which originally was isolated from blood of an infected patient. The RH4 genome consists of 1,863,286 nucleotides that form 1,886 protein-encoding genes. Comparison of the RH4 genome to the ATCC 43617 contigs demonstrated that the gene content of both strains is highly conserved. In silico phylogenetic analyses based on both 16S rRNA and multilocus sequence typing revealed that RH4 belongs to the seroresistant lineage. We were able to identify almost the entire repertoire of known M. catarrhalis virulence factors and mapped the members of the biosynthetic pathways for lipooligosaccharide, peptidoglycan, and type IV pili. Reconstruction of the central metabolic pathways suggested that RH4 relies on fatty acid and acetate metabolism, as the genes encoding the enzymes required for the glyoxylate pathway, the tricarboxylic acid cycle, the gluconeogenic pathway, the nonoxidative branch of the pentose phosphate pathway, the beta-oxidation pathway of fatty acids, and acetate metabolism were present. Moreover, pathways important for survival under challenging in vivo conditions, such as the iron-acquisition pathways, nitrogen metabolism, and oxidative stress responses, were identified. Finally, we showed by microarray expression profiling that approximately 88% of the predicted coding sequences are transcribed under in vitro conditions. Overall, these results provide a foundation for future research into the mechanisms of M. catarrhalis pathogenesis and vaccine development.
11
Abstract
BACKGROUND: Metagenomics is the study of the genomic content of an environmental sample of microbes. Advances in the throughput and cost-efficiency of sequencing technology are fueling a rapid increase in the number and size of metagenomic datasets being generated. Bioinformatics is faced with the problem of how to handle and analyze these datasets in an efficient and useful way. One goal of these metagenomic studies is to get a basic understanding of the microbial world both surrounding us and within us. One major challenge is how to compare multiple datasets. Furthermore, there is a need for bioinformatics tools that can process many large datasets and are easy to use. RESULTS: This article describes two new and helpful techniques for comparing multiple metagenomic datasets. The first is a visualization technique for multiple datasets and the second is a new statistical method for highlighting the differences in a pairwise comparison. We have developed implementations of both methods that are suitable for very large datasets and provide these in Version 3 of our standalone metagenome analysis tool MEGAN. CONCLUSION: These new methods are suitable for the visual comparison of many large metagenomes and the statistical comparison of two metagenomes at a time. Nevertheless, more work needs to be done to support the comparative analysis of multiple metagenome datasets. AVAILABILITY: Version 3 of MEGAN, which implements all ideas presented in this article, can be obtained from our web site at: www-ab.informatik.uni-tuebingen.de/software/megan. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Suparna Mitra
- Center for Bioinformatics ZBIT, Tübingen University, Sand 14, 72076 Tübingen, Germany.
12
Abstract
Gene content has been shown to contain a strong phylogenetic signal, yet its usage for phylogenetic questions is hampered by horizontal gene transfer and parallel gene loss and until now required completely sequenced genomes. Here, we introduce an approach that allows the phylogenetic signal in gene content to be applied to any set of sequences, using signature genes for phylogenetic classification. The hundreds of publicly available genomes allow us to identify signature genes at various taxonomic depths, and we show how the presence of signature genes in an unspecified sample can be used to characterize its taxonomic composition. We identify 8,362 signature genes specific for 112 prokaryotic taxa. We show that these signature genes can be used to address phylogenetic questions on the basis of gene content in cases where classic gene content or sequence analyses provide an ambiguous answer, such as for Nanoarchaeum equitans, and even in cases where complete genomes are not available, such as for metagenomics data. Cross-validation experiments leaving out up to 30% of the species show that ∼92% of the signature genes correctly place the species in a related clade. Analyses of metagenomics data sets with the signature gene approach are in good agreement with the previously reported species distributions based on phylogenetic analysis of marker genes. Summarizing, signature genes can complement traditional sequence-based methods in addressing taxonomic questions.
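To make the signature-gene idea in this abstract concrete, here is a minimal sketch under a strict, assumed definition: a gene is a signature for a taxon if it occurs in every reference genome of that taxon and in no reference genome outside it. The published approach may use different criteria; the names `signature_genes`, `count_signature_hits`, `genomes` and `taxon_of` are hypothetical and exist only for this illustration.

```python
from typing import Dict, Set


def signature_genes(
    genomes: Dict[str, Set[str]], taxon_of: Dict[str, str]
) -> Dict[str, Set[str]]:
    """Return, per taxon, genes found in all its genomes and in no others.

    Assumed strict definition for illustration; every genome in `genomes`
    must have an entry in `taxon_of`.
    """
    signatures: Dict[str, Set[str]] = {}
    for taxon in set(taxon_of.values()):
        inside = [genes for g, genes in genomes.items() if taxon_of[g] == taxon]
        outside = [genes for g, genes in genomes.items() if taxon_of[g] != taxon]
        shared = set.intersection(*inside)       # present in every member
        seen_elsewhere = set().union(*outside)   # present in any non-member
        signatures[taxon] = shared - seen_elsewhere
    return signatures


def count_signature_hits(
    sample_genes: Set[str], signatures: Dict[str, Set[str]]
) -> Dict[str, int]:
    """Count how many signature genes of each taxon occur in a sample."""
    return {taxon: len(sample_genes & sig) for taxon, sig in signatures.items()}


if __name__ == "__main__":
    genomes = {
        "A1": {"g1", "g2", "g3"},
        "A2": {"g1", "g2", "g4"},
        "B1": {"g2", "g5"},
    }
    taxon_of = {"A1": "A", "A2": "A", "B1": "B"}
    sigs = signature_genes(genomes, taxon_of)
    print(sigs)
    print(count_signature_hits({"g1", "g9"}, sigs))
```

Counting hits per taxon gives a rough taxonomic profile of an unlabelled sample such as a metagenome; in practice gene presence would be decided by sequence similarity searches rather than exact identifier matches.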
Affiliation(s)
- Bas E Dutilh
- Center for Molecular and Biomolecular Informatics/Nijmegen Center for Molecular Life Sciences, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands