1
Gao Y, Ma B, Xu Q, Peng Y, Gong H, Guan A, Hua K, Langford PR, Jin H, Luo R. Spatial proximity and gene function: a new dimension in prokaryotic gene association network analysis with 3D-GeneNet. Brief Bioinform 2024; 25:bbae320. [PMID: 38975892 PMCID: PMC11229033 DOI: 10.1093/bib/bbae320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 05/22/2024] [Accepted: 06/18/2024] [Indexed: 07/09/2024]
Abstract
Understanding the biological functions and processes of genes, particularly those not yet characterized, is crucial for advancing molecular biology and identifying therapeutic targets. The hypothesis guiding this study is that the 3D proximity of genes correlates with their functional interactions and relevance in prokaryotes. We introduced 3D-GeneNet, a software tool that uses high-throughput sequencing data from chromosome conformation capture techniques and integrates topological metrics to construct gene association networks. Through a series of comparative analyses of spatial versus linear distances in the model organism Escherichia coli K-12, we examined topological structure, functional enrichment levels, the distribution of linear distances among gene pairs, and the area under the receiver operating characteristic curve. Furthermore, 3D-GeneNet was shown to maintain good accuracy compared with multiple algorithms (neighbourhood, co-occurrence, coexpression, and fusion) across multiple bacteria, including E. coli, Brucella abortus, and Vibrio cholerae. In addition, the accuracy of 3D-GeneNet's predictions of long-distance gene interactions was assessed by bacterial two-hybrid assays in E. coli K-12 MG1655, where 3D-GeneNet not only tripled the accuracy of the linear genomic distance approach but also achieved 60% accuracy when run alone. Finally, we conclude that 3D-GeneNet should be applicable, through Hi-C sequencing and analysis, to various bacterial forms, including Gram-negative, Gram-positive, single-chromosome, and multi-chromosome bacteria. Such findings highlight the broad applicability and significant promise of this method for gene association network analysis. 3D-GeneNet is freely accessible at https://github.com/gaoyuanccc/3D-GeneNet.
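The comparison of spatial and linear distance described in this abstract can be illustrated with a small Python sketch. It is not the 3D-GeneNet implementation: the function names, the toy contact matrix, the gene-to-bin mapping and the `min_contact` threshold are all hypothetical, and the published tool also integrates topological metrics that are not modelled here. The sketch only shows how gene pairs might be linked when their Hi-C bins exceed a contact threshold, and how the linear distance between two positions on a circular chromosome can be computed for comparison.

```python
def circular_distance(pos_a, pos_b, genome_length):
    """Shortest distance (bp) between two positions on a circular chromosome."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


def contact_edges(contacts, gene_bins, min_contact):
    """Link gene pairs whose Hi-C bins have at least `min_contact` contacts.

    contacts    -- square list of lists of bin-to-bin contact counts (hypothetical input)
    gene_bins   -- dict mapping gene name -> index of the bin containing the gene
    min_contact -- illustrative threshold, not a value taken from the paper
    """
    names = sorted(gene_bins)
    edges = []
    for i, gene_a in enumerate(names):
        for gene_b in names[i + 1:]:
            bin_a, bin_b = gene_bins[gene_a], gene_bins[gene_b]
            if bin_a == bin_b:
                continue  # same bin: spatial contact is not informative here
            score = contacts[bin_a][bin_b]
            if score >= min_contact:
                edges.append((gene_a, gene_b, score))
    return edges


if __name__ == "__main__":
    toy_contacts = [[0, 5, 1], [5, 0, 9], [1, 9, 0]]
    toy_bins = {"x": 0, "y": 1, "z": 2}
    for edge in contact_edges(toy_contacts, toy_bins, min_contact=5):
        print(edge)
```

In a sketch like this, pairs that are far apart by `circular_distance` but still pass the contact threshold would be the long-distance candidates that linear-distance methods cannot propose.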
Affiliation(s)
- Yuan Gao
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Bin Ma
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Qianshuai Xu
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Yuna Peng
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Huimin Gong
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Aohan Guan
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Kexin Hua
- Swine Genome and Breeding Team, Yazhouwan National Laboratory, No. 8 Huanjin Road, Yazhou District, Sanya City, Hainan Province 572024, China
- Paul R Langford
- Section of Paediatric Infectious Disease, Imperial College London, St Mary's Campus, Norfolk Place, London W2 1PG, United Kingdom
- Hui Jin
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Rui Luo
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- College of Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
- Hubei Provincial Key Laboratory of Preventive Veterinary Medicine, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, Hubei, China
2
Gumerov VM, Zhulin IB. TREND: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses. Nucleic Acids Res 2020; 48:W72-W76. [PMID: 32282909 PMCID: PMC7319448 DOI: 10.1093/nar/gkaa243] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 02/12/2020] [Revised: 03/16/2020] [Accepted: 04/01/2020] [Indexed: 01/16/2023]
Abstract
Key steps in a computational study of protein function involve analysis of (i) relationships between homologous proteins, (ii) protein domain architecture and (iii) the gene neighborhoods in which the corresponding proteins are encoded. Each of these steps requires a separate computational task and set of tools. Currently, to relate protein features and gene neighborhood information to phylogeny, researchers need to prepare all the necessary data and combine them by hand, which is time-consuming and error-prone. Here, we present a new platform, TREND (tree-based exploration of neighborhoods and domains), which can perform all the necessary steps in an automated fashion and put the derived information into phylogenomic context, thus making evolution-based protein function analysis more efficient. A rich set of adjustable components allows users to run the computational steps specific to their task. TREND is freely available at http://trend.zhulinlab.org.
Affiliation(s)
- Vadim M Gumerov
- Department of Microbiology and Translational Data Analytics Institute, The Ohio State University, Columbus, OH, USA
- Igor B Zhulin
- Department of Microbiology and Translational Data Analytics Institute, The Ohio State University, Columbus, OH, USA
3
Bhatt V, Mohapatra A, Anand S, Kuntal BK, Mande SS. FLIM-MAP: Gene Context Based Identification of Functional Modules in Bacterial Metabolic Pathways. Front Microbiol 2018; 9:2183. [PMID: 30283416 PMCID: PMC6157337 DOI: 10.3389/fmicb.2018.02183] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/08/2018] [Accepted: 08/24/2018] [Indexed: 01/18/2023]
Abstract
The functional potential of a bacterium can only be ascertained by accurate annotation of its metabolic pathways. Homology-based methods decipher metabolic gene content but ignore the fact that homologs of the same protein can function in different pathways. Therefore, the mere presence of all constituent genes in an organism is not sufficient to indicate a pathway. The contextual occurrence of genes belonging to a pathway on the bacterial genome can hence be exploited for an accurate estimation of the functional potential of a bacterium. In this communication, we present a novel annotation resource to accurately identify pathway presence by using gene context. Our tool FLIM-MAP (Functionally Important Modules in bacterial Metabolic Pathways) predicts biologically relevant functional units called ‘GCMs’ (Gene Context based Modules) from a given metabolic reaction network. We benchmark the accuracy of our tool on amino acid and carbohydrate metabolism pathways.
Affiliation(s)
- Vineet Bhatt
- Bio-Sciences R&D Division, TCS Research, Tata Consultancy Services Ltd., Pune, India
- Anwesha Mohapatra
- Bio-Sciences R&D Division, TCS Research, Tata Consultancy Services Ltd., Pune, India
- Swadha Anand
- Bio-Sciences R&D Division, TCS Research, Tata Consultancy Services Ltd., Pune, India
- Bhusan K Kuntal
- Bio-Sciences R&D Division, TCS Research, Tata Consultancy Services Ltd., Pune, India; Chemical Engineering and Process Development Division, CSIR-National Chemical Laboratory (NCL), Pune, India; Academy of Scientific and Innovative Research (AcSIR), CSIR-National Chemical Laboratory, Pune, India
- Sharmila S Mande
- Bio-Sciences R&D Division, TCS Research, Tata Consultancy Services Ltd., Pune, India
4
Crawley AB, Barrangou R. Conserved Genome Organization and Core Transcriptome of the Lactobacillus acidophilus Complex. Front Microbiol 2018; 9:1834. [PMID: 30150974 PMCID: PMC6099100 DOI: 10.3389/fmicb.2018.01834] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/19/2018] [Accepted: 07/23/2018] [Indexed: 01/08/2023]
Abstract
The Lactobacillus genus encompasses a genetically and functionally diverse group of species, and contains many strains widely formulated in the human food supply chain as probiotics and starter cultures. Within this genetically expansive group, there are several distinct clades that have high levels of homology, one of which is the Lactobacillus acidophilus group. Among the uniting features, small genomes, low GC content, adaptation to dairy environments, and fastidious growth requirements are some of the most defining characteristics of this group. To better understand what truly links and defines this clade, we sought to characterize the genomic organization and content of the genomes of several members of this group. Through core genome analysis we explored the synteny and intrinsic genetic underpinnings of the L. acidophilus clade, and observed key features related to the evolution and adaptation of these organisms. While genetic content is able to provide a large map of the potential of each organism, it does not always reflect their functionality. Through transcriptomic data we inferred the core transcriptome of the L. acidophilus complex to better define the true metabolic capabilities that unite this clade. Using this approach we have identified seven small ORFs that are both highly conserved and transcribed in diverse members of this clade and could be potential novel small peptide or untranslated RNA regulators. Overall, our results reveal the core features of the L. acidophilus complex and open new avenues for the enhancement and formulation of next-generation probiotics and starter cultures.
Affiliation(s)
- Alexandra B Crawley
- Genomic Sciences Program, NC State University, Raleigh, NC, United States; Department of Food, Bioprocessing and Nutrition Sciences, NC State University, Raleigh, NC, United States
- Rodolphe Barrangou
- Genomic Sciences Program, NC State University, Raleigh, NC, United States; Department of Food, Bioprocessing and Nutrition Sciences, NC State University, Raleigh, NC, United States
5
Abstract
The study of evolutionary relationships among protein sequences was one of the first applications of bioinformatics. Since then, and accompanying the wealth of biological data produced by genome sequencing and other high-throughput techniques, the use of bioinformatics in general and phylogenetics in particular has been gaining ground in the study of protein and proteome evolution. Nowadays, the use of phylogenetics is instrumental not only to infer the evolutionary relationships among species and their genome sequences, but also to reconstruct ancestral states of proteins and proteomes and hence trace the paths followed by evolution. Here I survey recent progress in the elucidation of mechanisms of protein and proteome evolution in which phylogenetics has played a determinant role.
Affiliation(s)
- Toni Gabaldón
- Bioinformatics Department, Centro de Investigación Principe Felipe
6
Bouyioukos C, Elati M, Képès F. Analysis tools for the interplay between genome layout and regulation. BMC Bioinformatics 2016; 17 Suppl 5:191. [PMID: 27294345 PMCID: PMC4905612 DOI: 10.1186/s12859-016-1047-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Genome layout and gene regulation appear to be interdependent. Understanding this interdependence is key to exploring the dynamic nature of chromosome conformation and to engineering functional genomes. Evidence for non-random genome layout, defined as the relative positioning of either co-functional or co-regulated genes, stems from two main approaches. Firstly, the analysis of contiguous genome segments across species has highlighted the conservation of gene arrangement (synteny) along chromosomal regions. Secondly, the study of long-range interactions along a chromosome has emphasised regularities in the positioning of microbial genes that are co-regulated, co-expressed or evolutionarily correlated. While one-dimensional pattern analysis is a mature field, it is often powerless on biological datasets, which tend to be incomplete and partly incorrect. Moreover, there is a lack of comprehensive, user-friendly tools to systematically analyse, visualise, integrate and exploit regularities along genomes. RESULTS: Here we present the Genome REgulatory and Architecture Tools SCAN (GREAT:SCAN) software for the systematic study of the interplay between genome layout and gene expression regulation. GREAT:SCAN is a collection of related and interconnected applications currently able to perform systematic analyses of genome regularities as well as to improve transcription factor binding site (TFBS) and gene regulatory network predictions based on gene positional information. CONCLUSIONS: We demonstrate the capabilities of these tools by studying, on the one hand, the regular patterns of genome layout in the major regulons of the bacterium Escherichia coli. On the other hand, we demonstrate their capability to improve TFBS prediction in microbes. Finally, we highlight, by visualisation of multivariate techniques, the interplay between position and sequence information for effective transcription regulation.
Affiliation(s)
- Costas Bouyioukos
- Institute of Systems and Synthetic Biology (iSSB), Genopole, CNRS, Université d’Évry Val d’Essonne, Évry, France
- Mohamed Elati
- Institute of Systems and Synthetic Biology (iSSB), Genopole, CNRS, Université d’Évry Val d’Essonne, Évry, France
- François Képès
- Institute of Systems and Synthetic Biology (iSSB), Genopole, CNRS, Université d’Évry Val d’Essonne, Évry, France
- Department of BioEngineering, Imperial College London, London, United Kingdom
7
Bouyioukos C, Bucchini F, Elati M, Képès F. GREAT: a web portal for Genome Regulatory Architecture Tools. Nucleic Acids Res 2016; 44:W77-82. [PMID: 27151196 PMCID: PMC4987929 DOI: 10.1093/nar/gkw384] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/23/2016] [Accepted: 04/26/2016] [Indexed: 11/15/2022]
Abstract
GREAT (Genome REgulatory Architecture Tools) is a novel web portal for tools designed to generate user-friendly and biologically useful analysis of genome architecture and regulation. The online tools of GREAT are freely accessible and compatible with essentially any operating system that runs a modern browser. GREAT is based on the analysis of genome layout, defined as the respective positioning of co-functional genes, and its relation with chromosome architecture and gene expression. GREAT tools allow users to systematically detect regular patterns along co-functional genomic features in an automatic way consisting of three individual steps and respective interactive visualizations. In addition to the complete analysis of regularities, GREAT tools enable the use of periodicity and position information for improving the prediction of transcription factor binding sites using a multi-view machine learning approach. The outcome of this integrative approach features a multivariate analysis of the interplay between the location of a gene and its regulatory sequence. GREAT results are plotted in interactive web graphs and are available for download either as individual plots, self-contained interactive pages or as machine-readable tables for downstream analysis. The GREAT portal can be reached at https://absynth.issb.genopole.fr/GREAT and each individual GREAT tool is available for downloading.
Affiliation(s)
- Costas Bouyioukos
- iSSB, CNRS, Genopole, UEVE, Université Paris-Saclay, 5 rue Henri Desbruères, Évry 91030 Cedex, France
- François Bucchini
- iSSB, CNRS, Genopole, UEVE, Université Paris-Saclay, 5 rue Henri Desbruères, Évry 91030 Cedex, France
- Mohamed Elati
- iSSB, CNRS, Genopole, UEVE, Université Paris-Saclay, 5 rue Henri Desbruères, Évry 91030 Cedex, France
- François Képès
- iSSB, CNRS, Genopole, UEVE, Université Paris-Saclay, 5 rue Henri Desbruères, Évry 91030 Cedex, France
8
Abstract
BACKGROUND: Genes occurring co-localized in multiple genomes can be strong indicators of either functional constraints on genome organization or remnant ancestral gene order. The computational detection of these patterns, which are usually referred to as gene clusters, has become increasingly sensitive over the past decade. The most powerful approaches allow for various types of imperfect cluster conservation: cluster locations may be internally rearranged; the individual cluster locations may contain only a subset of the cluster genes and may be disrupted by uninvolved genes; moreover, cluster locations may not occur at all in some or even most of the studied genomes. The detection of such low-quality clusters increases the risk of mistaking faint patterns that occur merely by chance for genuine findings. Therefore, it is crucial to estimate the significance of computational gene cluster predictions and discriminate between true conservation and coincidental clustering. RESULTS: In this paper, we present an efficient and accurate approach to estimate the significance of gene cluster predictions under the approximate common intervals model. Given a single gene cluster prediction, we calculate the probability of observing it with the same or a higher degree of conservation under the null hypothesis of random gene order, and add a correction factor to account for multiple testing. Our approach considers all parameters that define the quality of gene cluster conservation: the number of genomes in which the cluster occurs, the number of involved genes, the degree of conservation in the different genomes, as well as the frequency of the clustered genes within each genome. We apply our approach to evaluate gene cluster predictions in a large set of well-annotated genomes.
9
Galperin MY, Koonin EV. Comparative Genomics Approaches to Identifying Functionally Related Genes. ALGORITHMS FOR COMPUTATIONAL BIOLOGY 2014. [DOI: 10.1007/978-3-319-07953-0_1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/03/2023]
10
Abstract
Next-generation sequencing projects continue to drive a vast accumulation of metagenomic sequence data. Given the growth rate of this data, automated approaches to functional annotation are indispensable and a cornerstone heuristic of many computational protocols is the concept of guilt by association. The guilt by association paradigm has been heavily exploited by genomic context methods that offer functional predictions that are complementary to homology-based annotations, thereby offering a means to extend functional annotation. In particular, operon methods that exploit co-directional intergenic distances can provide homology-free functional annotation through the transfer of functions among co-operonic genes, under the assumption that guilt by association is indeed applicable. Although guilt by association is a well-accepted annotative device, its applicability to metagenomic functional annotation has not been definitively demonstrated. Here a large-scale assessment of metagenomic guilt by association is undertaken where functional associations are predicted on the basis of co-directional intergenic distances. Specifically, functional annotations are compared within pairs of adjacent co-directional genes, as well as operons of various lengths (i.e. number of member genes), in order to reveal new information about annotative cohesion versus operon length. The results suggest that co-directional gene pairs offer reduced confidence for metagenomic guilt by association due to difficulty in resolving the existence of functional associations when intergenic distance is the sole predictor of pairwise gene interactions. However, metagenomic operons, particularly those with substantial lengths, appear to be capable of providing a superior basis for metagenomic guilt by association due to increased annotative stability. The need for improved recognition of metagenomic operons is discussed, as well as the limitations of the present work.
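The idea of predicting operons from co-directional intergenic distances can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the `Gene` record, the `predict_operons` function and the 50 bp `max_gap` default are hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def predict_operons(genes: List[Gene], max_gap: int = 50) -> List[List[Gene]]:
    """Group adjacent same-strand genes separated by at most `max_gap` bp.

    `max_gap` is an illustrative threshold, not a value from the paper.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    operons: List[List[Gene]] = []
    current: List[Gene] = []
    for gene in ordered:
        if current:
            prev = current[-1]
            gap = gene.start - prev.end - 1  # negative when genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                current.append(gene)
                continue
            operons.append(current)  # close the running operon
        current = [gene]
    if current:
        operons.append(current)
    return operons
```

Under guilt by association, an annotation known for one member of a predicted group would then be proposed for its unannotated neighbours; the abstract's finding is that such transfers look more reliable for longer operons than for isolated gene pairs.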
Affiliation(s)
- Gregory Vey
- Department of Biology, University of Waterloo, Waterloo, Ontario, Canada.
11
Cohen O, Ashkenazy H, Levy Karin E, Burstein D, Pupko T. CoPAP: Coevolution of presence-absence patterns. Nucleic Acids Res 2013; 41:W232-7. [PMID: 23748951 PMCID: PMC3692100 DOI: 10.1093/nar/gkt471] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Indexed: 11/22/2022]
Abstract
Evolutionary analysis of phyletic patterns (phylogenetic profiles) is widely used in biology, representing presence or absence of characters such as genes, restriction sites, introns, indels and methylation sites. The phyletic pattern observed in extant genomes is the result of ancestral gain and loss events along the phylogenetic tree. Here we present CoPAP (coevolution of presence–absence patterns), a user-friendly web server, which performs accurate inference of coevolving characters as manifested by co-occurring gains and losses. CoPAP uses state-of-the-art probabilistic methodologies to infer coevolution and allows for advanced network analysis and visualization. We developed a platform for comparing different algorithms that detect coevolution, which includes simulated data with pairs of coevolving sites and independent sites. Using these simulated data we demonstrate that CoPAP performance is higher than alternative methods. We exemplify CoPAP utility by analyzing coevolution among thousands of bacterial genes across 681 genomes. Clusters of coevolving genes that were detected using our method largely coincide with known biosynthesis pathways and cellular modules, thus exhibiting the capability of CoPAP to infer biologically meaningful interactions. CoPAP is freely available for use at http://copap.tau.ac.il/.
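To make the notion of a phyletic pattern concrete, the following Python sketch represents each gene as a presence/absence vector across genomes and ranks gene pairs by Jaccard similarity. This is a deliberately simpler stand-in and not the CoPAP method: CoPAP infers co-occurring gain and loss events probabilistically along a phylogenetic tree, whereas this comparison ignores the phylogeny entirely. The function names and toy profiles are hypothetical.

```python
from itertools import combinations


def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sequences of 0/1)."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Return (gene_a, gene_b, similarity) tuples, most similar pairs first.

    profiles -- dict mapping gene name -> presence/absence vector, with every
                vector ordered over the same list of genomes.
    """
    scored = [
        (a, b, jaccard(profiles[a], profiles[b]))
        for a, b in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, key=lambda item: -item[2])
```

A naive score like this treats every genome as an independent observation, which is exactly the weakness that tree-aware methods address: closely related genomes share patterns through common ancestry rather than through coevolution.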
Affiliation(s)
- Ofir Cohen
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Ramat Aviv 69978, Israel
12
Abstract
MOTIVATION: Correlated events of gains and losses enable inference of co-evolution relations. The reconstruction of the co-evolutionary interactions network in prokaryotic species may elucidate functional associations among genes. RESULTS: We developed a novel probabilistic methodology for the detection of co-evolutionary interactions between pairs of genes. Using this method we inferred the co-evolutionary network among 4593 Clusters of Orthologous Genes (COGs). The number of co-evolutionary interactions substantially differed among COGs. Over 40% were found to co-evolve with at least one partner. We partitioned the network of co-evolutionary relations into clusters and uncovered multiple modular assemblies of genes with clearly defined functions. Finally, we measured the extent to which co-evolutionary relations coincide with other cellular relations such as genomic proximity, gene fusion propensity, co-expression, protein-protein interactions and metabolic connections. Our results show that co-evolutionary relations only partially overlap with these other types of networks. Our results suggest that the inferred co-evolutionary network in prokaryotes is highly informative towards revealing functional relations among genes, often showing signals that cannot be extracted from other network types. AVAILABILITY AND IMPLEMENTATION: Available under GPL license as open source. CONTACT: talp@post.tau.ac.il. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ofir Cohen
- Department of Cell Research and Immunology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv 69978, Israel
13
Jahn K. Efficient computation of approximate gene clusters based on reference occurrences. J Comput Biol 2012; 18:1255-74. [PMID: 21899430 DOI: 10.1089/cmb.2011.0132] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/12/2022]
Abstract
Whole genome comparison based on the analysis of gene cluster conservation has become a popular approach in comparative genomics. While gene order and gene content as a whole randomize over time, it is observed that certain groups of genes, which are often functionally related, remain co-located across species. However, the conservation is usually not perfect, which turns the identification of these structures, often referred to as approximate gene clusters, into a challenging task. In this article, we present an efficient set distance based approach that computes approximate gene clusters by means of reference occurrences. We show that it yields highly comparable results to the corresponding non-reference based approach, while its polynomial runtime allows for approximate gene cluster detection in parameter ranges that used to be feasible only with simpler, e.g., max-gap based, gene cluster models. To illustrate further the performance and predictive power of our algorithm, we compare it to a state-of-the-art approach for max-gap gene cluster computation.
Affiliation(s)
- Katharina Jahn
- AG Genominformatik, Technische Fakultät, Universität Bielefeld, Bielefeld, Germany.
14
Yelton AP, Thomas BC, Simmons SL, Wilmes P, Zemla A, Thelen MP, Justice N, Banfield JF. A semi-quantitative, synteny-based method to improve functional predictions for hypothetical and poorly annotated bacterial and archaeal genes. PLoS Comput Biol 2011; 7:e1002230. [PMID: 22028637 PMCID: PMC3197636 DOI: 10.1371/journal.pcbi.1002230] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 03/29/2011] [Accepted: 08/30/2011] [Indexed: 11/19/2022]
Abstract
During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence in all 634 available Archaeal and Bacterial genomes from the NCBI database and four newly assembled genomes of uncultivated Archaea from an acid mine drainage (AMD) community. In parallel, we established and modeled the trend between synteny and functional relatedness in the 118 genomes available in the STRING database. By combining these models, we developed a gene functional annotation method that weights evolutionary distance to estimate the probability of functional associations of syntenous proteins between genome pairs. The method was applied to the hypothetical proteins and poorly annotated genes in newly assembled acid mine drainage Archaeal genomes to add or improve gene annotations. This is the first method to assign possible functions to poorly annotated genes through quantification of the probability of gene functional relationships based on synteny at a significant evolutionary distance, and has the potential for broad application.
Affiliation(s)
- Alexis P. Yelton
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
- Brian C. Thomas
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
- Sheri L. Simmons
- Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
- Paul Wilmes
- Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
- Adam Zemla
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America
| | - Michael P. Thelen
- Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America
| | - Nicholas Justice
- Department of Plant and Microbial Biology, University of California, Berkeley, California, United States of America
| | - Jillian F. Banfield
- Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
- Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
15
Zhang Y, Gladyshev VN. Comparative Genomics of Trace Elements: Emerging Dynamic View of Trace Element Utilization and Function. Chem Rev 2009; 109:4828-61. [DOI: 10.1021/cr800557s] [Citation(s) in RCA: 99] [Impact Index Per Article: 6.6] [Indexed: 12/26/2022]
Affiliation(s)
- Yan Zhang
- Department of Biochemistry and Redox Biology Center, University of Nebraska, Lincoln, Nebraska 68588-0664
- Vadim N. Gladyshev
- Department of Biochemistry and Redox Biology Center, University of Nebraska, Lincoln, Nebraska 68588-0664
16
Koonin EV, Wolf YI. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 2008; 36:6688-719. [PMID: 18948295 PMCID: PMC2588523 DOI: 10.1093/nar/gkn668] [Citation(s) in RCA: 534] [Impact Index Per Article: 33.4] [Indexed: 01/04/2023] Open
Abstract
The first bacterial genome was sequenced in 1995, and the first archaeal genome in 1996. Soon after these breakthroughs, an exponential rate of genome sequencing was established, with a doubling time of approximately 20 months for bacteria and approximately 34 months for archaea. Comparative analysis of the hundreds of sequenced bacterial and dozens of archaeal genomes leads to several generalizations on the principles of genome organization and evolution. A crucial finding that enables functional characterization of the sequenced genomes and evolutionary reconstruction is that the majority of archaeal and bacterial genes have conserved orthologs in other, often, distant organisms. However, comparative genomics also shows that horizontal gene transfer (HGT) is a dominant force of prokaryotic evolution, along with the loss of genetic material resulting in genome contraction. A crucial component of the prokaryotic world is the mobilome, the enormous collection of viruses, plasmids and other selfish elements, which are in constant exchange with more stable chromosomes and serve as HGT vehicles. Thus, the prokaryotic genome space is a tightly connected, although compartmentalized, network, a novel notion that undermines the ‘Tree of Life’ model of evolution and requires a new conceptual framework and tools for the study of prokaryotic evolution.
Affiliation(s)
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
17
Gonzalez O, Zimmer R. Assigning functional linkages to proteins using phylogenetic profiles and continuous phenotypes. Bioinformatics 2008; 24:1257-63. [PMID: 18381403 DOI: 10.1093/bioinformatics/btn106] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION: A class of non-homology-based methods for protein function prediction relies on the assumption that genes linked to a phenotypic trait are preferentially conserved among organisms that share the trait. These methods typically compare pairs of binary strings, where one string encodes the phylogenetic distribution of a trait and the other of a protein. In this work, we extended the approach to automatically deal with continuous phenotypes. RESULTS: Rather than use a priori rules, which can be very subjective, to construct binary profiles from continuous phenotypes, we propose to systematically explore thresholds which can meaningfully separate the phenotype values. We illustrate our method by analyzing optimal growth temperatures, and demonstrate its usefulness by automatically retrieving genes which have been associated with thermophilic growth. We also apply the general approach, for the first time, to optimal growth pH, and make novel predictions. Finally, we show that our method can also be applied to other properties which may not be classically considered as phenotypes. Specifically, we studied correlations between genome size and the distribution of genes.
Affiliation(s)
- Orland Gonzalez
- Institute for Informatics, Ludwig-Maximilians-Universität München, Amalienstr. 17, 80333 Munich, Germany.
18
Gabaldón T. Computational approaches for the prediction of protein function in the mitochondrion. Am J Physiol Cell Physiol 2006; 291:C1121-8. [PMID: 16870830 DOI: 10.1152/ajpcell.00225.2006] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/22/2022]
Abstract
Understanding a complex biological system, such as the mitochondrion, requires the identification of the complete repertoire of proteins targeted to the organelle, the characterization of these, and finally, the elucidation of the functional and physical interactions that occur within the mitochondrion. In the last decade, significant developments have contributed to increase our understanding of the mitochondrion, and among these, computational research has played a significant role. Not only general bioinformatics tools have been applied in the context of the mitochondrion, but also some computational techniques have been specifically developed to address problems that arose from within the mitochondrial research field. In this review the contribution of bioinformatics to mitochondrial biology is addressed through a survey of current computational methods that can be applied to predict which proteins will be localized to the mitochondrion and to unravel their functional interactions.
Affiliation(s)
- Toni Gabaldón
- Bioinformatics Department, Centro de Investigación Príncipe Felipe, Valencia, Spain.
19
Ettema TJG, de Vos WM, van der Oost J. Discovering novel biology by in silico archaeology. Nat Rev Microbiol 2005; 3:859-69. [PMID: 16175172 DOI: 10.1038/nrmicro1268] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
Abstract
Archaea are prokaryotes that evolved in parallel with bacteria. Since the discovery of the distinct status of the Archaea, extensive physiological and biochemical research has been conducted to elucidate the molecular basis of their remarkable lifestyle and their unique biology. Here, we discuss how in-depth comparative genomics has been used to improve the annotation of archaeal genomes. Combined with experimental verification, bioinformatic analysis contributes to the ongoing discovery of novel metabolic conversions and control mechanisms, and as such to a better understanding of the intriguing biology of the Archaea.
Affiliation(s)
- Thijs J G Ettema
- Laboratory of Microbiology, Wageningen University, 6703 CT Wageningen, The Netherlands
20
Huynen MA, Gabaldón T, Snel B. Variation and evolution of biomolecular systems: Searching for functional relevance. FEBS Lett 2005; 579:1839-45. [PMID: 15763561 DOI: 10.1016/j.febslet.2005.02.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.5] [Received: 01/18/2005] [Revised: 01/18/2005] [Accepted: 02/01/2005] [Indexed: 11/29/2022]
Abstract
The availability of genome sequences and functional genomics data from multiple species enables us to compare the composition of biomolecular systems like biochemical pathways and protein complexes between species. Here, we review small- and large-scale, "genomics-based" approaches to biomolecular systems variation. In general, caution is required when comparing the results of bioinformatics analyses of genomes or of functional genomics data between species. Limitations to the sensitivity of sequence analysis tools and the noisy nature of genomics data tend to lead to systematic overestimates of the amount of variation. Nevertheless, the results from detailed manual analyses, and of large-scale analyses that filter out systematic biases, point to a large amount of variation in the composition of biomolecular systems. Such observations challenge our understanding of the function of the systems and their individual components and can potentially facilitate the identification and functional characterization of sub-systems within a system. Mapping the inter-species variation of complex biomolecular systems on a phylogenetic species tree allows one to reconstruct their evolution.
Affiliation(s)
- Martijn A Huynen
- Center for Molecular and Biomolecular Informatics, Nijmegen Center for Molecular Life Sciences, Radboud University Nijmegen Medical Center, P.O. Box 9010, 6500 GL Nijmegen, The Netherlands.
21
Zientz E, Dandekar T, Gross R. Metabolic interdependence of obligate intracellular bacteria and their insect hosts. Microbiol Mol Biol Rev 2004; 68:745-70. [PMID: 15590782 PMCID: PMC539007 DOI: 10.1128/mmbr.68.4.745-770.2004] [Citation(s) in RCA: 231] [Impact Index Per Article: 12.2] [Indexed: 01/17/2023] Open
Abstract
Mutualistic associations of obligate intracellular bacteria and insects have attracted much interest in the past few years due to the evolutionary consequences for their genome structure. However, much less attention has been paid to the metabolic ramifications for these endosymbiotic microorganisms, which have to compete with but also to adapt to another metabolism--that of the host cell. This review attempts to provide insights into the complex physiological interactions and the evolution of metabolic pathways of several mutualistic bacteria of aphids, ants, and tsetse flies and their insect hosts.
Affiliation(s)
- Evelyn Zientz
- Lehrstuhl für Mikrobiologie, Biozentrum der Universität Würzburg, Theodor-Boveri-Institut, Am Hubland, D-97074 Würzburg, Germany
22
Korbel JO, Jensen LJ, von Mering C, Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotechnol 2004; 22:911-7. [PMID: 15229555 DOI: 10.1038/nbt988] [Citation(s) in RCA: 136] [Impact Index Per Article: 7.2] [Indexed: 11/08/2022]
Abstract
Several widely used methods for predicting functional associations between proteins are based on the systematic analysis of genomic context. Efforts are ongoing to improve these methods and to search for novel aspects in genomes that could be exploited for function prediction. Here, we use gene expression data to demonstrate two functional implications of genome organization: first, chromosomal proximity indicates gene coregulation in prokaryotes independent of relative gene orientation; and second, adjacent bidirectionally transcribed genes (that is, 'divergently' organized coding regions) with conserved gene orientation are strongly coregulated. We further demonstrate that such bidirectionally transcribed gene pairs are functionally associated and derive from this a novel genomic context method that reliably predicts links between >2,500 pairs of genes in approximately 100 species. Around 650 of these functional associations are supported by other genomic context methods. In most instances, one gene encodes a transcriptional regulator, and the other a nonregulatory protein. In-depth analysis in Escherichia coli shows that the vast majority of these regulators both control transcription of the divergently transcribed target gene/operon and auto-regulate their own biosynthesis. The method thus enables the prediction of target processes and regulatory features for several hundred transcriptional regulators.
Affiliation(s)
- Jan O Korbel
- European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
23
Tasneem A, Iyer LM, Jakobsson E, Aravind L. Identification of the prokaryotic ligand-gated ion channels and their implications for the mechanisms and origins of animal Cys-loop ion channels. Genome Biol 2004; 6:R4. [PMID: 15642096 PMCID: PMC549065 DOI: 10.1186/gb-2004-6-1-r4] [Citation(s) in RCA: 191] [Impact Index Per Article: 9.6] [Received: 08/12/2004] [Revised: 10/26/2004] [Accepted: 11/24/2004] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Acetylcholine receptor type ligand-gated ion channels (ART-LGIC; also known as Cys-loop receptors) are a superfamily of proteins that include the receptors for major neurotransmitters such as acetylcholine, serotonin, glycine, GABA, glutamate and histamine, and for Zn2+ ions. They play a central role in fast synaptic signaling in animal nervous systems and so far have not been found outside of the Metazoa. RESULTS Using sensitive sequence-profile searches we have identified homologs of ART-LGICs in several bacteria and a single archaeal genus, Methanosarcina. The homology between the animal receptors and the prokaryotic homologs spans the entire length of the former, including both the ligand-binding and channel-forming transmembrane domains. A sequence-structure analysis using the structure of Lymnaea stagnalis acetylcholine-binding protein and the newly detected prokaryotic versions indicates the presence of at least one aromatic residue in the ligand-binding boxes of almost all representatives of the superfamily. Investigation of the domain architectures of the bacterial forms shows that they may often show fusions with other small-molecule-binding domains, such as the periplasmic binding protein superfamily I (PBP-I), Cache and MCP-N domains. Some of the bacterial forms also occur in predicted operons with the genes of the PBP-II superfamily and the Cache domains. Analysis of phyletic patterns suggests that the ART-LGICs are currently absent in all other eukaryotic lineages except animals. Moreover, phylogenetic analysis and conserved sequence motifs also suggest that a subset of the bacterial forms is closer to the metazoan forms. CONCLUSIONS From the information from the bacterial forms we infer that cation-pi or hydrophobic interactions with the ligand are likely to be a pervasive feature of the entire superfamily, even though the individual residues involved in the process may vary. 
The conservation pattern in the channel-forming transmembrane domains also suggests similar channel-gating mechanisms in the prokaryotic versions. From the distribution of charged residues in the prokaryotic M2 transmembrane segments, we expect that there will be examples of both cation and anion selectivity within the prokaryotic members. Contextual connections suggest that the prokaryotic forms may function as chemotactic receptors for low molecular weight solutes. The phyletic patterns and phylogenetic relationships suggest the possibility that the metazoan receptors emerged through an early lateral transfer from a prokaryotic source, before the divergence of extant metazoan lineages.
Affiliation(s)
- Asba Tasneem
- Beckman Institute, University of Illinois at Urbana-Champaign, 405 N Mathews Avenue, Urbana, IL 61801, USA
- Lakshminarayan M Iyer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eric Jakobsson
- Beckman Institute, University of Illinois at Urbana-Champaign, 405 N Mathews Avenue, Urbana, IL 61801, USA
- L Aravind
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
24
Iyer LM, Leipe DD, Koonin EV, Aravind L. Evolutionary history and higher order classification of AAA+ ATPases. J Struct Biol 2004; 146:11-31. [PMID: 15037234 DOI: 10.1016/j.jsb.2003.10.010] [Citation(s) in RCA: 608] [Impact Index Per Article: 30.4] [Received: 09/02/2003] [Revised: 10/08/2003] [Indexed: 12/29/2022]
Abstract
The AAA+ ATPases are enzymes containing a P-loop NTPase domain, and function as molecular chaperones, ATPase subunits of proteases, helicases or nucleic-acid-stimulated ATPases. All available sequences and structures of AAA+ protein domains were compared with the aim of identifying the definitive sequence and structure features of these domains and inferring the principal events in their evolution. An evolutionary classification of the AAA+ class was developed using standard phylogenetic methods, analysis of shared sequence and structural signatures, and similarity-based clustering. This analysis resulted in the identification of 26 major families within the AAA+ ATPase class. We also describe the position of the AAA+ ATPases with respect to the RecA/F1, helicase superfamilies I/II, PilT, and ABC classes of P-loop NTPases. The AAA+ class appears to have undergone an early radiation into the clamp-loader, DnaA/Orc/Cdc6, classic AAA, and "pre-sensor 1 beta-hairpin" (PS1BH) clades. Within the PS1BH clade, chelatases, MoxR, YifB, McrB, Dynein-midasin, NtrC, and MCMs form a monophyletic assembly defined by a distinct insert in helix-2 of the conserved ATPase core, and additional helical segment between the core ATPase domain and the C-terminal alpha-helical bundle. At least 6 distinct AAA+ proteins, which represent the different major clades, are traceable to the last universal common ancestor (LUCA) of extant cellular life. Additionally, superfamily III helicases, which belong to the PS1BH assemblage, were probably present at this stage in virus-like "selfish" replicons. The next major radiation, at the base of the two prokaryotic kingdoms, bacteria and archaea, gave rise to several distinct chaperones, ATPase subunits of proteases, DNA helicases, and transcription factors. The third major radiation, at the outset of eukaryotic evolution, contributed to the origin of several eukaryote-specific adaptations related to nuclear and cytoskeletal functions. 
The new relationships and previously undetected domains reported here might provide new leads for investigating the biology of AAA+ ATPases.
Affiliation(s)
- Lakshminarayan M Iyer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
25
Iyer LM, Makarova KS, Koonin EV, Aravind L. Comparative genomics of the FtsK-HerA superfamily of pumping ATPases: implications for the origins of chromosome segregation, cell division and viral capsid packaging. Nucleic Acids Res 2004; 32:5260-79. [PMID: 15466593 PMCID: PMC521647 DOI: 10.1093/nar/gkh828] [Citation(s) in RCA: 246] [Impact Index Per Article: 12.3] [Indexed: 11/12/2022] Open
Abstract
Recently, it has been shown that a predicted P-loop ATPase (the HerA or MlaA protein), which is highly conserved in archaea and also present in many bacteria but absent in eukaryotes, has a bidirectional helicase activity and forms hexameric rings similar to those described for the TrwB ATPase. In this study, the FtsK-HerA superfamily of P-loop ATPases, in which the HerA clade comprises one of the major branches, is analyzed in detail. We show that, in addition to the FtsK and HerA clades, this superfamily includes several families of characterized or predicted ATPases which are predominantly involved in extrusion of DNA and peptides through membrane pores. The DNA-packaging ATPases of various bacteriophages and eukaryotic double-stranded DNA viruses also belong to the FtsK-HerA superfamily. The FtsK protein is the essential bacterial ATPase that is responsible for the correct segregation of daughter chromosomes during cell division. The structural and evolutionary relationship between HerA and FtsK and the nearly perfect complementarity of their phyletic distributions suggest that HerA similarly mediates DNA pumping into the progeny cells during archaeal cell division. It appears likely that the HerA and FtsK families diverged concomitantly with the archaeal-bacterial division and that the last universal common ancestor of modern life forms had an ancestral DNA-pumping ATPase that gave rise to these families. Furthermore, the relationship of these cellular proteins with the packaging ATPases of diverse DNA viruses suggests that a common DNA pumping mechanism might be operational in both cellular and viral genome segregation. The herA gene forms a highly conserved operon with the gene for the NurA nuclease and, in many archaea, also with the orthologs of eukaryotic double-strand break repair proteins MRE11 and Rad50. 
HerA is predicted to function in a complex with these proteins in DNA pumping and repair of double-stranded breaks introduced during this process and, possibly, also during DNA replication. Extensive comparative analysis of the 'genomic context' combined with in-depth sequence analysis led to the prediction of numerous previously unnoticed nucleases of the NurA superfamily, including a specific version that is likely to be the endonuclease component of a novel restriction-modification system. This analysis also led to the identification of previously uncharacterized nucleases, such as a novel predicted nuclease of the Sir2-type Rossmann fold, and phosphatases of the HAD superfamily that are likely to function as partners of the FtsK-HerA superfamily ATPases.
Affiliation(s)
- Lakshminarayan M Iyer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
26
Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 2004; 5:R7. [PMID: 14759257 PMCID: PMC395751 DOI: 10.1186/gb-2004-5-2-r7] [Citation(s) in RCA: 676] [Impact Index Per Article: 33.8] [Received: 10/23/2003] [Revised: 12/01/2003] [Accepted: 12/04/2003] [Indexed: 11/10/2022] Open
Abstract
We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs from seven eukaryotic genomes. The analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes. Background Sequencing the genomes of multiple, taxonomically diverse eukaryotes enables in-depth comparative-genomic analysis which is expected to help in reconstructing ancestral eukaryotic genomes and major events in eukaryotic evolution and in making functional predictions for currently uncharacterized conserved genes. Results We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs (eukaryotic orthologous groups or KOGs) from seven eukaryotic genomes: Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Encephalitozoon cuniculi. Conservation of KOGs through the phyletic range of eukaryotes strongly correlates with their functions and with the effect of gene knockout on the organism's viability. The approximately 40% of KOGs that are represented in six or seven species are enriched in proteins responsible for housekeeping functions, particularly translation and RNA processing. These conserved KOGs are often essential for survival and might approximate the minimal set of essential eukaryotic genes. The 131 single-member, pan-eukaryotic KOGs we identified were examined in detail. For around 20 that remained uncharacterized, functions were predicted by in-depth sequence analysis and examination of genomic context. Nearly all these proteins are subunits of known or predicted multiprotein complexes, in agreement with the balance hypothesis of evolution of gene copy number. 
Other KOGs show a variety of phyletic patterns, which points to major contributions of lineage-specific gene loss and the 'invention' of genes new to eukaryotic evolution. Examination of the sets of KOGs lost in individual lineages reveals co-elimination of functionally connected genes. Parsimonious scenarios of eukaryotic genome evolution and gene sets for ancestral eukaryotic forms were reconstructed. The gene set of the last common ancestor of the crown group consists of 3,413 KOGs and largely includes proteins involved in genome replication and expression, and central metabolism. Only 44% of the KOGs, mostly from the reconstructed gene set of the last common ancestor of the crown group, have detectable homologs in prokaryotes; the remainder apparently evolved via duplication with divergence and invention of new genes. Conclusions The KOG analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes. The results provide quantitative support for major trends of eukaryotic evolution noticed previously at the qualitative level and a basis for detailed reconstruction of evolution of eukaryotic genomes and biology of ancestral forms.
Affiliation(s)
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
27
Abstract
The apparati behind the replication, transcription, and translation of prokaryotic and eukaryotic genes are quite different. Yet in both classes of organisms, genes may be organized in their respective chromosomes in similar ways by virtue of similarly acting selective forces. In addition, some gene organizations reflect biology unique to each class of organisms. Levels of organization are more complex than those of the simple operon. Multiple transcription units may be organized into larger units, local control regions may act over large chromosomal regions in eukaryotic chromosomes, and cis-acting genes may control the expression of downstream genes in all classes of organisms. All these mechanisms lead to genomes being far more organized, in both prokaryotes and eukaryotes, than hitherto imagined.
Affiliation(s)
- Jeffrey G Lawrence
- Pittsburgh Bacteriophage Institute, Department of Biological Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania 15260, USA.
28
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003; 4:41. [PMID: 12969510 PMCID: PMC222959 DOI: 10.1186/1471-2105-4-41] [Citation(s) in RCA: 3221] [Impact Index Per Article: 153.4] [Received: 05/20/2003] [Accepted: 09/11/2003] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies. RESULTS We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or approximately 54% of the analyzed eukaryotic 110,655 gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included into the KOGs; addition of new eukaryotic genomes is expected to result in substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of approximately 20% of the KOG set. 
This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (approximately 1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes. CONCLUSION The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
Affiliation(s)
- Roman L Tatusov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Natalie D Fedorova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- John D Jackson
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Aviva R Jacobs
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Boris Kiryutin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Dmitri M Krylov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Raja Mazumder
- Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20007, USA
- Sergei L Mekhedov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Anastasia N Nikolskaya
- Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20007, USA
- B Sridhar Rao
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Sergei Smirnov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Alexander V Sverdlov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Sona Vasudevan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Jodie J Yin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda MD, USA
- Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, 3900 Reservoir Road, NW, Washington, DC 20007, USA
29
Abstract
The evolution of enzymes and pathways is under debate. Recent studies show that recruitment of single enzymes from different pathways could be the driving force for pathway evolution. Other mechanisms of evolution, such as pathway duplication, enzyme specialization, de novo invention of pathways or retro-evolution of pathways, appear to be less abundant. Twenty percent of enzyme superfamilies are quite variable, not only in changing reaction chemistry or metabolite type but in changing both at the same time. These variable superfamilies account for nearly half of all known reactions. The most frequently occurring metabolites provide a helping hand for such changes because they can be accommodated by many enzyme superfamilies. Thus, a picture is emerging in which new pathways are evolving from central metabolites by preference, thereby keeping the overall topology of the metabolic network.
Affiliation(s)
- Steffen Schmidt
- European Molecular Biology Laboratory Heidelberg, Postfach 102209, Germany
30
Abstract
We searched for genes that could be important for hyperthermophily using a flexible approach to phyletic pattern analysis. We identified 290 clusters of orthologous groups of proteins (COGs) that are preferentially present in archaeal and bacterial hyperthermophiles. Of these, 58 COGs include proteins from at least one bacterium and two archaea, and these were considered to be the best candidates for a specific association with the hyperthermophilic phenotype. Detailed sequence and genome-context analysis of these COGs led to functional predictions for several previously uncharacterized protein families, including a novel group of putative molecular chaperones and a unique transcriptional regulator.
Affiliation(s)
- Kira S Makarova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
31
Rogozin IB, Makarova KS, Murvai J, Czabarka E, Wolf YI, Tatusov RL, Szekely LA, Koonin EV. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res 2002; 30:2212-23. [PMID: 12000841 PMCID: PMC115289 DOI: 10.1093/nar/30.10.2212] [Citation(s) in RCA: 130] [Impact Index Per Article: 5.9] [Indexed: 11/13/2022] Open
Abstract
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon 'genomic hitchhiking'. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
Affiliation(s)
- Igor B Rogozin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
32
Rison SCG, Teichmann SA, Thornton JM. Homology, pathway distance and chromosomal localization of the small molecule metabolism enzymes in Escherichia coli. J Mol Biol 2002; 318:911-32. [PMID: 12054833 DOI: 10.1016/s0022-2836(02)00140-7] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Indexed: 11/22/2022]
Abstract
Here, we analyse Escherichia coli enzymes involved in small molecule metabolism (SMM). We introduce the concept of pathway distance as a measure of the number of distinct metabolic steps separating two SMM enzymes, and we consider protein homology (as determined by assigning enzymes to structural and sequence families) and gene interval (the number of genes separating two genes on the E. coli chromosome). The relationships between these three contexts (pathway distance, homology and chromosomal localisation) are investigated extensively. We make use of these relationships to suggest possible SMM evolution mechanisms. Homology between enzyme pairs close in the SMM was higher than expected by chance but was still rare. When observed, homologues usually conserved their reaction mechanism and/or co-factor binding rather than shared substrate binding. The correlation between pathway distance and gene intervals was clear. Enzymes catalysing nearby SMM reactions were usually encoded by genes close by on the E. coli chromosome. We found many co-regulated blocks of three to four genes (usually non-homologous) encoding enzymes occurring within four metabolic steps of one another; nearly all of these blocks formed part of known or predicted operons. The "inline reuse" of enzymes (i.e. the use of the same enzyme to catalyse two or more different steps of a metabolic pathway) is also discussed: of these enzymes, four were multifunctional (i.e. catalysed a different reaction in each instance), nine had multiple substrate specificity (i.e. catalysed the same reaction on different substrates in each instance) and one catalysed the same reaction on the same substrate but as part of two different complexes. We also identified 59 sets of isozymic proteins most commonly duplicated to function under different conditions, or with a different preferred substrate or minor substrate. 
In addition to transcriptional units, isozymes and inline reuse of enzymes provide mechanisms for controlling the SMM network. Our data suggest that several pathway evolution mechanisms may occur in concert, although chemistry-driven duplication/recruitment is favoured. SMM exploits regulatory strategies involving chromosomal location, isozymes and the reuse of enzymes.
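The gene interval used in this abstract (the number of genes separating two genes on the chromosome) can be illustrated with a minimal Python sketch. The function name `gene_interval`, the 0-based order indices, and the choice to measure along the shorter arc of a circular chromosome are assumptions made for illustration only; the abstract does not state how the authors handled circularity.

```python
def gene_interval(i: int, j: int, n_genes: int, circular: bool = True) -> int:
    """Count the genes lying between the genes at order positions i and j.

    Positions are hypothetical 0-based indices in chromosomal gene order.
    With circular=True (an assumption), the shorter arc of the chromosome is used.
    """
    if not (0 <= i < n_genes and 0 <= j < n_genes):
        raise ValueError("gene positions must lie in [0, n_genes)")
    if i == j:
        return 0
    steps = abs(i - j)                       # number of steps along the linear order
    if circular:
        steps = min(steps, n_genes - steps)  # take the shorter way around
    return steps - 1                         # genes strictly between the two


# Toy example on a 10-gene chromosome (numbers are illustrative, not from the paper).
adjacent = gene_interval(0, 1, 10)
wrapped = gene_interval(0, 9, 10)
linear = gene_interval(0, 9, 10, circular=False)
```

Under these assumptions, adjacent genes have an interval of 0, and the first and last genes of a circular chromosome are also treated as adjacent.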
Affiliation(s)
- Stuart C G Rison
- Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
33
Yanai I, Mellor JC, DeLisi C. Identifying functional links between genes using conserved chromosomal proximity. Trends Genet 2002; 18:176-9. [PMID: 11932011 DOI: 10.1016/s0168-9525(01)02621-x] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.4] [Indexed: 10/18/2022]
Abstract
Conservation of proximity of a pair of genes across multiple genomes generally indicates that their functions could be linked. Here, we present a systematic evaluation using 42 complete microbial genomes from 25 phylogenetic groups to test the reliability of this observation in predicting function for genes. We find a relationship between the number of phylogenetic groups in which a gene pair is proximate and the probability that the pair belongs to a common pathway. Our method produces 1586 links between ortholog families substantiated by observed proximity in genomes representing at least three phylogenetic groups. Of the pairs annotated in the KEGG database, 80% are in the same biological pathway in KEGG.
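The counting idea described in this abstract, linking ortholog families whose proximity is observed in at least three phylogenetic groups, can be sketched roughly as follows. This is not the authors' implementation: the function name `conserved_pairs`, the input layout, the linear (non-circular) gene order, and the definition of "proximate" as lying within `max_gap` positions are all assumptions for illustration. Only the threshold of three phylogenetic groups comes from the abstract.

```python
from collections import defaultdict


def conserved_pairs(genomes, max_gap=1, min_groups=3):
    """Return family pairs that are proximate in at least min_groups phylogenetic groups.

    genomes: iterable of (group_label, [ortholog family ids in chromosomal order]).
    max_gap: hypothetical proximity window (1 means directly adjacent genes).
    """
    groups_per_pair = defaultdict(set)
    for group, order in genomes:
        for idx, fam_a in enumerate(order):
            # Look only at the next max_gap genes downstream of fam_a.
            for fam_b in order[idx + 1: idx + 1 + max_gap]:
                if fam_a != fam_b:
                    groups_per_pair[frozenset((fam_a, fam_b))].add(group)
    # Count each phylogenetic group once, however many of its genomes show the pair.
    return {pair: len(groups) for pair, groups in groups_per_pair.items()
            if len(groups) >= min_groups}


# Toy data: four genomes from three phylogenetic groups (labels are made up).
toy_genomes = [
    ("g1", ["A", "B", "C"]),
    ("g2", ["B", "A", "D"]),
    ("g3", ["C", "A", "B"]),
    ("g3", ["A", "B"]),
]
links = conserved_pairs(toy_genomes)
```

In the toy data only the pair A and B is adjacent in all three groups, so it is the only link retained.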
Affiliation(s)
- Itai Yanai
- Bioinformatics Graduate Program and Dept of Biomedical Engineering, Boston University, Boston, MA 02215, USA.
34
Abstract
With the increasing availability of genome sequences, new methods are being proposed that exploit information from complete genomes to classify species in a phylogeny. Here we present SHOT, a web server for the classification of genomes on the basis of shared gene content or the conservation of gene order that reflects the dominant, phylogenetic signal in these genomic properties. In general, the genome trees are consistent with classical gene-based phylogenies, although some interesting exceptions indicate massive horizontal gene transfer. SHOT is a useful tool for analysing the tree of life from a genomic point of view. It is available at http://www.Bork.EMBL-Heidelberg.de/SHOT.
Affiliation(s)
- Jan O Korbel
- EMBL, Meyerhofstrasse 1, 69117, Heidelberg, Germany.
35
Yanai I, Wolf YI, Koonin EV. Evolution of gene fusions: horizontal transfer versus independent events. Genome Biol 2002; 3:research0024. [PMID: 12049665 PMCID: PMC115226 DOI: 10.1186/gb-2002-3-5-research0024] [Citation(s) in RCA: 62] [Impact Index Per Article: 2.8] [Received: 11/12/2001] [Revised: 02/07/2002] [Accepted: 03/26/2002] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Gene fusions can be used as tools for functional prediction and also as evolutionary markers. Fused genes often show a scattered phyletic distribution, which suggests a role for processes other than vertical inheritance in their evolution. RESULTS The evolutionary history of gene fusions was studied by phylogenetic analysis of the domains in the fused proteins and the orthologous domains that form stand-alone proteins. Clustering of fusion components from phylogenetically distant species was construed as evidence of dissemination of the fused genes by horizontal transfer. Of the 51 examined gene fusions that are represented in at least two of the three primary kingdoms (Bacteria, Archaea and Eukaryota), 31 were most probably disseminated by cross-kingdom horizontal gene transfer, whereas 14 appeared to have evolved independently in different kingdoms and two were probably inherited from the common ancestor of modern life forms. On many occasions, the evolutionary scenario also involves one or more secondary fissions of the fusion gene. For approximately half of the fusions, stand-alone forms of the fusion components are encoded by juxtaposed genes, which are known or predicted to belong to the same operon in some of the prokaryotic genomes. This indicates that evolution of gene fusions often, if not always, involves an intermediate stage, during which the future fusion components exist as juxtaposed and co-regulated, but still distinct, genes within operons. CONCLUSION These findings suggest a major role for horizontal transfer of gene fusions in the evolution of protein-domain architectures, but also indicate that independent fusions of the same pair of domains in distant species is not uncommon, which suggests positive selection for the multidomain architectures.
MESH Headings
- DNA, Archaeal/genetics
- DNA, Bacterial/genetics
- DNA, Fungal/genetics
- Databases, Genetic
- Evolution, Molecular
- Gene Transfer, Horizontal/genetics
- Genes, Archaeal/genetics
- Genes, Bacterial/genetics
- Genes, Fungal/genetics
- Genome
- Genome, Bacterial
- Genome, Fungal
- Phylogeny
- Recombination, Genetic/genetics
- Sequence Homology, Nucleic Acid
Affiliation(s)
- Itai Yanai
- Bioinformatics Graduate Program and Department of Biomedical Engineering, Boston University, Boston, MA 02215, USA
- Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
36
Snel B, Bork P, Huynen MA. Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res 2002; 12:17-25. [PMID: 11779827 DOI: 10.1101/gr.176501] [Citation(s) in RCA: 272] [Impact Index Per Article: 12.4] [Indexed: 11/25/2022]
Abstract
In the course of evolution, genomes are shaped by processes like gene loss, gene duplication, horizontal gene transfer, and gene genesis (the de novo origin of genes). Here we reconstruct the gene content of ancestral Archaea and Proteobacteria and quantify the processes connecting them to their present day representatives based on the distribution of genes in completely sequenced genomes. We estimate that the ancestor of the Proteobacteria contained around 2500 genes, and the ancestor of the Archaea around 2050 genes. Although it is necessary to invoke horizontal gene transfer to explain the content of present day genomes, gene loss, gene genesis, and simple vertical inheritance are quantitatively the most dominant processes in shaping the genome. Together they result in a turnover of gene content such that even the lineage leading from the ancestor of the Proteobacteria to the relatively large genome of Escherichia coli has lost at least 950 genes. Gene loss, unlike the other processes, correlates fairly well with time. This clock-like behavior suggests that gene loss is under negative selection, while the processes that add genes are under positive selection.
Affiliation(s)
- Berend Snel
- European Molecular Biology Laboratory, 69117 Heidelberg, Germany.
37
Wolf YI, Rogozin IB, Grishin NV, Tatusov RL, Koonin EV. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol Biol 2001; 1:8. [PMID: 11734060 PMCID: PMC60490 DOI: 10.1186/1471-2148-1-8] [Citation(s) in RCA: 234] [Impact Index Per Article: 10.2] [Received: 09/20/2001] [Accepted: 10/23/2001] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND The availability of multiple complete genome sequences from diverse taxa prompts the development of new phylogenetic approaches, which attempt to incorporate information derived from comparative analysis of complete gene sets or large subsets thereof. Such attempts are particularly relevant because of the major role of horizontal gene transfer and lineage-specific gene loss, at least in the evolution of prokaryotes. RESULTS Five largely independent approaches were employed to construct trees for completely sequenced bacterial and archaeal genomes: i) presence-absence of genomes in clusters of orthologous genes; ii) conservation of local gene order (gene pairs) among prokaryotic genomes; iii) parameters of identity distribution for probable orthologs; iv) analysis of concatenated alignments of ribosomal proteins; v) comparison of trees constructed for multiple protein families. All constructed trees support the separation of the two primary prokaryotic domains, bacteria and archaea, as well as some terminal bifurcations within the bacterial and archaeal domains. Beyond these obvious groupings, the trees made with different methods appeared to differ substantially in terms of the relative contributions of phylogenetic relationships and similarities in gene repertoires caused by similar life styles and horizontal gene transfer to the tree topology. The trees based on presence-absence of genomes in orthologous clusters and the trees based on conserved gene pairs appear to be strongly affected by gene loss and horizontal gene transfer. The trees based on identity distributions for orthologs and particularly the tree made of concatenated ribosomal protein sequences seemed to carry a stronger phylogenetic signal. The latter tree supported three potential high-level bacterial clades: i) Chlamydia-Spirochetes, ii) Thermotogales-Aquificales (bacterial hyperthermophiles), and iii) Actinomycetes-Deinococcales-Cyanobacteria. 
The latter group also appeared to join the low-GC Gram-positive bacteria at a deeper tree node. These new groupings of bacteria were supported by the analysis of alternative topologies in the concatenated ribosomal protein tree using the Kishino-Hasegawa test and by a census of the topologies of 132 individual groups of orthologous proteins. Additionally, the results of this analysis put into question the sister-group relationship between the two major archaeal groups, Euryarchaeota and Crenarchaeota, and suggest instead that Euryarchaeota might be a paraphyletic group with respect to Crenarchaeota. CONCLUSIONS We conclude that, the extensive horizontal gene flow and lineage-specific gene loss notwithstanding, extension of phylogenetic analysis to the genome scale has the potential of uncovering deep evolutionary relationships between prokaryotic lineages.
Affiliation(s)
- Yuri I Wolf
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Igor B Rogozin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Nick V Grishin
- Howard Hughes Medical Institute and Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390-9050, USA
- Roman L Tatusov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
- Eugene V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
38
Teichmann SA, Rison SC, Thornton JM, Riley M, Gough J, Chothia C. The evolution and structural anatomy of the small molecule metabolic pathways in Escherichia coli. J Mol Biol 2001; 311:693-708. [PMID: 11518524 DOI: 10.1006/jmbi.2001.4912] [Citation(s) in RCA: 74] [Impact Index Per Article: 3.2] [Indexed: 11/22/2022]
Abstract
The 106 small molecule metabolic (SMM) pathways in Escherichia coli are formed by the protein products of 581 genes. We can define 722 domains, nearly all of which are homologous to proteins of known structure, that form all or part of 510 of these proteins. This information allows us to answer general questions on the structural anatomy of the SMM pathway proteins and to trace family relationships and recruitment events within and across pathways. Half the gene products contain a single domain and half are formed by combinations of between two and six domains. The 722 domains belong to one of 213 families that have between one and 51 members. Family members usually conserve their catalytic or cofactor binding properties; substrate recognition is rarely conserved. Of the 213 families, members of only a quarter occur in isolation, i.e. they form single-domain proteins. Most members of the other families combine with domains from just one or two other families and a few more versatile families can combine with several different partners. Excluding isoenzymes, more than twice as many homologues are distributed across pathways as within pathways. However, serial recruitment, with two consecutive enzymes both being recruited to another pathway, is rare and recruitment of three consecutive enzymes is not observed. Only eight of the 106 pathways have a high number of homologues. Homology between consecutive pairs of enzymes with conservation of the main substrate-binding site but change in catalytic mechanism (which would support a simple model of retrograde pathway evolution) occurs only six times in the whole set of enzymes. Most of the domains that form SMM pathways have homologues in non-SMM pathways. Taken together, these results imply a pervasive "mosaic" model for the formation of protein repertoires and pathways.
Affiliation(s)
- S A Teichmann
- Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK.
39
Koonin EV, Wolf YI, Aravind L. Prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach. Genome Res 2001; 11:240-52. [PMID: 11157787 PMCID: PMC311015 DOI: 10.1101/gr.162001] [Citation(s) in RCA: 205] [Impact Index Per Article: 8.9] [Indexed: 11/24/2022]
Abstract
By comparing the gene order in the completely sequenced archaeal genomes complemented by sequence profile analysis, we predict the existence and protein composition of the archaeal counterpart of the eukaryotic exosome, a complex of RNAses, RNA-binding proteins, and helicases that mediates processing and 3'->5' degradation of a variety of RNA species. The majority of the predicted archaeal exosome subunits are encoded in what appears to be a previously undetected superoperon. In Methanobacterium thermoautotrophicum, this predicted superoperon consists of 15 genes; in the Crenarchaea, Sulfolobus solfataricus and Aeropyrum pernix, one and two of the genes from the superoperon, respectively, are relocated in the genome, whereas in other Euryarchaeota, the superoperon is split into a variable number of predicted operons and solitary genes. Methanococcus jannaschii partially retains the superoperon, but lacks the three core exosome subunits, and in Halobacterium sp., the superoperon is divided into two predicted operons, with the same three exosome subunits missing. This suggests concerted gene loss and an alteration of the structure and function of the predicted exosome in the Methanococcus and Halobacterium lineages. Additional potential components of the exosome are encoded by partially conserved predicted small operons. Along with the orthologs of eukaryotic exosome subunits, namely an RNase PH and two RNA-binding proteins, the predicted archaeal exosomal superoperon also encodes orthologs of two protein subunits of RNase P. This suggests a functional and possibly a physical interaction between RNase P and the postulated archaeal exosome, a connection that has not been reported in eukaryotes. In a pattern of apparent gene loss complementary to that seen in Methanococcus and Halobacterium, Thermoplasma acidophilum lacks the RNase P subunits. 
Unexpectedly, the identified exosomal superoperon, in addition to the predicted exosome components, encodes the catalytic subunits of the archaeal proteasome, two ribosomal proteins and a DNA-directed RNA polymerase subunit. These observations suggest that in archaea, a tight functional coupling exists between translation, RNA processing and degradation (apparently mediated by the predicted exosome) and protein degradation (mediated by the proteasome), and may have implications for cross-talk between these processes in eukaryotes.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
40
Tamames J. Evolution of gene order conservation in prokaryotes. Genome Biol 2001; 2:RESEARCH0020. [PMID: 11423009 PMCID: PMC33396 DOI: 10.1186/gb-2001-2-6-research0020] [Citation(s) in RCA: 137] [Impact Index Per Article: 6.0] [Received: 03/07/2001] [Revised: 04/09/2001] [Accepted: 04/12/2001] [Indexed: 11/11/2022] Open
Abstract
BACKGROUND As more complete genomes are sequenced, conservation of gene order between different organisms is emerging as an informative property of the genomes. Conservation of gene order has been used for predicting function and functional interactions of proteins, as well as for studying the evolutionary relationships between genomes. The reasons for the maintenance of gene order are still not well understood, as the organization of the prokaryote genome into operons and lateral gene transfer cannot possibly account for all the instances of conservation found. Comprehensive studies of gene order are one way of elucidating the nature of these maintaining forces. RESULTS Gene order is extensively conserved between closely related species, but rapidly becomes less conserved among more distantly related organisms, probably in a cooperative fashion. This trend could be universal in prokaryotic genomes, as archaeal genomes are likely to behave similarly to bacterial genomes. Gene order conservation could therefore be used as a valid phylogenetic measure to study relationships between species. Even between very distant species, remnants of gene order conservation exist in the form of highly conserved clusters of genes. This suggests the existence of selective processes that maintain the organization of these regions. Because the clusters often span more than one operon, common regulation probably cannot be invoked as the cause of the maintenance of gene order. CONCLUSIONS Gene order conservation is a genomic measure that can be useful for studying relationships between prokaryotes and the evolutionary forces shaping their genomes. Gene organization is extensively conserved in some genomic regions, and further studies are needed to elucidate the reason for this conservation.
Affiliation(s)
- J Tamames
- Centro de Astrobiología, INTA/CSIC, Carretera de Ajalvir Km, 4, 28850 Torrejón de Ardoz, Madrid, Spain.
41
Abstract
Conservation of gene order in prokaryotes has become important in predicting protein function because, over the evolutionary timescale, genomes are shuffled so that local gene-order conservation reflects the functional constraints within the protein. Here, we compare closely related genomes to identify the rate with which gene order is disrupted and to infer the genes involved in the genome rearrangement.
Affiliation(s)
- M Suyama
- EMBL, Meyerhofstr. 1, D-69012 Heidelberg, Germany
42
Snel B, Lehmann G, Bork P, Huynen MA. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res 2000; 28:3442-4. [PMID: 10982861 PMCID: PMC110752 DOI: 10.1093/nar/28.18.3442] [Citation(s) in RCA: 795] [Impact Index Per Article: 33.1] [Received: 07/14/2000] [Accepted: 08/02/2000] [Indexed: 11/14/2022] Open
Abstract
The repeated occurrence of genes in each other's neighbourhood on genomes has been shown to indicate a functional association between the proteins they encode. Here we introduce STRING (search tool for recurring instances of neighbouring genes), a tool to retrieve and display the genes a query gene repeatedly occurs with in clusters on the genome. The tool performs iterative searches and visualises the results in their genomic context. By finding the genomically associated genes for a query, it delineates a set of potentially functionally associated genes. The usefulness of STRING is illustrated with an example that suggests a functional context for an RNA methylase with unknown specificity.
Affiliation(s)
- B Snel
- European Molecular Biology Laboratory, Meyerhofstrasse 1, D-69117 Heidelberg, Germany.
43
Huynen M, Snel B, Lathe W, Bork P. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res 2000; 10:1204-10. [PMID: 10958638 PMCID: PMC310926 DOI: 10.1101/gr.10.8.1204] [Citation(s) in RCA: 347] [Impact Index Per Article: 14.5] [Indexed: 11/24/2022]
Abstract
Various new methods have been proposed to predict functional interactions between proteins based on the genomic context of their genes. The types of genomic context that they use are Type I: the fusion of genes; Type II: the conservation of gene-order or co-occurrence of genes in potential operons; and Type III: the co-occurrence of genes across genomes (phylogenetic profiles). Here we compare these types for their coverage, their correlations with various types of functional interaction, and their overlap with homology-based function assignment. We apply the methods to Mycoplasma genitalium, the standard benchmarking genome in computational and experimental genomics. Quantitatively, conservation of gene order is the technique with the highest coverage, applying to 37% of the genes. By combining gene order conservation with gene fusion (6%), the co-occurrence of genes in operons in absence of gene order conservation (8%), and the co-occurrence of genes across genomes (11%), significant context information can be obtained for 50% of the genes (the categories overlap). Qualitatively, we observe that the functional interactions between genes are stronger as the requirements for physical neighborhood on the genome are more stringent, while the fraction of potential false positives decreases. Moreover, only in cases in which gene order is conserved in a substantial fraction of the genomes, in this case six out of twenty-five, does a single type of functional interaction (physical interaction) clearly dominate (>80%). In other cases, complementary function information from homology searches, which is available for most of the genes with significant genomic context, is essential to predict the type of interaction. Using a combination of genomic context and homology searches, new functional features can be predicted for 10% of M. genitalium genes.
Affiliation(s)
- M Huynen
- European Molecular Biology Laboratory, 69117 Heidelberg, Germany.
44
Comparative Genome Analysis: Exploiting the Context of Genes to Infer Evolution and Predict Function. COMPARATIVE GENOMICS 2000. [DOI: 10.1007/978-94-011-4309-7_25] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/11/2023]