1
Martinů J, Tarabai H, Štefka J, Hypša V. Highly Resolved Genomes of Two Closely Related Lineages of the Rodent Louse Polyplax serrata with Different Host Specificities. Genome Biol Evol 2024; 16:evae045. [PMID: 38478715 PMCID: PMC10972687 DOI: 10.1093/gbe/evae045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/27/2024] [Indexed: 04/01/2024] [Open Access]
Abstract
Sucking lice of the parvorder Anoplura are permanent ectoparasites with specific lifestyle and highly derived features. Currently, genomic data are only available for a single species, the human louse Pediculus humanus. Here, we present genomes of two distinct lineages, with different host spectra, of a rodent louse Polyplax serrata. Genomes of these ecologically different lineages are closely similar in gene content and display a conserved order of genes, with the exception of a single translocation. Compared with P. humanus, the P. serrata genomes are noticeably larger (139 vs. 111 Mbp) and encode a higher number of genes. Similar to P. humanus, they are reduced in sensory-related categories such as vision and olfaction. Utilizing genome-wide data, we perform phylogenetic reconstruction and evolutionary dating of the P. serrata lineages. Obtained estimates reveal their relatively deep divergence (∼6.5 Mya), comparable with the split between the human and chimpanzee lice P. humanus and Pediculus schaeffi. This supports the view that the P. serrata lineages are likely to represent two cryptic species with different host spectra. Historical demographies show glaciation-related population size (Ne) reduction, but recent restoration of Ne was seen only in the less host-specific lineage. Together with the louse genomes, we analyze genomes of their bacterial symbiont Legionella polyplacis and evaluate their potential complementarity in synthesis of amino acids and B vitamins. We show that both systems, Polyplax/Legionella and Pediculus/Riesia, display almost identical patterns, with symbionts involved in synthesis of B vitamins but not amino acids.
Affiliation(s)
- Jana Martinů
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
- Hassan Tarabai
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Central European Institute of Technology (CEITEC), University of Veterinary Sciences, Brno, Czech Republic
- Jan Štefka
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre, The Czech Academy of Sciences, České Budějovice, Czech Republic
- Václav Hypša
  - Department of Parasitology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic
  - Institute of Parasitology, Biology Centre, The Czech Academy of Sciences, České Budějovice, Czech Republic

2
Taylor WR. Algorithms for matching partially labelled sequence graphs. Algorithms Mol Biol 2017; 12:24. [PMID: 29021818 PMCID: PMC5613400 DOI: 10.1186/s13015-017-0115-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/23/2017] [Accepted: 09/06/2017] [Indexed: 11/25/2022] [Open Access]
Abstract
BACKGROUND: In order to find correlated pairs of positions between proteins, which are useful in predicting interactions, it is necessary to concatenate two large multiple sequence alignments such that the sequences that are joined together belong to those that interact in their species of origin. When each protein is unique then the species name is sufficient to guide this match; however, when there are multiple related sequences (paralogs) in each species then the pairing is more difficult. In bacteria a good guide can be gained from genome co-location, as interacting proteins tend to be in a common operon, but in eukaryotes this simple principle is not sufficient.
RESULTS: The methods developed in this paper take sets of paralogs for different proteins found in the same species and make a pairing based on their evolutionary distance relative to a set of other proteins that are unique and so have a known relationship (singletons). The former constitute a set of unlabelled nodes in a graph while the latter are labelled. Two variants were tested, one based on a phylogenetic tree of the sequences (the topology-based method) and a simpler, faster variant based only on the inter-sequence distances (the distance-based method). Over a set of test proteins, both gave good results, with the topology method performing slightly better.
CONCLUSIONS: The methods developed here still need refinement and augmentation from constraints other than the sequence data alone, such as known interactions from annotation and databases, or non-trivial relationships in genome location. With the ever growing numbers of eukaryotic genomes, it is hoped that the methods described here will open a route to the use of these data equal to the current success attained with bacterial sequences.

3
Puggioni V, Dondi A, Folli C, Shin I, Rhee S, Percudani R. Gene Context Analysis Reveals Functional Divergence between Hypothetically Equivalent Enzymes of the Purine–Ureide Pathway. Biochemistry 2014; 53:735-45. [DOI: 10.1021/bi4010107] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Indexed: 11/29/2022]
Affiliation(s)
- Vincenzo Puggioni
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Ambra Dondi
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Claudia Folli
  - Department of Food Science, University of Parma, Italy
- Inchul Shin
  - Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Sangkee Rhee
  - Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Riccardo Percudani
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy

4
Lemay DG, Martin WF, Hinrichs AS, Rijnkels M, German JB, Korf I, Pollard KS. G-NEST: a gene neighborhood scoring tool to identify co-conserved, co-expressed genes. BMC Bioinformatics 2012; 13:253. [PMID: 23020263 PMCID: PMC3575404 DOI: 10.1186/1471-2105-13-253] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 03/31/2012] [Accepted: 09/23/2012] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: In previous studies, gene neighborhoods (spatial clusters of co-expressed genes in the genome) have been defined using arbitrary rules such as requiring adjacency, a minimum number of genes, a fixed window size, or a minimum expression level. In the current study, we developed a Gene Neighborhood Scoring Tool (G-NEST) which combines genomic location, gene expression, and evolutionary sequence conservation data to score putative gene neighborhoods across all possible window sizes simultaneously.
RESULTS: Using G-NEST on atlases of mouse and human tissue expression data, we found that large neighborhoods of ten or more genes are extremely rare in mammalian genomes. When they do occur, neighborhoods are typically composed of families of related genes. Both the highest scoring and the largest neighborhoods in mammalian genomes are formed by tandem gene duplication. Mammalian gene neighborhoods contain highly and variably expressed genes. Co-localized noisy gene pairs exhibit lower evolutionary conservation of their adjacent genome locations, suggesting that their shared transcriptional background may be disadvantageous. Genes that are essential to mammalian survival and reproduction are less likely to occur in neighborhoods, although neighborhoods are enriched with genes that function in mitosis. We also found that gene orientation and protein-protein interactions are partially responsible for maintenance of gene neighborhoods.
CONCLUSIONS: Our experiments using G-NEST confirm that tandem gene duplication is the primary driver of non-random gene order in mammalian genomes. Non-essentiality, co-functionality, gene orientation, and protein-protein interactions are additional forces that maintain gene neighborhoods, especially those formed by tandem duplicates. We expect G-NEST to be useful for other applications such as the identification of core regulatory modules, common transcriptional backgrounds, and chromatin domains. The software is available at http://docpollard.org/software.html.
Affiliation(s)
- Danielle G Lemay
  - Genome Center, University of California Davis, 451 Health Science Dr, Davis, CA, 95616, United States of America.

5
Arnold R, Boonen K, Sun MG, Kim PM. Computational analysis of interactomes: current and future perspectives for bioinformatics approaches to model the host-pathogen interaction space. Methods 2012; 57:508-18. [PMID: 22750305 PMCID: PMC7128575 DOI: 10.1016/j.ymeth.2012.06.011] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Received: 03/13/2012] [Revised: 06/20/2012] [Accepted: 06/21/2012] [Indexed: 11/05/2022] [Open Access]
Abstract
Bacterial and viral pathogens affect their eukaryotic host partly by interacting with proteins of the host cell. Hence, to investigate infection from a systems' perspective we need to construct complete and accurate host-pathogen protein-protein interaction networks. Because of the paucity of available data and the cost associated with experimental approaches, any construction and analysis of such a network in the near future has to rely on computational predictions. Specifically, this challenge consists of a number of sub-problems: First, prediction of possible pathogen interactors (e.g. effector proteins) is necessary for bacteria and protozoa. Second, the prospective host binding partners have to be determined and finally, the impact on the host cell analyzed. This review gives an overview of current bioinformatics approaches to obtain and understand host-pathogen interactions. As an application example of the methods covered, we predict host-pathogen interactions of Salmonella and discuss the value of these predictions as a prospective for further research.
Affiliation(s)
- Roland Arnold
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada M5S 3E1
- Kurt Boonen
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada M5S 3E1
- Mark G.F. Sun
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada M5S 3E1
- Philip M. Kim
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada M5S 3E1
  - Banting and Best Department of Medical Research, University of Toronto, Toronto, ON, Canada M5S 3E1
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada M5S 3E1
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada M5S 3E1

6
Chen Y, Mao F, Li G, Xu Y. Genome-wide discovery of missing genes in biological pathways of prokaryotes. BMC Bioinformatics 2011; 12 Suppl 1:S1. [PMID: 21342538 PMCID: PMC3044263 DOI: 10.1186/1471-2105-12-s1-s1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 12/16/2022] [Open Access]
Affiliation(s)
- Yong Chen
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA.

7
Ovacik MA, Androulakis IP. Enzyme sequence similarity improves the reaction alignment method for cross-species pathway comparison. Toxicol Appl Pharmacol 2010; 271:363-71. [PMID: 20851138 DOI: 10.1016/j.taap.2010.09.009] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 06/16/2010] [Revised: 08/24/2010] [Accepted: 09/10/2010] [Indexed: 11/30/2022]
Abstract
Pathway-based information has become an important source of information for both establishing evolutionary relationships and understanding the mode of action of a chemical or pharmaceutical among species. Cross-species comparison of pathways can address two broad questions: comparison in order to inform evolutionary relationships and to extrapolate species differences used in a number of different applications including drug and toxicity testing. Cross-species comparison of metabolic pathways is complex as there are multiple features of a pathway that can be modeled and compared. Among the various methods that have been proposed, reaction alignment has emerged as the most successful at predicting phylogenetic relationships based on NCBI taxonomy. We propose an improvement of the reaction alignment method by accounting for sequence similarity in addition to reaction alignment method. Using nine species, including human and some model organisms and test species, we evaluate the standard and improved comparison methods by analyzing glycolysis and citrate cycle pathways conservation. In addition, we demonstrate how organism comparison can be conducted by accounting for the cumulative information retrieved from nine pathways in central metabolism as well as a more complete study involving 36 pathways common in all nine species. Our results indicate that reaction alignment with enzyme sequence similarity results in a more accurate representation of pathway specific cross-species similarities and differences based on NCBI taxonomy.
Affiliation(s)
- Meric A Ovacik
  - Chemical and Biochemical Engineering Department, Rutgers University, Piscataway, NJ 08854, USA

8
Ling X, He X, Xin D. Detecting gene clusters under evolutionary constraint in a large number of genomes. Bioinformatics 2009; 25:571-7. [PMID: 19158161 DOI: 10.1093/bioinformatics/btp027] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Spatial clusters of genes conserved across multiple genomes provide important clues to gene functions and evolution of genome organization. Existing methods of identifying these clusters often made restrictive assumptions, such as exact conservation of gene order, and relied on heuristic algorithms.
RESULTS: We developed a very efficient algorithm based on a 'gene teams' model that allows genes in the clusters to appear in different orders. This allows us to detect conserved gene clusters under flexible evolutionary constraints in a large number of genomes. Our statistical evaluation incorporates the evolutionary relationship among genomes, a key aspect that has been missing in most previous studies. We conducted a large-scale analysis of 133 bacterial genomes. Our results confirm that our approach is an effective way of uncovering functionally related genes. The comparison with known operons and the analysis of the structural properties of our predicted clusters suggest that operons are an important source of constraint, but there are also other forces that determine evolution of gene order and arrangement. Using our method, we predicted functions of many poorly characterized genes in bacteria. The combined algorithmic and statistical methods we present here provide a rigorous framework for systematically studying evolutionary constraints of genomic contexts.
AVAILABILITY: The software, data and the full results of this article are available online at http://www.ews.uiuc.edu/~xuling/mcmusec.
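The 'gene teams' idea named in this abstract can be illustrated with a small sketch. The code below is not the authors' algorithm, which the abstract does not describe; it is a naive partition-refinement version written for illustration. It assumes that each gene occurs exactly once per genome, that positions are integers, and that a hypothetical gap threshold `delta` is given: a team is then a maximal set of shared genes that, in every genome, forms a chain whose neighbouring positions are at most `delta` apart, whatever the gene order.

```python
def split_by_gaps(genes, pos, delta):
    """Split genes into runs whose neighbouring positions differ by <= delta."""
    ordered = sorted(genes, key=pos.__getitem__)
    groups, current = [], [ordered[0]]
    for prev, gene in zip(ordered, ordered[1:]):
        if pos[gene] - pos[prev] > delta:
            groups.append(current)
            current = []
        current.append(gene)
    groups.append(current)
    return groups


def gene_teams(genomes, delta):
    """Return maximal gene sets that stay clustered in every genome.

    genomes: list of dicts mapping gene name -> integer position
             (illustrative input format, assumed for this sketch).
    delta:   largest allowed gap between neighbouring team members.
    """
    common = set.intersection(*(set(g) for g in genomes))
    teams, stack = [], [common]
    while stack:
        genes = stack.pop()
        if not genes:
            continue
        for pos in genomes:
            groups = split_by_gaps(genes, pos, delta)
            if len(groups) > 1:
                # This genome breaks the set apart: refine each piece separately.
                stack.extend(set(g) for g in groups)
                break
        else:
            # No genome splits the set, so it is a team (order may differ).
            teams.append(frozenset(genes))
    return teams


# Toy genomes: a, b, c stay adjacent in both, in a different order.
genome_a = {"a": 1, "b": 2, "c": 3, "d": 10, "e": 11}
genome_b = {"c": 1, "a": 2, "b": 3, "d": 8, "e": 20}
for team in gene_teams([genome_a, genome_b], delta=2):
    print(sorted(team))
```

In the toy input, `d` and `e` are neighbours in the first genome but far apart in the second, so they end up as singleton teams, while `a`, `b`, `c` form one team despite the rearranged order.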
Affiliation(s)
- Xu Ling
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA.

9
Abstract
Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, genes with similar catalytic activity are not necessarily similar in sequence, and therefore the traditional sequence similarity-based approach often fails to identify the relevant enzymes, thus hindering efforts to map the metabolome of an organism. Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
Affiliation(s)
- Umar Syed
  - Department of Computer Science, Princeton University, Princeton, NJ, USA

10
Karimpour-Fard A, Leach SM, Gill RT, Hunter LE. Predicting protein linkages in bacteria: which method is best depends on task. BMC Bioinformatics 2008; 9:397. [PMID: 18816389 PMCID: PMC2570368 DOI: 10.1186/1471-2105-9-397] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/14/2008] [Accepted: 09/24/2008] [Indexed: 01/06/2023] [Open Access]
Abstract
Background: Applications of computational methods for predicting protein functional linkages are increasing. In recent years, several bacteria-specific methods for predicting linkages have been developed. The four major genomic context methods are: Gene cluster, Gene neighbor, Rosetta Stone, and Phylogenetic profiles. These methods have been shown to be powerful tools and this paper provides guidelines for when each method is appropriate by exploring different features of each method and potential improvements offered by their combination. We also review many previous treatments of these prediction methods, use the latest available annotations, and offer a number of new observations.
Results: Using Escherichia coli K12 and Bacillus subtilis, linkage predictions made by each of these methods were evaluated against three benchmarks: functional categories defined by COG and KEGG, known pathways listed in EcoCyc, and known operons listed in RegulonDB. Each evaluated method had strengths and weaknesses, with no one method dominating all aspects of predictive ability studied. For functional categories, as previous studies have shown, the Rosetta Stone method was individually best at detecting linkages and predicting functions among proteins with shared KEGG categories while the Phylogenetic profile method was best for linkage detection and function prediction among proteins with common COG functions. Differences in performance under COG versus KEGG may be attributable to the presence of paralogs. Better function prediction was observed when using a weighted combination of linkages based on reliability versus using a simple unweighted union of the linkage sets. For pathway reconstruction, 99 complete metabolic pathways in E. coli K12 (out of the 209 known, non-trivial pathways) and 193 pathways with 50% of their proteins were covered by linkages from at least one method. Gene neighbor was most effective individually on pathway reconstruction, with 48 complete pathways reconstructed. For operon prediction, Gene cluster predicted completely 59% of the known operons in E. coli K12 and 88% (333/418) in B. subtilis. Comparing two versions of the E. coli K12 operon database, many of the unannotated predictions in the earlier version were updated to true predictions in the later version. Using only linkages found by both Gene Cluster and Gene Neighbor improved the precision of operon predictions. Additionally, as previous studies have shown, combining features based on intergenic region and protein function improved the specificity of operon prediction.
Conclusion: A common problem for computational methods is the generation of a large number of false positives that might be caused by an incomplete source of validation. By comparing two versions of a database, we demonstrated the dramatic differences on reported results. We used several benchmarks on which we have shown the comparative effectiveness of each prediction method, as well as provided guidelines as to which method is most appropriate for a given prediction task.
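Of the four genomic context methods listed in this abstract, phylogenetic profiling is the easiest to sketch in a few lines. The example below is a generic illustration, not the procedure evaluated in the paper: each gene gets a presence/absence vector across a fixed list of genomes, and two genes are called linked when their vectors are similar enough. The Jaccard score, the threshold value, and the gene names are assumptions chosen for the example.

```python
def jaccard(p, q):
    """Jaccard similarity of two presence/absence profiles of equal length."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def linked_pairs(profiles, threshold):
    """Return gene pairs whose profiles reach the similarity threshold.

    profiles:  dict mapping gene name -> tuple of 0/1 flags, one flag per
               reference genome (hypothetical input format).
    threshold: minimum Jaccard similarity for calling a linkage.
    """
    names = sorted(profiles)
    return [
        (x, y)
        for i, x in enumerate(names)
        for y in names[i + 1:]
        if jaccard(profiles[x], profiles[y]) >= threshold
    ]


# Toy profiles over four genomes: geneA and geneB co-occur, geneC does not.
profiles = {
    "geneA": (1, 1, 0, 1),
    "geneB": (1, 1, 0, 1),
    "geneC": (0, 0, 1, 0),
}
print(linked_pairs(profiles, threshold=0.75))
```

A weighted combination of several methods, as the abstract recommends, would replace the single threshold with per-method reliability scores; that step is left out here.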
Affiliation(s)
- Anis Karimpour-Fard
  - Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado 80045, USA.

11
Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008; 24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods.
RESULTS: The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes.
AVAILABILITY: The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
Affiliation(s)
- Igor V Tetko
  - Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany.

12
Wu H, Mao F, Olman V, Xu Y. Hierarchical classification of functionally equivalent genes in prokaryotes. Nucleic Acids Res 2007; 35:2125-40. [PMID: 17353185 PMCID: PMC1874638 DOI: 10.1093/nar/gkl1114] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 09/22/2006] [Revised: 11/15/2006] [Accepted: 12/06/2006] [Indexed: 11/20/2022] [Open Access]
Abstract
Functional classification of genes represents a fundamental problem to many biological studies. Most of the existing classification schemes are based on the concepts of homology and orthology, which were originally introduced to study gene evolution but might not be the most appropriate for gene function prediction, particularly at high resolution level. We have recently developed a scheme for hierarchical classification of genes (HCGs) in prokaryotes. In the HCG scheme, the functional equivalence relationships among genes are first assessed through a careful application of both sequence similarity and genomic neighborhood information; and genes are then classified into a hierarchical structure of clusters, where genes in each cluster are functionally equivalent at some resolution level, and the level of resolution goes higher as the clusters become increasingly smaller traveling down the hierarchy. The HCG scheme is validated through comparisons with the taxonomy of the prokaryotic genomes, Clusters of Orthologous Groups (COGs) of genes and the Pfam system. We have applied the HCG scheme to 224 complete prokaryotic genomes, and constructed a HCG database consisting of a forest of 5339 multi-level and 15 770 single-level trees of gene clusters covering approximately 93% of the genes of these 224 genomes. The validation results indicate that the HCG scheme not only captures the key features of the existing classification schemes but also provides a much richer organization of genes which can be used for functional prediction of genes at higher resolution and to help reveal evolutionary trace of the genes.
Affiliation(s)
- Ying Xu
  - Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA

13
Abstract
We present here a method to identify microsyntenies across several genomes. This method adopts the innovative approach of deconstructing proteins into their domains. This allows the detection of strings of domains that are conserved in their content, but not necessarily in their order, that we refer to as domain teams or syntenies of domains. The prominent feature of the method is that it relaxes the rigidity of the orthology criterion and avoids many of the pitfalls of gene families identification methods, often hampered by multidomain proteins or low levels of sequence similarity. This approach, that allows both inter- and intrachromosomal comparisons, proves to be more sensitive than the classical methods based on pairwise sequence comparisons, particularly in the simultaneous treatment of many species. The automated and fast detection of domain teams is implemented in the DomainTeam software. In this chapter, we describe the procedure to run DomainTeam. After formatting the input and setting up the parameters, running the algorithm produces an output file comprising all the syntenies of domains shared by two or more (sometimes all) of the compared genomes.

14
Zheng Y, Anton BP, Roberts RJ, Kasif S. Phylogenetic detection of conserved gene clusters in microbial genomes. BMC Bioinformatics 2005; 6:243. [PMID: 16202130 PMCID: PMC1266350 DOI: 10.1186/1471-2105-6-243] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Received: 04/12/2005] [Accepted: 10/03/2005] [Indexed: 11/15/2022] [Open Access]
Abstract
Background: Microbial genomes contain an abundance of genes with conserved proximity forming clusters on the chromosome. However, the conservation can be a result of many factors such as vertical inheritance, or functional selection. Thus, identification of conserved gene clusters that are under functional selection provides an effective channel for gene annotation, microarray screening, and pathway reconstruction. The problem of devising a robust method to identify these conserved gene clusters and to evaluate the significance of the conservation in multiple genomes has a number of implications for comparative, evolutionary and functional genomics as well as synthetic biology.
Results: In this paper we describe a new method for detecting conserved gene clusters that incorporates the information captured by a genome phylogenetic tree. We show that our method can overcome the common problem of overestimation of significance due to the bias in the genome database and thereby achieve better accuracy when detecting functionally connected gene clusters. Our results can be accessed at the database GeneChords.
Conclusion: The methodology described in this paper gives a scalable framework for discovering conserved gene clusters in microbial genomes. It serves as a platform for many other functional genomic analyses in microorganisms, such as operon prediction, regulatory site prediction, functional annotation of genes, evolutionary origin and development of gene clusters.
Affiliation(s)
- Yu Zheng
  - Bioinformatics Graduate Program, Boston University, Boston, MA, USA
- Brian P Anton
  - Bioinformatics Graduate Program, Boston University, Boston, MA, USA
  - New England Biolabs, Beverly, MA, USA
- Simon Kasif
  - Bioinformatics Graduate Program, Boston University, Boston, MA, USA
  - Department of Biomedical Engineering, Boston University, Boston, MA, USA
  - Center for Advanced Genomic Technology, Boston University, Boston, MA, USA

15
Ravagnani A, Finan CL, Young M. A novel firmicute protein family related to the actinobacterial resuscitation-promoting factors by non-orthologous domain displacement. BMC Genomics 2005; 6:39. [PMID: 15774001 PMCID: PMC1084345 DOI: 10.1186/1471-2164-6-39] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.1] [Received: 09/21/2004] [Accepted: 03/17/2005] [Indexed: 11/10/2022] [Open Access]
Abstract
BACKGROUND: In Micrococcus luteus growth and resuscitation from starvation-induced dormancy is controlled by the production of a secreted growth factor. This autocrine resuscitation-promoting factor (Rpf) is the founder member of a family of proteins found throughout and confined to the actinobacteria (high G + C Gram-positive bacteria). The aim of this work was to search for and characterise a cognate gene family in the firmicutes (low G + C Gram-positive bacteria) and obtain information about how they may control bacterial growth and resuscitation.
RESULTS: In silico analysis of the accessory domains of the Rpf proteins permitted their classification into several subfamilies. The RpfB subfamily is related to a group of firmicute proteins of unknown function, represented by YabE of Bacillus subtilis. The actinobacterial RpfB and firmicute YabE proteins have very similar domain structures and genomic contexts, except that in YabE, the actinobacterial Rpf domain is replaced by another domain, which we have called Sps. Although totally unrelated in both sequence and secondary structure, the Rpf and Sps domains fulfil the same function. We propose that these proteins have undergone "non-orthologous domain displacement", a phenomenon akin to "non-orthologous gene displacement" that has been described previously. Proteins containing the Sps domain are widely distributed throughout the firmicutes and they too fall into a number of distinct subfamilies. Comparative analysis of the accessory domains in the Rpf and Sps proteins, together with their weak similarity to lytic transglycosylases, provide clear evidence that they are muralytic enzymes.
CONCLUSIONS: The results indicate that the firmicute Sps proteins and the actinobacterial Rpf proteins are cognate and that they control bacterial culturability via enzymatic modification of the bacterial cell envelope.
Affiliation(s)
- Adriana Ravagnani
- Institute of Biological Sciences, University of Wales, Aberystwyth, Ceredigion SY23 3DD, UK
- Christopher L Finan
- Institute of Biological Sciences, University of Wales, Aberystwyth, Ceredigion SY23 3DD, UK
- Michael Young
- Institute of Biological Sciences, University of Wales, Aberystwyth, Ceredigion SY23 3DD, UK
|
16
|
Korbel JO, Jensen LJ, von Mering C, Bork P. Analysis of genomic context: prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat Biotechnol 2005; 22:911-7. [PMID: 15229555 DOI: 10.1038/nbt988] [Citation(s) in RCA: 136] [Impact Index Per Article: 7.2] [Indexed: 11/08/2022]
Abstract
Several widely used methods for predicting functional associations between proteins are based on the systematic analysis of genomic context. Efforts are ongoing to improve these methods and to search for novel aspects in genomes that could be exploited for function prediction. Here, we use gene expression data to demonstrate two functional implications of genome organization: first, chromosomal proximity indicates gene coregulation in prokaryotes independent of relative gene orientation; and second, adjacent bidirectionally transcribed genes (that is, 'divergently' organized coding regions) with conserved gene orientation are strongly coregulated. We further demonstrate that such bidirectionally transcribed gene pairs are functionally associated and derive from this a novel genomic context method that reliably predicts links between >2,500 pairs of genes in approximately 100 species. Around 650 of these functional associations are supported by other genomic context methods. In most instances, one gene encodes a transcriptional regulator, and the other a nonregulatory protein. In-depth analysis in Escherichia coli shows that the vast majority of these regulators both control transcription of the divergently transcribed target gene/operon and auto-regulate their own biosynthesis. The method thus enables the prediction of target processes and regulatory features for several hundred transcriptional regulators.
Affiliation(s)
- Jan O Korbel
- European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117 Heidelberg, Germany
|
17
|
Wang H, Cronan JE. Only One of the Two Annotated Lactococcus lactis fabG Genes Encodes a Functional β-Ketoacyl−Acyl Carrier Protein Reductase. Biochemistry 2004; 43:11782-9. [PMID: 15362862 DOI: 10.1021/bi0487600] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.2] [Indexed: 11/29/2022]
Abstract
The small genome of the Gram-positive bacterium Lactococcus lactis ssp. lactis IL1403 contains two genes that encode proteins annotated as homologues of Escherichia coli beta-ketoacyl-acyl carrier protein (ACP) reductase. E. coli fabG encodes beta-ketoacyl-ACP reductase, the enzyme responsible for the first reductive step of the fatty acid synthetic cycle. Both of the L. lactis genes are adjacent to (and predicted to be cotranscribed with) other genes that encode proteins having homology to known fatty acid synthetic enzymes. Such relationships have often been used to strengthen annotations based on sequence alignments. Annotation in the case of beta-ketoacyl-ACP reductase is particularly problematic because the protein is a member of a vast protein family, the short-chain alcohol dehydrogenase/reductase (SDR) family. The recent isolation of an E. coli fabG mutant strain encoding a conditionally active beta-ketoacyl-ACP reductase allowed physiological and biochemical testing of the putative L. lactis homologues. We report that expression of only one of the two L. lactis proteins (that annotated as FabG1) allows growth of the E. coli fabG strain under nonpermissive conditions and restores in vitro fatty acid synthetic ability to extracts of the mutant strain. Therefore, like E. coli, L. lactis has a single beta-ketoacyl-ACP reductase active with substrates of all fatty acid chain lengths. The second protein (annotated as FabG2), although inactive in fatty acid synthesis both in vivo and in vitro, was highly active in reduction of the model substrate, beta-ketobutyryl-CoA. As expected from work on the E. coli enzyme, the FabG1 beta-ketobutyryl-CoA reductase activity was inhibited by ACP (which blocks access to the active site) whereas the activity of FabG2 was unaffected by the presence of ACP. These results seem to be an example of a gene duplication event followed by divergence of one copy of the gene to encode a protein having a new function.
Affiliation(s)
- Haihong Wang
- Department of Microbiology, University of Illinois, Urbana, Illinois 61801, USA
|
18
|
Wong P, Houry WA. Chaperone networks in bacteria: analysis of protein homeostasis in minimal cells. J Struct Biol 2004; 146:79-89. [PMID: 15037239 DOI: 10.1016/j.jsb.2003.11.006] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.1] [Received: 08/15/2003] [Revised: 10/30/2003] [Indexed: 12/01/2022]
Abstract
The prevention of aberrant behavior of proteins is fundamental to cellular life. Protein homeostatic processes are present in cells to stabilize protein conformations, refold misfolded proteins, and degrade proteins that might be detrimental to the cell. Molecular chaperones and proteases perform a major role in these processes. In bacteria, the main cytoplasmic components involved in protein homeostasis include the chaperones trigger factor, DnaK/DnaJ/GrpE, GroEL/GroES, HtpG, as well as ClpB and the proteases ClpXP, ClpAP, HslUV, Lon, and FtsH. Based on recent genome sequencing efforts, it was surprising to find that the Mycoplasma, a genus proposed to include a minimal form of cellular life, do not contain certain major members of the protein homeostatic network, including GroEL/GroES. We propose that, in mycoplasmas, there has been a fundamental shift towards favoring processes that promote protein degradation rather than protein folding. The arguments are based on two different premises: (1) the regulation of stress response in Mycoplasma and (2) the unique characteristics of the Mycoplasma proteome.
Affiliation(s)
- Philip Wong
- Department of Biochemistry, University of Toronto, 1 King's College Circle, Medical Sciences Building, Toronto, Ont., Canada M5S 1A8
|
19
|
Simeonidis E, Rison SCG, Thornton JM, Bogle IDL, Papageorgiou LG. Analysis of metabolic networks using a pathway distance metric through linear programming. Metab Eng 2003; 5:211-9. [PMID: 12948755 DOI: 10.1016/s1096-7176(03)00043-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Indexed: 12/13/2022]
Abstract
The solution of the shortest path problem in biochemical systems constitutes an important step for studies of their evolution. In this paper, a linear programming (LP) algorithm for calculating minimal pathway distances in metabolic networks is studied. Minimal pathway distances are identified as the smallest number of metabolic steps separating two enzymes in metabolic pathways. The algorithm deals effectively with circularity and reaction directionality. The applicability of the algorithm is illustrated by calculating the minimal pathway distances for Escherichia coli small molecule metabolism enzymes, and then considering their correlations with genome distance (distance separating two genes on a chromosome) and enzyme function (as characterised by enzyme commission number). The results illustrate the effectiveness of the LP model. In addition, the data confirm that propinquity of genes on the genome implies similarity in function (as determined by co-involvement in the same region of the metabolic network), but suggest that no correlation exists between pathway distance and enzyme function. These findings offer insight into the probable mechanism of pathway evolution.
Affiliation(s)
- Evangelos Simeonidis
- Department of Chemical Engineering, Centre for Process Systems Engineering, UCL, London, WC1E 7JE, UK
|
20
|
Abstract
Genomic clustering of genes in a pathway is commonly found in prokaryotes due to transcriptional operons, but these are not present in most eukaryotes. Yet, there might be clustering to a lesser extent of pathway members in eukaryotic genomes, that assist coregulation of a set of functionally cooperating genes. We analyzed five sequenced eukaryotic genomes for clustering of genes assigned to the same pathway in the KEGG database. Between 98% and 30% of the analyzed pathways in a genome were found to exhibit significantly higher clustering levels than expected by chance. In descending order by the level of clustering, the genomes studied were Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster. Surprisingly, there is not much agreement between genomes in terms of which pathways are most clustered. Only seven of 69 pathways found in all species were significantly clustered in all five of them. This species-specific pattern of pathway clustering may reflect adaptations or evolutionary events unique to a particular lineage. We note that although operons are common in C. elegans, only 58% of the pathways showed significant clustering, which is less than in human. Virtually all pathways in S. cerevisiae showed significant clustering.
Affiliation(s)
- Jennifer M Lee
- Center for Genomics and Bioinformatics, Karolinska Institutet, S171 77 Stockholm, Sweden
|
21
|
Frishman D, Mokrejs M, Kosykh D, Kastenmüller G, Kolesov G, Zubrzycki I, Gruber C, Geier B, Kaps A, Albermann K, Volz A, Wagner C, Fellenberg M, Heumann K, Mewes HW. The PEDANT genome database. Nucleic Acids Res 2003; 31:207-11. [PMID: 12519983 PMCID: PMC165452 DOI: 10.1093/nar/gkg005] [Citation(s) in RCA: 96] [Impact Index Per Article: 4.6] [Indexed: 11/13/2022] Open
Abstract
The PEDANT genome database (http://pedant.gsf.de) provides exhaustive automatic analysis of genomic sequences by a large variety of established bioinformatics tools through a comprehensive Web-based user interface. One hundred and seventy seven completely sequenced and unfinished genomes have been processed so far, including large eukaryotic genomes (mouse, human) published recently. In this contribution, we describe the current status of the PEDANT database and novel analytical features added to the PEDANT server in 2002. Those include: (i) integration with the BioRS data retrieval system which allows fast text queries, (ii) pre-computed sequence clusters in each complete genome, (iii) a comprehensive set of tools for genome comparison, including genome comparison tables and protein function prediction based on genomic context, and (iv) computation and visualization of protein-protein interaction (PPI) networks based on experimental data. The availability of functional and structural predictions for 650 000 genomic proteins in well organized form makes PEDANT a useful resource for both functional and structural genomics.
Affiliation(s)
- Dmitrij Frishman
- Institute for Bioinformatics, GSF - National Research Center for Environment and Health, Ingolstädter Landstrasse 1, 85764 Neuherberg, Germany
|
22
|
Abstract
Small-molecule metabolism forms the core of the metabolic processes of all living organisms. As early as 1945, possible mechanisms for the evolution of such a complex metabolic system were considered. The problem is to explain the appearance and development of a highly regulated complex network of interacting proteins and substrates from a limited structural and functional repertoire. By permitting the co-analysis of phylogeny and metabolism, the combined exploitation of pathway and structural databases, as well as the use of multiple-sequence alignment search algorithms, sheds light on this problem. Much of the current research suggests a chemistry-driven 'patchwork' model of pathway evolution, but other mechanisms may play a role. In the future, as metabolic structure and sequence space are further explored, it should become easier to trace the finer details of pathway development and understand how complexity has evolved.
Affiliation(s)
- Stuart C G Rison
- Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
|
23
|
Rogozin IB, Makarova KS, Murvai J, Czabarka E, Wolf YI, Tatusov RL, Szekely LA, Koonin EV. Connected gene neighborhoods in prokaryotic genomes. Nucleic Acids Res 2002; 30:2212-23. [PMID: 12000841 PMCID: PMC115289 DOI: 10.1093/nar/30.10.2212] [Citation(s) in RCA: 130] [Impact Index Per Article: 5.9] [Indexed: 11/13/2022] Open
Abstract
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon 'genomic hitchhiking'. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.
Affiliation(s)
- Igor B Rogozin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
24
|
Rison SCG, Teichmann SA, Thornton JM. Homology, pathway distance and chromosomal localization of the small molecule metabolism enzymes in Escherichia coli. J Mol Biol 2002; 318:911-32. [PMID: 12054833 DOI: 10.1016/s0022-2836(02)00140-7] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.0] [Indexed: 11/22/2022]
Abstract
Here, we analyse Escherichia coli enzymes involved in small molecule metabolism (SMM). We introduce the concept of pathway distance as a measure of the number of distinct metabolic steps separating two SMM enzymes, and we consider protein homology (as determined by assigning enzymes to structural and sequence families) and gene interval (the number of genes separating two genes on the E. coli chromosome). The relationships between these three contexts (pathway distance, homology and chromosomal localisation) are investigated extensively. We make use of these relationships to suggest possible SMM evolution mechanisms. Homology between enzyme pairs close in the SMM was higher than expected by chance but was still rare. When observed, homologues usually conserved their reaction mechanism and/or co-factor binding rather than shared substrate binding. The correlation between pathway distance and gene intervals was clear. Enzymes catalysing nearby SMM reactions were usually encoded by genes close by on the E. coli chromosome. We found many co-regulated blocks of three to four genes (usually non-homologous) encoding enzymes occurring within four metabolic steps of one another; nearly all of these blocks formed part of known or predicted operons. The "inline reuse" of enzymes (i.e. the use of the same enzyme to catalyse two or more different steps of a metabolic pathway) is also discussed: of these enzymes, four were multifunctional (i.e. catalysed a different reaction in each instance), nine had multiple substrate specificity (i.e. catalysed the same reaction on different substrates in each instance) and one catalysed the same reaction on the same substrate but as part of two different complexes. We also identified 59 sets of isozymic proteins most commonly duplicated to function under different conditions, or with a different preferred substrate or minor substrate. In addition to transcriptional units, isozymes and inline reuse of enzymes provide mechanisms for controlling the SMM network. Our data suggest that several pathway evolution mechanisms may occur in concert, although chemistry-driven duplication/recruitment is favoured. SMM exploits regulatory strategies involving chromosomal location, isozymes and the reuse of enzymes.
Affiliation(s)
- Stuart C G Rison
- Department of Biochemistry and Molecular Biology, University College London, Darwin Building, Gower Street, London WC1E 6BT, UK
|
25
|
Joachimiak MP, Cohen FE. JEvTrace: refinement and variations of the evolutionary trace in JAVA. Genome Biol 2002; 3:RESEARCH0077. [PMID: 12537566 PMCID: PMC151179 DOI: 10.1186/gb-2002-3-12-research0077] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.7] [Received: 04/24/2002] [Revised: 07/11/2002] [Accepted: 10/21/2002] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Details of functional speciation within gene families can be difficult to identify using standard multiple sequence alignment (MSA) methods. The evolutionary trace (ET) was developed as a visualization tool to combine MSA, phylogenetic and structural data for identification of functional sites in proteins. The method has been successful in extracting evolutionary details of functional surfaces in a number of biological systems and modifications of the method are useful in creating hypotheses about the function of previously unannotated genes. We wish to facilitate the graphical interpretation of disparate data types through the creation of flexible software implementations. RESULTS We have implemented the ET method in a JAVA graphical interface, JEvTrace. Users can analyze and visualize ET input and output with respect to protein phylogeny, sequence and structure. Function discovery with JEvTrace is demonstrated on two proteins with recently determined crystal structures: YlxR from Streptococcus pneumoniae with a predicted RNA-binding function, and a Haemophilus influenzae protein of unknown function, YbaK. To facilitate analysis and storage of results we propose a MSA coloring data structure. The sequence coloring format readily captures evolutionary, biological, functional and structural features of MSAs. CONCLUSIONS Protein families and phylogeny represent complex data with statistical outliers and special cases. The JEvTrace implementation of the ET method allows detailed mining and graphical visualization of evolutionary sequence relationships.
Affiliation(s)
- Marcin P Joachimiak
- Graduate Group in Biophysics, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Fred E Cohen
- Graduate Group in Biophysics, University of California San Francisco, San Francisco, CA 94143-0450, USA
- Department of Cellular and Molecular Pharmacology, University of California San Francisco, San Francisco, CA 94143-0450, USA
|