1
Proteogenomic Analysis Provides Novel Insight into Genome Annotation and Nitrogen Metabolism in Nostoc sp. PCC 7120. Microbiol Spectr 2021; 9:e0049021. [PMID: 34523988 PMCID: PMC8557916 DOI: 10.1128/spectrum.00490-21] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 02/08/2023]
Abstract
Cyanobacteria, capable of oxygenic photosynthesis, play a vital role in nitrogen and carbon cycles. Nostoc sp. PCC 7120 (Nostoc 7120) is a model cyanobacterium commonly used to study cell differentiation and nitrogen metabolism. Although its genome was released in 2002, a high-quality genome annotation remains unavailable for this model cyanobacterium. Therefore, in this study, we performed an in-depth proteogenomic analysis based on high-resolution mass spectrometry (MS) data to refine the genome annotation of Nostoc 7120. We unambiguously identified 5,519 predicted protein-coding genes and revealed 26 novel genes, 75 revised genes, and 27 different kinds of posttranslational modifications in Nostoc 7120. A subset of these novel proteins were further validated at both the mRNA and peptide levels. Functional analysis suggested that many newly annotated proteins may participate in nitrogen or cadmium/mercury metabolism in Nostoc 7120. Moreover, we constructed an updated Nostoc 7120 database based on our proteogenomic results and presented examples of how the updated database could be used to improve the annotation of proteomic data. Our study provides the most comprehensive annotation of the Nostoc 7120 genome thus far and will serve as a valuable resource for the study of nitrogen metabolism in Nostoc 7120. IMPORTANCE Cyanobacteria are a large group of prokaryotes capable of oxygenic photosynthesis and play a vital role in nitrogen and carbon cycles on Earth. Nostoc 7120 is a commonly used model cyanobacterium for studying cell differentiation and nitrogen metabolism. In this study, we presented the first comprehensive draft map of the Nostoc 7120 proteome and a wide range of posttranslational modifications. In addition, we constructed an updated database of Nostoc 7120 based on our proteogenomic results and presented examples of how the updated database could be used for system-level studies of Nostoc 7120. 
Our study provides the most comprehensive annotation of the Nostoc 7120 genome and a valuable resource for the study of nitrogen metabolism in this model cyanobacterium.
2
Abstract
The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.
Affiliation(s)
- J W Fickett
- Theoretical Biology and Biophysics Group, MS K710, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
3
Abstract
SUMMARY Searching translated, unannotated genomic DNA sequences against protein databases is a useful early-stage method for discovering protein homologues encoded by the sequence, but it generates huge amounts of output data that quickly become impenetrable. BlastXtract is a web-based tool for managing and visualizing results from large translated BLAST and FastA searches. It combines the speed and storage benefits of relational database management systems with an easy-to-use graphical navigation map, and greatly facilitates the early exploration of genomic sequence. AVAILABILITY BlastXtract can be downloaded from http://bioinfo.ucc.ie/blastxtract/.
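To illustrate the idea of keeping translated-search hits in a relational database, here is a minimal sketch using Python's built-in sqlite3 module. It is not BlastXtract's actual schema or code: the table name `hit`, its columns, and the helper functions `build_hit_store` and `hits_in_region` are all hypothetical, chosen only to show how an indexed table lets a region of a genomic sequence be queried quickly.

```python
import sqlite3


def build_hit_store(hits):
    """Load (query_id, subject_id, frame, q_start, q_end, evalue) tuples
    into an in-memory SQLite table. The schema is a hypothetical example."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE hit (
               query_id   TEXT,
               subject_id TEXT,
               frame      INTEGER,
               q_start    INTEGER,
               q_end      INTEGER,
               evalue     REAL)"""
    )
    # Index on query coordinates so region lookups avoid a full table scan.
    conn.execute("CREATE INDEX idx_hit_region ON hit (query_id, q_start, q_end)")
    conn.executemany("INSERT INTO hit VALUES (?, ?, ?, ?, ?, ?)", hits)
    conn.commit()
    return conn


def hits_in_region(conn, query_id, start, end, max_evalue=1e-5):
    """Return hits overlapping [start, end] on query_id, best E-value first."""
    rows = conn.execute(
        """SELECT subject_id, frame, q_start, q_end, evalue
           FROM hit
           WHERE query_id = ? AND q_end >= ? AND q_start <= ? AND evalue <= ?
           ORDER BY evalue""",
        (query_id, start, end, max_evalue),
    )
    return rows.fetchall()


# Made-up example hits, for illustration only.
example_hits = [
    ("contig1", "P1", 1, 100, 400, 1e-30),
    ("contig1", "P2", -2, 5000, 5600, 1e-8),
    ("contig1", "P3", 3, 150, 300, 0.5),
]
conn = build_hit_store(example_hits)
print(hits_in_region(conn, "contig1", 1, 1000))
```

A graphical navigation map of the kind the abstract describes could then be drawn from the rows returned for whichever window of the sequence is being viewed.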
Affiliation(s)
- Marcus J Claesson
- Alimentary Pharmabiotic Centre and Department of Microbiology, National University of Ireland, Cork, Ireland.
4
5
Abstract
Gene identification, also known as gene finding or gene recognition, is among the important problems of molecular biology that have been receiving increasing attention with the advent of large scale sequencing projects. Previous strategies for solving this problem can be categorized into essentially two schools of thought: one school employs sequence composition statistics, whereas the other relies on database similarity searches. In this paper, we propose a new gene identification scheme that combines the best characteristics from each of these two schools. In particular, our method determines gene candidates among the ORFs that can be identified in a given DNA strand through the use of the Bio-Dictionary, a database of patterns that covers essentially all of the currently available sample of the natural protein sequence space. Our approach relies entirely on the use of redundant patterns as the agents on which the presence or absence of genes is predicated and does not employ any additional evidence, e.g. ribosome-binding site signals. The Bio-Dictionary Gene Finder (BDGF), the algorithm's implementation, is a single computational engine able to handle the gene identification task across distinct archaeal and bacterial genomes. The engine exhibits performance that is characterized by simultaneous very high values of sensitivity and specificity, and a high percentage of correctly predicted start sites. Using a collection of patterns derived from an old (June 2000) release of the Swiss-Prot/TrEMBL database that contained 451 602 proteins and fragments, we demonstrate our method's generality and capabilities through an extensive analysis of 17 complete archaeal and bacterial genomes. Examples of previously unreported genes are also shown and discussed in detail.
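The abstract takes as its starting point the ORFs that can be identified in a given DNA strand. The following is a minimal sketch of that preliminary step only, not of the Bio-Dictionary pattern matching itself. The function name `find_orfs`, the restriction to ATG start codons, and the `min_codons` threshold are assumptions made for illustration.

```python
from typing import Iterator, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
# Assumption: only ATG is treated as a start codon here; prokaryotes also
# use alternative starts such as GTG and TTG.
START_CODONS = {"ATG"}


def find_orfs(strand: str, min_codons: int = 30) -> Iterator[Tuple[int, int, int]]:
    """Yield (frame, start, end) for ORFs in the three frames of one strand.

    start is the 0-based index of the start codon and end is the index just
    past the stop codon. The first start codon after the previous stop is
    used, so each reported ORF is the longest one ending at its stop codon.
    """
    seq = strand.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length in codons, excluding the stop codon.
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None


# Toy sequence with a single short ORF (ATG AAA TTT TAG) in frame 2.
print(list(find_orfs("CCATGAAATTTTAGGG", min_codons=1)))
```

Running the same function on the reverse complement would cover the opposite strand. A gene finder then has to decide which of these candidate ORFs are real genes, which is where the pattern-based evidence described in the abstract comes in.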
Affiliation(s)
- Tetsuo Shibuya
- Exploratory Technology, IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
6
Bocs S, Danchin A, Médigue C. Re-annotation of genome microbial coding-sequences: finding new genes and inaccurately annotated genes. BMC Bioinformatics 2002; 3:5. [PMID: 11879526 PMCID: PMC77393 DOI: 10.1186/1471-2105-3-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.5] [Received: 09/18/2001] [Accepted: 02/05/2002] [Indexed: 11/21/2022]
Abstract
BACKGROUND Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is imbedded in the approach. RESULTS We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. 
Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular update and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).
Affiliation(s)
- Stéphanie Bocs
- Laboratoire Génome et Informatique, Université de Versailles, 91034 Evry Cedex, France
- Antoine Danchin
- HKU-Pasteur Research Center, Pokfulam, Hong-Kong
- Génétique des Génomes Bactériens, Institut Pasteur, 75724 Paris Cedex 15, France
- Claudine Médigue
- Génétique des Génomes Bactériens, Institut Pasteur, 75724 Paris Cedex 15, France
7
Abstract
In the process of analysing the four available complete archaeal genomes, we have noted that certain regions characterised as 'non-coding' exhibit significant sequence similarity to other protein sequences from Archaea and other species. Using established technology, we have identified a number of potential protein coding regions in these putative 'non-coding' regions. We have detected 524 such cases, of which 113 regions appear to code for proteins present in archaeal or other species, while the remaining 411 regions are mostly start/stop definition conflicts. Of the 113 protein coding regions, only 21 code for proteins with homologues of known function. The number of novel coding sequences identified herein amounts to 1.5% of the total genome entries, while the conflicting cases represent an additional 5%. The observed differences between the four complete archaeal genomes seem to reflect disparate approaches to genome annotation. Genome sequence collections should be regularly checked to improve gene prediction by sequence similarity and greater effort is required to make gene definitions consistent across related species.
Affiliation(s)
- S Raghavan
- Computational Genomics Group, Research Programme, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
8
Hayes WS, Borodovsky M. How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 1998; 8:1154-71. [PMID: 9847079 DOI: 10.1101/gr.8.11.1154] [Citation(s) in RCA: 92] [Impact Index Per Article: 3.5] [Indexed: 11/24/2022]
Abstract
In this report we address the problem of accurate statistical modeling of DNA sequences, either coding or noncoding, for a bacterial species whose genome (or a large portion) was sequenced but not yet characterized experimentally. Availability of these models is critical for successful solution of the genome annotation task by statistical methods of gene finding. We present the method, GeneMark-Genesis, which learns the parameters of Markov models of protein-coding and noncoding regions from anonymous bacterial genomic sequence. These models are subsequently used in the GeneMark and GeneMark.hmm gene-finding programs. Although there is basically one model of a noncoding region for a given genome, several models of protein-coding region are automatically obtained by GeneMark-Genesis. The diversity of protein-coding models reflects the diversity of oligonucleotide compositions, particularly the diversity of codon usage strategies observed in genes from one and the same genome. In the simplest and the most important case, there are just two gene models: typical and atypical. We show that the atypical model allows one to predict genes that escape identification by the typical model. Many genes predicted by the atypical model appear to be horizontally transferred genes. The early versions of GeneMark-Genesis were used for annotating the genomes of Methanococcus jannaschii and Helicobacter pylori. We report the results of accuracy testing of the full-scale version of GeneMark-Genesis on 10 completely sequenced bacterial genomes. Interestingly, the GeneMark.hmm program that employed the typical and atypical models defined by GeneMark-Genesis was able to predict 683 new atypical genes, with 176 of them confirmed by similarity search.
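The general idea of scoring a sequence with separate Markov models for coding and noncoding DNA can be sketched as follows. This is a deliberately simplified illustration and not the GeneMark-Genesis procedure: it assumes a homogeneous first-order model with add-one smoothing, whereas the abstract does not specify the model order or training details. The names `train_markov` and `log_odds`, and the toy training sequences, are hypothetical.

```python
import math
from typing import Dict, Iterable

ALPHABET = "ACGT"
Model = Dict[str, Dict[str, float]]


def train_markov(seqs: Iterable[str], pseudocount: float = 1.0) -> Model:
    """Estimate first-order transition probabilities P(next | prev)."""
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for s in seqs:
        s = s.upper()
        for prev, nxt in zip(s, s[1:]):
            # Skip ambiguous symbols such as N.
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    model: Model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_odds(seq: str, coding: Model, noncoding: Model) -> float:
    """Sum of log(P_coding / P_noncoding) over the transitions in seq.

    A positive score means the coding model explains seq better.
    """
    seq = seq.upper()
    score = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        if prev in coding and nxt in coding[prev]:
            score += math.log(coding[prev][nxt] / noncoding[prev][nxt])
    return score


# Toy training data, purely illustrative: GC-rich "coding", AT-rich "noncoding".
coding_model = train_markov(["GCGCGCGCGC"])
noncoding_model = train_markov(["ATATATATAT"])
print(log_odds("GCGC", coding_model, noncoding_model) > 0)
print(log_odds("ATAT", coding_model, noncoding_model) < 0)
```

Training a second coding model on a different subset of genes and scoring against both would mirror, in spirit, the typical and atypical gene models mentioned in the abstract.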
Affiliation(s)
- W S Hayes
- School of Biology, Georgia Institute of Technology, Atlanta, Georgia 30332-0230, USA
9
Affiliation(s)
- J W Fickett
- SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania, USA
10
Frishman D, Mironov A, Mewes HW, Gelfand M. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res 1998; 26:2941-7. [PMID: 9611239 PMCID: PMC147632 DOI: 10.1093/nar/26.12.2941] [Citation(s) in RCA: 141] [Impact Index Per Article: 5.4] [Indexed: 11/14/2022]
Abstract
Analysis of a newly sequenced bacterial genome starts with identification of protein-coding genes. Functional assignment of proteins requires the exact knowledge of protein N-termini. We present a new program ORPHEUS that identifies candidate genes and accurately predicts gene starts. The analysis starts with a database similarity search and identification of reliable gene fragments. The latter are used to derive statistical characteristics of protein-coding regions and ribosome-binding sites and to predict the complete set of genes in the analyzed genome. In a test on Bacillus subtilis and Escherichia coli genomes, the program correctly identified 93.3% (resp. 96.3%) of experimentally annotated genes longer than 100 codons described in the PIR-International database, and for these genes 96.3% (83.9%) of starts were predicted exactly. Furthermore, 98.9% (99.1%) of genes longer than 100 codons annotated in GenBank were found, and 92.9% (75.7%) of predicted starts coincided with the feature table description. Finally, for the complete gene complements of B. subtilis and E. coli, including genes shorter than 100 codons, gene prediction accuracy was 88.9 and 87.1%, respectively, with 94.2 and 76.7% starts coinciding with the existing annotation.
Affiliation(s)
- D Frishman
- Munich Information Center for Protein Sequences (MIPS) of the German National Center for Health and Environment (GSF), Am Klopferspitz 18a, 82152 Martinsried, Germany.
11
12
Smith DR, Richterich P, Rubenfield M, Rice PW, Butler C, Lee HM, Kirst S, Gundersen K, Abendschan K, Xu Q, Chung M, Deloughery C, Aldredge T, Maher J, Lundstrom R, Tulig C, Falls K, Imrich J, Torrey D, Engelstein M, Breton G, Madan D, Nietupski R, Seitz B, Connelly S, McDougall S, Safer H, Gibson R, Doucette-Stamm L, Eiglmeier K, Bergh S, Cole ST, Robison K, Richterich L, Johnson J, Church GM, Mao JI. Multiplex sequencing of 1.5 Mb of the Mycobacterium leprae genome. Genome Res 1997; 7:802-19. [PMID: 9267804 DOI: 10.1101/gr.7.8.802] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
Abstract
The nucleotide sequence of 1.5 Mb of genomic DNA from Mycobacterium leprae was determined using computer-assisted multiplex sequencing technology. This brings the 2.8-Mb M. leprae genome sequence to approximately 66% completion. The sequences, derived from 43 recombinant cosmids, contain 1046 putative protein-coding genes, 44 repetitive regions, 3 rRNAs, and 15 tRNAs. The gene density of one per 1.4 kb is slightly lower than that of Mycoplasma (1.2 kb). Of the protein coding genes, 44% have significant matches to genes with well-defined functions. Comparison of 1157 M. leprae and 1564 Mycobacterium tuberculosis proteins shows a complex mosaic of homologous genomic blocks with up to 22 adjacent proteins in conserved map order. Matches to known enzymatic, antigenic, membrane, cell wall, cell division, multidrug resistance, and virulence proteins suggest therapeutic and vaccine targets. Unusual features of the M. leprae genome include large polyketide synthase (pks) operons, inteins, and highly fragmented pseudogenes.
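The gene density quoted in the abstract can be reproduced with a short calculation from the figures it reports. The snippet below is only an arithmetic check. The variable names are arbitrary, and it assumes 1 Mb = 1000 kb and that the density was computed from the 1.5 Mb of newly determined sequence and the 1046 putative protein-coding genes.

```python
# Figures taken from the abstract.
sequenced_kb = 1500.0        # 1.5 Mb of newly determined sequence
protein_coding_genes = 1046  # putative protein-coding genes found in it

# Average spacing between genes, in kb per gene.
kb_per_gene = sequenced_kb / protein_coding_genes
print(f"{kb_per_gene:.2f} kb per gene")  # prints "1.43 kb per gene"
```

The result rounds to the "one per 1.4 kb" stated in the abstract.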
Affiliation(s)
- D R Smith
- Genome Therapeutics Corporation, Collaborative Research Division, Waltham, Massachusetts 02154, USA
13
14
Nölling J, Reeve JN. Growth- and substrate-dependent transcription of the formate dehydrogenase (fdhCAB) operon in Methanobacterium thermoformicicum Z-245. J Bacteriol 1997; 179:899-908. [PMID: 9006048 PMCID: PMC178775 DOI: 10.1128/jb.179.3.899-908.1997] [Citation(s) in RCA: 40] [Impact Index Per Article: 1.5] [Indexed: 02/03/2023]
Abstract
The formate dehydrogenase-encoding fdhCAB operon and flanking genes have been cloned and sequenced from Methanobacterium thermoformicicum Z-245. fdh transcription was shown to be initiated 21 bp upstream from fdhC, although most fdh transcripts terminated or were processed between fdhC and fdhA. The resulting fdhC, fdhAB, and fdhCAB transcripts were present at all growth stages in cells growing on formate but were barely detectable during early exponential growth on H2 plus CO2. The levels of the fdh transcripts did, however, increase dramatically in cells growing on H2 plus CO2, coincident with the decrease in the growth rate and the onset of constant methanogenesis that occurred when culture densities reached an optical density at 600 nm of approximately 0.5. The mth transcript that encodes the H2-dependent methenyl-H4 MPT reductase (MTH) and the frh and mvh transcripts that encode the coenzyme F420-reducing (FRH) and nonreducing (MVH) hydrogenases, respectively, were also present in cells growing on formate, consistent with the synthesis of three hydrogenases, MTH, FRH, and MVH, in the absence of exogenously supplied H2. Reducing the H2 supply to M. thermoformicicum cells growing on H2 plus CO2 reduced the growth rate and CH4 production but increased frh and fdh transcription and also increased transcription of the mtd, mer, and mcr genes that encode enzymes that catalyze steps 4, 5, and 7, respectively, in the pathway of CO2 reduction to CH4. Reducing the H2 supply to a level insufficient for growth resulted in the disappearance of all methane gene transcripts except the mcr transcript, which increased. Regions flanking the fdhCAB operon in M. thermoformicicum Z-245 were used as probes to clone the homologous region from the Methanobacterium thermoautotrophicum deltaH genome. Sequencing revealed the presence of very similar genes except that the genome of M. thermoautotrophicum, a methanogen incapable of growth on formate, lacked the fdhCAB operon.
Affiliation(s)
- J Nölling
- Department of Microbiology, The Ohio State University, Columbus 43210, USA
15
Koonin EV, Mushegian AR. Complete genome sequences of cellular life forms: glimpses of theoretical evolutionary genomics. Curr Opin Genet Dev 1996; 6:757-62. [PMID: 8994848 DOI: 10.1016/s0959-437x(96)80032-3] [Citation(s) in RCA: 49] [Impact Index Per Article: 1.8] [Indexed: 02/03/2023]
Abstract
The availability of complete genome sequences of cellular life forms creates the opportunity to explore the functional content of the genomes and evolutionary relationships between them at a new qualitative level. With the advent of these sequences, the construction of a minimal gene set sufficient for sustaining cellular life and reconstruction of the genome of the last common ancestor of bacteria, eukaryotes, and archaea become realistic, albeit challenging, research projects. A version of the minimal gene set for modern-type cellular life derived by comparative analysis of two bacterial genomes, those of Haemophilus influenzae and Mycoplasma genitalium, consists of approximately 250 genes. A comparison of the protein sequences encoded in these genes with those of the proteins encoded in the complete yeast genome suggests that the last common ancestor of all extant life might have had an RNA genome.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA.
16
17
Alvarez CE, Robison K, Gilbert W. Novel Gq alpha isoform is a candidate transducer of rhodopsin signaling in a Drosophila testes-autonomous pacemaker. Proc Natl Acad Sci U S A 1996; 93:12278-82. [PMID: 8901571 PMCID: PMC37981 DOI: 10.1073/pnas.93.22.12278] [Citation(s) in RCA: 19] [Impact Index Per Article: 0.7] [Indexed: 02/02/2023]
Abstract
DGq is the alpha subunit of the heterotrimeric GTPase (G alpha), which couples rhodopsin to phospholipase C in Drosophila vision. We have uncovered three duplicated exons in dgq by scanning the GenBank data base for unrecognized coding sequences. These alternative exons encode sites involved in GTPase activity and G beta-binding, NorpA (phospholipase C)-binding, and rhodopsin-binding. We examined the in vivo splicing of dgq in adult flies and find that, in all but the male gonads, only two isoforms are expressed. One, dgqA, is the original visual isoform and is expressed in eyes, ocelli, brain, and male gonads. The other, dgqB, has the three novel exons and is widely expressed. Remarkably, all three nonvisual B exons are highly similar (82% identity at the amino acid level) to the Gq alpha family consensus, from Caenorhabditis elegans to human, but all three visual A exons are divergent (61% identity). Intriguingly, we have found a third isoform, dgqC, which is specifically and abundantly expressed in male gonads, and shares the divergent rhodopsin-binding exon of dgqA. We suggest that DGqC is a candidate for the light-signal transducer of a testes-autonomous photosensory clock. This proposal is supported by the finding that rhodopsin 2 and arrestin 1, two photoreceptor-cell-specific genes, are also expressed in male gonads.
Affiliation(s)
- C E Alvarez
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
18
Berben G. Nitrobacter winogradskyi cytochrome c oxidase genes are organized in a repeated gene cluster. Antonie Van Leeuwenhoek 1996; 69:305-15. [PMID: 8836428 DOI: 10.1007/bf00399619] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.3] [Indexed: 02/02/2023]
Abstract
Cytochrome c oxidase (EC 1.9.3.1) is one of the components of the electron transport chain by which Nitrobacter, a facultative lithoautotrophic bacterium, recovers energy from nitrite oxidation. The genes encoding the two catalytic core subunits of the enzyme were isolated from a Nitrobacter winogradskyi gene library. Sequencing of one of the 14 cloned DNA segments revealed that the subunit genes are side by side in an operon-like cluster. Remarkably the cluster appears to be present in at least two copies per genome. It extends over a 5-6 kb length including, besides the catalytic core subunit genes, other cytochrome oxidase related genes, especially a heme O synthase gene. Noteworthy is the new kind of gene order identified within the cluster. Deduced sequences for the cytochrome oxidase subunits and for the heme O synthase look closest to their counterparts in other alpha-subdivision Proteobacteria, particularly the Rhizobiaceae. This confirms the phylogenetic relationships established only upon 16S rRNA data. Furthermore, interesting similarities exist between N. winogradskyi and mitochondrial cytochrome oxidase subunits while the heme O synthase sequence gives some new insights about the other similar published alpha-subdivision proteobacterial sequences.
Affiliation(s)
- G Berben
- Laboratoire de Microbiologie, Centre de Recherches Agronomiques, Gembloux, Belgium
19
Jovanovic G, Weiner L, Model P. Identification, nucleotide sequence, and characterization of PspF, the transcriptional activator of the Escherichia coli stress-induced psp operon. J Bacteriol 1996; 178:1936-45. [PMID: 8606168 PMCID: PMC177889 DOI: 10.1128/jb.178.7.1936-1945.1996] [Citation(s) in RCA: 112] [Impact Index Per Article: 4.0] [Indexed: 01/31/2023]
Abstract
The phage shock protein (psp) operon (pspABCE) of Escherichia coli is strongly induced in response to a variety of stressful conditions or agents such as filamentous phage infection, ethanol treatment, osmotic shock, heat shock, and prolonged incubation in stationary phase. Transcription of the psp operon is driven from a sigma54 promoter and stimulated by integration host factor. We report here the identification of a transcriptional activator gene, designated pspF, which controls expression of the psp operon in E. coli. The pspF gene was identified by random miniTn10-tet transposon mutagenesis. Insertion of the transposon into the pspF gene abolished sigma54-dependent induction of the psp operon. The pspF gene is closely linked to the psp operon and is divergently transcribed from one major and two minor sigma70 promoters. pspF encodes a 37-kDa protein which belongs to the enhancer-binding protein family of sigma54 transcriptional activators. PspF contains a catalytic domain, which in other sigma54 activators would be the central domain, and a C-terminal DNA-binding domain but entirely lacks an N-terminal regulatory domain and is constitutively active. The insertion mutant pspF::mTn10-tet (pspF877) encodes a truncated protein (PspF delta HTH) that lacks the DNA-binding helix-turn-helix (HTH) motif. Although the central catalytic domain is intact, PspF delta HTH at physiological concentration cannot activate psp expression. In the absence of inducing stimuli, multicopy-plasmid-borne PspF or PspF delta HTH overcomes repression of the psp operon mediated by the negative regulator PspA.
Affiliation(s)
- G Jovanovic
- Rockefeller University, New York, New York 10021, USA
20
Abstract
The complete sequences of two small bacterial genomes have recently become available, and those of several more species should follow within the next two years. Sequence comparisons show that most bacterial proteins are highly conserved in evolution, allowing predictions to be made about the functions of most products of an uncharacterized genome. Bacterial genomes differ vastly in their gene repertoires. Although genes for components of the translation and transcription machinery, and for molecular chaperones, are typically maintained, many regulatory and metabolic systems are absent in bacteria with small genomes. Mycoplasma genitalium, with the smallest known genome of any cellular life form, lacks virtually all known regulatory genes, and its gene expression may be regulated differently than in other bacteria. Genome organization is evolutionarily labile: extensive gene shuffling leaves only very few conserved gene arrays in distantly related bacteria.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
21
Tatusov RL, Mushegian AR, Bork P, Brown NP, Hayes WS, Borodovsky M, Rudd KE, Koonin EV. Metabolism and evolution of Haemophilus influenzae deduced from a whole-genome comparison with Escherichia coli. Curr Biol 1996; 6:279-91. [PMID: 8805245 DOI: 10.1016/s0960-9822(02)00478-5] [Citation(s) in RCA: 207] [Impact Index Per Article: 7.4] [Indexed: 02/02/2023]
Abstract
BACKGROUND The 1.83 Megabase (Mb) sequence of the Haemophilus influenzae chromosome, the first completed genome sequence of a cellular life form, has been recently reported. Approximately 75 % of the 4.7 Mb genome sequence of Escherichia coli is also available. The life styles of the two bacteria are very different - H. influenzae is an obligate parasite that lives in human upper respiratory mucosa and can be cultivated only on rich media, whereas E. coli is a saprophyte that can grow on minimal media. A detailed comparison of the protein products encoded by these two genomes is expected to provide valuable insights into bacterial cell physiology and genome evolution. RESULTS We describe the results of computer analysis of the amino-acid sequences of 1703 putative proteins encoded by the complete genome of H. influenzae. We detected sequence similarity to proteins in current databases for 92 % of the H. influenzae protein sequences, and at least a general functional prediction was possible for 83 %. A comparison of the H. influenzae protein sequences with those of 3010 proteins encoded by the sequenced 75 % of the E. coli genome revealed 1128 pairs of apparent orthologs, with an average of 59 % identity. In contrast to the high similarity between orthologs, the genome organization and the functional repertoire of genes in the two bacteria were remarkably different. The smaller genome size of H. influenzae is explained, to a large extent, by a reduction in the number of paralogous genes. There was no long range colinearity between the E. coli and H. influenzae gene orders, but over 70 % of the orthologous genes were found in short conserved strings, only about half of which were operons in E. coli. Superposition of the H. influenzae enzyme repertoire upon the known E. coli metabolic pathways allowed us to reconstruct similar and alternative pathways in H. influenzae and provides an explanation for the known nutritional requirements. 
CONCLUSIONS By comparing proteins encoded by the two bacterial genomes, we have shown that extensive gene shuffling and variation in the extent of gene paralogy are major trends in bacterial evolution; this comparison has also allowed us to deduce crucial aspects of the largely uncharacterized metabolism of H. influenzae.
Affiliation(s)
- R L Tatusov
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
22
Affiliation(s)
- Keith Robison
- Department of Molecular and Cellular Biology, Harvard Biological Laboratories, Cambridge, MA 02138, USA
- Walter Gilbert
- Department of Molecular and Cellular Biology, Harvard Biological Laboratories, Cambridge, MA 02138, USA
- George M. Church
- Department of Genetics, Harvard Medical School, Warren Alpert Building, Boston, MA 02115, USA
23
Abstract
An adequate set of computer procedures tailored to address the task of genome-scale analysis of protein sequences will greatly increase the beneficial impact of the genome sequencing projects on the progress of biological research. This is especially pertinent given the fact that, for model organisms, one-half or more of the putative gene products have not been functionally characterized. Here we describe several programs that may comprise the core of such a set and their application to the analysis of about 3000 proteins comprising 75% of the E. coli gene products. We find that the protein sequences encoded in this model genome are a rich source of information, with biologically relevant similarities detected for more than 80% of them. In the majority of cases, these similarities become evident directly from the results of BLAST searches. However, methods for motif analysis provide a significant increase in search sensitivity and are particularly important for the detection of ancient conserved regions. As a result of sequence similarity analysis, generalized functional predictions can be made for the majority of uncharacterized ORF products, allowing efficient focusing of experimental effort. Clustering of the E. coli proteins on the basis of sequence similarity shows that almost one-half of the bacterial proteins have at least one paralog and that the likelihood that a protein belongs to a small or a large cluster depends on the function of this particular protein.
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
24
Liao X, Charlebois I, Ouellet C, Morency MJ, Dewar K, Lightfoot J, Foster J, Siehnel R, Schweizer H, Lam JS, Hancock REW, Levesque RC. Physical mapping of 32 genetic markers on the Pseudomonas aeruginosa PAO1 chromosome. Microbiology (Reading, England) 1996; 142 (Pt 1):79-86. [PMID: 8581173 DOI: 10.1099/13500872-142-1-79] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.6] [Indexed: 01/31/2023]
Abstract
The Pseudomonas aeruginosa chromosome was fractionated with the enzymes SpeI and DpnI, and genomic fragments were separated by PFGE and used for mapping a collection of 40 genes. This permitted the localization of 8 genes previously mapped and of 32 genes which had not been mapped. We showed that a careful search of databases and identification of sequences that were homologous to known genes could be used to design and synthesize DNA probes for the mapping of P. aeruginosa homologues by Southern hybridization with genomic fragments, resulting in definition of the locations of the aro-2, dapB, envA, mexA, groEL, oprH, oprM, oprP, ponA, rpoB and rpoH genetic markers. In addition, a combination of distinct DNA sources was utilized as radioactively labelled probes, including specific restriction fragments of the cloned genes (glpD, opdE, oprH, oprO, oprP, phoS), DNA fragments prepared by PCR, and single-stranded DNA prepared from phagemid libraries that had been randomly sequenced. We used a PCR approach to clone fragments of the putative yhhF, sucC, sucD, cypH, pbpB, murE, pbpC, soxR, ftsA, ftsZ and envA genes. Random sequencing of P. aeruginosa DNA from phagemid libraries and database searching permitted the cloning of sequences from the acoA, catR, hemD, pheS, proS, oprD, pyo and rpsB gene homologues. The described genomic methods permit the rapid mapping of the P. aeruginosa genome without linkage analysis.
MESH Headings
- Base Sequence
- Chromosomes, Bacterial/genetics
- Cloning, Molecular
- DNA, Bacterial/genetics
- DNA, Bacterial/metabolism
- DNA, Complementary/genetics
- Deoxyribonucleases, Type II Site-Specific/metabolism
- Electrophoresis, Gel, Pulsed-Field
- Gene Expression
- Genes, Bacterial
- Genetic Markers
- Molecular Sequence Data
- Oligonucleotide Probes
- Polymerase Chain Reaction
- Pseudomonas aeruginosa/genetics
- Restriction Mapping
- Sequence Analysis, DNA
Affiliation(s)
- Xiaowen Liao
- Department of Microbiology and Immunology, University of British Columbia, 300-6174 University Boulevard, Vancouver BC, Canada V6T 1Z3
- Isabelle Charlebois
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
- Catherine Ouellet
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
- Marie-Josée Morency
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
- Ken Dewar
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
- Jeff Lightfoot
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
- Jennifer Foster
- Department of Microbiology, University of Guelph, Guelph, Ontario, Canada N1G 2W1
- Richard Siehnel
- Department of Microbiology and Immunology, University of British Columbia, 300-6174 University Boulevard, Vancouver BC, Canada V6T 1Z3
- Herbert Schweizer
- Department of Medical Microbiology and Infectious Diseases, University of Calgary, Calgary, Alberta, Canada T2N 4N1
- Joseph S Lam
- Department of Microbiology, University of Guelph, Guelph, Ontario, Canada N1G 2W1
- Robert E W Hancock
- Department of Microbiology and Immunology, University of British Columbia, 300-6174 University Boulevard, Vancouver BC, Canada V6T 1Z3
- Roger C Levesque
- Microbiologie Moléculaire et Génie des Protéines, Département de Microbiologie, Faculté de Médecine, Pavillon Charles-Eugène-Marchand, Université Laval, Ste-Foy, Québec, Canada G1K 7P4
25
Koonin EV, Tatusov RL, Rudd KE. Sequence similarity analysis of Escherichia coli proteins: functional and evolutionary implications. Proc Natl Acad Sci U S A 1995; 92:11921-5. [PMID: 8524875 PMCID: PMC40515 DOI: 10.1073/pnas.92.25.11921] [Citation(s) in RCA: 82] [Impact Index Per Article: 2.8] [Indexed: 01/31/2023]
Abstract
A computer analysis of 2328 protein sequences comprising about 60% of the Escherichia coli gene products was performed using methods for database screening with individual sequences and alignment blocks. A high fraction of E. coli proteins (86%) shows significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain ancient conserved regions (ACRs) shared with eukaryotic or archaeal proteins. For >90% of the E. coli proteins, either functional information or sequence similarity, or both, are available. Forty-six percent of the E. coli proteins belong to 299 clusters of paralogs (intraspecies homologs) defined on the basis of pairwise similarity. Another 10% could be included in 70 superclusters using motif detection methods. The majority of the clusters contain only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest superclusters, namely permeases, ATPases and GTPases with the conserved "Walker-type" motif, helix-turn-helix regulatory proteins, and NAD(FAD)-binding proteins. We conclude that bacterial protein sequences generally are highly conserved in evolution, with about 50% of all ACR-containing protein families represented among the E. coli gene products. With the current sequence databases and methods of their screening, computer analysis yields useful information on the functions and evolutionary relationships of the vast majority of genes in a bacterial genome. Sequence similarity with E. coli proteins allows the prediction of functions for a number of important eukaryotic genes, including several whose products are implicated in human diseases.
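The abstract groups proteins into clusters of paralogs on the basis of pairwise similarity but does not spell out the clustering procedure. One common way to turn a list of similar pairs into clusters is single-linkage grouping with a union-find structure; the sketch below assumes that approach. The function name `cluster_by_similarity`, the protein labels, and the input pairs are all hypothetical and are not taken from the paper.

```python
def cluster_by_similarity(proteins, similar_pairs):
    """Single-linkage clustering: proteins joined by any chain of
    similar pairs end up in the same cluster (assumed procedure)."""
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    # Only groups with two or more members count as paralog clusters.
    return [sorted(members) for members in clusters.values() if len(members) > 1]


# Hypothetical toy input: five proteins, two similar pairs.
proteins = ["a", "b", "c", "d", "e"]
pairs = [("a", "b"), ("b", "c")]
clusters = cluster_by_similarity(proteins, pairs)
fraction_in_clusters = sum(len(c) for c in clusters) / len(proteins)
```

With real data, the pairs would come from a similarity search filtered at a chosen significance threshold, and the fraction of proteins in clusters corresponds to the kind of figure the abstract reports (46%).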
Affiliation(s)
- E V Koonin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
26
Borodovsky M, McIninch JD, Koonin EV, Rudd KE, Médigue C, Danchin A. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res 1995; 23:3554-62. [PMID: 7567469 PMCID: PMC307237 DOI: 10.1093/nar/23.17.3554] [Citation(s) in RCA: 96] [Impact Index Per Article: 3.3] [Indexed: 01/26/2023]
Abstract
We further investigated the statistical features of the three classes of Escherichia coli genes that have been previously delineated by factorial correspondence analysis and dynamic clustering methods. A phased Markov model for a nucleotide sequence of each gene class was developed and employed for gene prediction using the GeneMark program. The protein-coding region prediction accuracy was determined for class-specific Markov models of different orders when the programs implementing these models were applied to gene sequences from the same or other classes. It is shown that at least two training sets and two program versions derived for different classes of E. coli genes are necessary in order to achieve a high accuracy of coding region prediction for uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as genes are predicted to encode proteins. The amino acid sequences of the putative products of these ORFs initially did not show similarity to already known proteins. However, conserved regions have been identified in several of them by screening the latest entries in protein sequence databases and applying methods for motif search, while some other of these new genes have been identified in independent experiments.
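The idea of a phased Markov model can be illustrated with a small sketch: transition probabilities are estimated separately for each of the three codon positions, and a sequence is scored by its log-likelihood ratio against a plain (non-phased) model of non-coding DNA. This is not the GeneMark implementation. The first-order model, the pseudocount, the toy training sequences, and the names `train_phased`, `train_plain`, `log_prob`, and `coding_score` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_phased(coding_seqs, order=1):
    """Count transitions separately for each codon phase.
    Training sequences are assumed to start at codon position 1."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1
    return counts


def train_plain(seqs, order=1):
    """Count transitions with no phase information (non-coding model)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return counts


def log_prob(counts, context, base, pseudo=1.0):
    """Log transition probability with add-one style smoothing."""
    row = counts.get(context, {})
    total = sum(row.get(b, 0) for b in ALPHABET) + pseudo * len(ALPHABET)
    return math.log((row.get(base, 0) + pseudo) / total)


def coding_score(seq, phased, plain, frame=0, order=1):
    """Log-likelihood ratio of the phased coding model (read in the given
    frame) against the plain non-coding model."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        phase = (i - frame) % 3
        score += log_prob(phased[phase], context, base) - log_prob(plain, context, base)
    return score


# Toy training data (hypothetical): strongly periodic "coding" DNA and a
# non-coding sequence in which every dinucleotide occurs exactly once.
phased = train_phased(["GCTGCAGCCGCG" * 10])
plain = train_plain(["AACAGATCCGCTGGTTA"])

query = "GCTGCAGCC"
in_frame = coding_score(query, phased, plain, frame=0)
shifted = coding_score(query, phased, plain, frame=1)
```

In this toy setting the query scores positively in its true frame and negatively when read one base out of frame. The abstract's point about class-specific models would correspond to training several such phased models, one per gene class, and comparing their scores.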
Affiliation(s)
- M Borodovsky
- School of Biology, Georgia Institute of Technology, Atlanta 30332, USA
27
Abstract
We present edition VIII of the genetic map of Salmonella typhimurium LT2. We list a total of 1,159 genes, 1,080 of which have been located on the circular chromosome and 29 of which are on pSLT, the 90-kb plasmid usually found in LT2 lines. The remaining 50 genes are not yet mapped. The coordinate system used in this edition is neither minutes of transfer time in conjugation crosses nor units representing "phage lengths" of DNA of the transducing phage P22, as used in earlier editions, but centisomes and kilobases based on physical analysis of the lengths of DNA segments between genes. Some of these lengths have been determined by digestion of DNA by rare-cutting endonucleases and separation of fragments by pulsed-field gel electrophoresis. Other lengths have been determined by analysis of DNA sequences in GenBank. We have constructed StySeq1, which incorporates all Salmonella DNA sequence data known to us. StySeq1 comprises over 548 kb of nonredundant chromosomal genomic sequences, representing 11.4% of the chromosome, which is estimated to be just over 4,800 kb in length. Most of these sequences were assigned locations on the chromosome, in some cases by analogy with mapped Escherichia coli sequences.
Affiliation(s)
- K E Sanderson
- Department of Biological Sciences, University of Calgary, Alberta, Canada
28
Darcy TJ, Sandman K, Reeve JN. Methanobacterium formicicum, a mesophilic methanogen, contains three HFo histones. J Bacteriol 1995; 177:858-60. [PMID: 7836329 PMCID: PMC176673 DOI: 10.1128/jb.177.3.858-860.1995] [Citation(s) in RCA: 25] [Impact Index Per Article: 0.9] [Indexed: 01/27/2023]
Abstract
The mesophilic methanogen Methanobacterium formicicum JF-1 has been shown to contain three members of the HMf family of archaeal histones, designated HFoA1, HFoA2, and HFoB, and their encoding genes (hfoA1, hfoA2, and hfoB) have been cloned and sequenced. The HFo histones have primary sequences that are 75 to 82% identical to the HMf sequences and appear to share ancestry with the core histones that form the eukaryal nucleosome. The HFo proteins bind and compact DNA molecules into nucleosome-like structures apparently identical to those formed by the HMf proteins, but, in contrast to the HMf proteins, this activity of the HFo proteins is lost after incubation at 95°C for 5 h.
Affiliation(s)
- T J Darcy
- Department of Microbiology, Ohio State University, Columbus 43210
29
Abstract
Primary sequence patterns based on known conserved sites in eukaryotic protein kinases were used to search for eukaryotic-like protein kinase sequences in a six-frame translation of the bacterial subsection of GenBank. This search identified a previously unrecognized eukaryotic-like protein kinase gene in three related methanogenic archaebacteria, Methanococcus vannielii, M. voltae, and M. thermolithotrophicus. The proposed coding sequences are located in orthologous open reading frames (ORFs): ORF547, ORF294, and ORF114, respectively. The C-terminus of the ORFs contains 9 of the 11 subdomains characteristically conserved within the eukaryotic protein kinase catalytic domain. The N-terminus of the ORFs is similar to a putative glycoprotease in Pasteurella haemolytica and its homologue in Escherichia coli, the orfX gene. This is the first report of a eukaryotic-like protein kinase sequence observed in Archaebacteria.
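The search strategy described here, matching protein sequence patterns against a six-frame translation of DNA, can be sketched as follows. The sketch assumes the standard genetic code and uses a regular expression as a stand-in for the sequence patterns. The pattern `HRDLK..N` is only an illustrative motif in the style of a protein kinase catalytic-loop consensus, not one of the patterns used in the study, and the function names are hypothetical.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield (frame label, protein) for three offsets on each strand."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            yield f"{strand}{offset + 1}", translate(seq[offset:])


def find_motif(dna, pattern):
    """Return (frame, position in translated frame, matched text) tuples."""
    regex = re.compile(pattern)
    return [(frame, m.start(), m.group())
            for frame, protein in six_frames(dna)
            for m in regex.finditer(protein)]


# Hypothetical example: DNA encoding the peptide MHRDLKPEN.
dna = "ATGCATCGTGATCTGAAACCGGAAAAC"
hits = find_motif(dna, r"HRDLK..N")
hits_reverse = find_motif(reverse_complement(dna), r"HRDLK..N")
```

Because all six frames are scanned, the same coding sequence is found whether it is supplied on the forward or the reverse strand, which is what makes this kind of search useful on unannotated database entries.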
Affiliation(s)
- R F Smith
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
30
Borodovsky M, Rudd KE, Koonin EV. Intrinsic and extrinsic approaches for detecting genes in a bacterial genome. Nucleic Acids Res 1994; 22:4756-67. [PMID: 7984428 PMCID: PMC308528 DOI: 10.1093/nar/22.22.4756] [Citation(s) in RCA: 80] [Impact Index Per Article: 2.7] [Indexed: 01/28/2023]
Abstract
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
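The percentages quoted in this abstract follow directly from the reported counts, and a short calculation makes the relationship explicit. The counts (354, 208, 182, 73) are from the abstract; the derived quantities (ORFs supported by only one method, or by at least one) are simple set arithmetic added here for illustration and are not stated in the source. The function name `overlap_summary` is hypothetical.

```python
def overlap_summary(genemark, blast, both, ancient):
    """Summarize agreement between two prediction methods from raw counts."""
    def pct(part, whole):
        return round(100.0 * part / whole, 1)

    return {
        "both_as_pct_of_genemark": pct(both, genemark),
        "both_as_pct_of_blast": pct(both, blast),
        "ancient_as_pct_of_genemark": pct(ancient, genemark),
        # Derived by inclusion-exclusion; not reported in the abstract.
        "genemark_only": genemark - both,
        "blast_only": blast - both,
        "either_method": genemark + blast - both,
    }


summary = overlap_summary(genemark=354, blast=208, both=182, ancient=73)
```

The first three values reproduce the 51.4%, 87.5%, and 20.6% given in the text.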
Affiliation(s)
- M Borodovsky
- School of Biology, Georgia Institute of Technology, Atlanta 30332-0230