1
Danchin A, Ouzounis C, Tokuyasu T, Zucker JD. No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects. Microb Biotechnol 2018; 11:588-605. [PMID: 29806194 PMCID: PMC6011933 DOI: 10.1111/1751-7915.13284] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022]
Abstract
Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from 'the sequence tells the structure tells the function' fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader.
Affiliation(s)
- Antoine Danchin
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, Hong Kong University, 21 Sassoon Road, Pokfulam, Hong Kong
- Christos Ouzounis
- Biological Computation and Process Laboratory, Centre for Research and Technology Hellas, Chemical Process and Energy Resources Institute, Thessalonica, 57001, Greece
- Taku Tokuyasu
- Shenzhen Institutes of Advanced Technology, Institute of Synthetic Biology, Shenzhen University Town, 1068 Xueyuan Avenue, Shenzhen, China
- Jean-Daniel Zucker
- Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
2
Kröger M, Wahl R. Compilation of DNA sequences of Escherichia coli K12: description of the interactive databases ECD and ECDC. Nucleic Acids Res 1998; 26:46-9. [PMID: 9399797 PMCID: PMC147217 DOI: 10.1093/nar/26.1.46] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Indexed: 02/05/2023]
Abstract
We have compiled the DNA sequence data for Escherichia coli K12 available from the GenBank and EMBL data libraries and independently from the literature. We provide the most definitive version of the ECD Escherichia coli database now exclusively via the World Wide Web System (http://susi.bio.uni-giessen.de/ecdc.html). Our database encloses the completed genome sequence recently published by two competing groups and an assembled set of all elder sequences. The organisation of the database allows precise physical location of each individual gene or regulatory region, even taking into consideration discrepancies in nomenclature. The WWW program allows the user to branch into the original EMBL and SWISS-PROT datafiles. A number of links to other WWW servers dealing with E. coli is provided. A FASTA and BLAST search may be performed online. Besides the WWW format a flat file version may be obtained via ftp. A number of discrepancies between the two systematic sequence determinations and/or the literature have not yet been resolved. However, our database may serve as a reference source for resolution and/or the assignment of strain difference.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Giessen, Frankfurter Strasse 107, D-35392 Giessen, Germany.
3
Kröger M, Wahl R. Compilation of DNA sequences of Escherichia coli K12: description of the interactive databases ECD and ECDC (update 1996). Nucleic Acids Res 1997; 25:39-42. [PMID: 9016501 PMCID: PMC146385 DOI: 10.1093/nar/25.1.39] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 02/03/2023]
Abstract
We have compiled the DNA sequence data for Escherichia coli available from the GenBank and EMBL data libraries and independently from the literature. We provide the most definitive version of the ECD Escherichia coli database now exclusively via the World Wide Web System: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html. Our database encloses an assembled set of contiguous sequences. Each of these contigs compiles all available sequence information, including those derived from a variety of elder sequences. The organisation of the database allows precise physical location of each individual gene or regulatory region, even taking into consideration discrepancies in nomenclature. The WWW program allows to branch into the original EMBL and SWISSPROT datafiles. A number of links to other WWW servers is provided. A FASTA and BLAST search may be performed online. Besides the WWW format a flat file version may be obtained via ftp. The ftp version may also be obtained from the EMBL data library as part of the CD-ROM issue of the EMBL sequence database, which is released and updated every 3 months. After deletion of all detected overlaps a total of 3 588 706 individual bp has been determined up to the end of September 1996. This corresponds to a total of 77.09% of the entire E.coli chromosome consisting of approximately 4655 kb. About 479 kb (10.3%) are additionally available from Kyoto (Japan). Another 94 kb (2%) are available, but mapping has not been confirmed. Thus the total may have reached 89.4%.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Giessen, Frankfurter Strasse 107, D-35392 Giessen, Germany.
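The coverage figures quoted in the 1996 update abstract above can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable and function names are hypothetical, and the chromosome size is taken as exactly 4655 kb even though the abstract says "approximately".

```python
# Figures as quoted in the abstract; names are illustrative, not from the paper.
CHROMOSOME_BP = 4_655_000   # "approximately 4655 kb" (assumed exact here)
SEQUENCED_BP = 3_588_706    # non-overlapping bp determined by September 1996
KYOTO_BP = 479_000          # additionally available from Kyoto
UNCONFIRMED_BP = 94_000     # available, but mapping not confirmed


def coverage_percent(sequenced_bp: int, total_bp: int) -> float:
    """Return the share of the chromosome covered, in percent."""
    return 100.0 * sequenced_bp / total_bp


mapped = coverage_percent(SEQUENCED_BP, CHROMOSOME_BP)
kyoto = coverage_percent(KYOTO_BP, CHROMOSOME_BP)
unconfirmed = coverage_percent(UNCONFIRMED_BP, CHROMOSOME_BP)
total = mapped + kyoto + unconfirmed

print(f"mapped: {mapped:.2f}%")
print(f"Kyoto: {kyoto:.2f}%")
print(f"unconfirmed: {unconfirmed:.2f}%")
print(f"total: {total:.1f}%")
```

Under these assumptions the sketch reproduces the 77.09% mapped coverage and a combined total of about 89.4%, in line with the abstract.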
4
Kröger M, Wahl R. Compilation of DNA sequences of Escherichia coli K12 (ECD and ECDC; update 1995). Nucleic Acids Res 1996; 24:29-31. [PMID: 8594594 PMCID: PMC145621 DOI: 10.1093/nar/24.1.29] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.5] [Indexed: 01/31/2023]
Abstract
We have compiled the DNA sequence data for Escherichia coli available from the GenBank and EMBL data libraries and independently from the literature. Unlike the previous updates of our E.coli databases, we provide the most recent version preferentially via the World Wide Web System (use URL: http://susi.bio.uni-giessen.de/usr/local/www/html/ecdc.html). Our database includes an assembled set of contiguous sequences. Each of these contigs compiles all available sequence information, including those derived from a variety of elder sequences. The organization of the database allows one to find the exact physical location of each individual gene or regulatory region, even regarding discrepancies in nomenclature. The WWW program allows access into the original EMBL and SWISSPROT datafiles. A FASTA and BLAST search may be performed online. Besides the WWW format a flat file version may be obtained via ftp. The complete compilation, including a full set of genetic map data and the E.coli protein index, can be obtained in machine readable form from the EMBL data library as a part of the CD-ROM issue of the EMBL sequence database, released and updated every three months. After deletion of all detected overlaps a total of 3 333 878 individual bp was determined by the end of September 1995. This corresponds to a total of 71.71% of the entire E.coli chromosome consisting of about 4720 kbp. About 94 kbp (2%) are available additionally, but have not yet been definitely mapped.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Giessen, Germany
5
Wahl R, Kröger M. ECDC--a totally integrated and interactively usable genetic map of Escherichia coli K12. Microbiol Res 1995; 150:7-61. [PMID: 7735721 DOI: 10.1016/s0944-5013(11)80034-0] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.4] [Indexed: 01/26/2023]
Abstract
A printed version of the interactively usable genetic map of Escherichia coli K12 is provided together with some statistical information about the actual status of the respective genome sequencing project. A total of 3,179,967 bp corresponding to 68.38% of the genome is available through the ECDC database. Contigs as well as individual DNA sequences for each gene or open reading frame are provided. Access to a number of other databases is possible using World Wide Web or local programs.
Affiliation(s)
- R Wahl
- Institut für Mikrobiologie und Molekularbiologie, Justus-Liebig-Universität Giessen, Germany
6
Wahl R, Rice P, Rice CM, Kröger M. ECD--a totally integrated database of Escherichia coli K12. Nucleic Acids Res 1994; 22:3450-5. [PMID: 7937044 PMCID: PMC308300 DOI: 10.1093/nar/22.17.3450] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.9] [Indexed: 01/27/2023]
Abstract
We have compiled the DNA sequence data for E. coli available from the GENBANK and EMBL data libraries and independently from the literature. Starting with this update of our Escherichia coli database (ECD release 20) we provide major changes compared to previous issues. This update not only represents another substantial increase in sequence information, it also allows now to find the exact physical location of each individual gene or regulatory region, even regarding discrepancies in nomenclature. In order to save space this printed version does not contain the database itself anymore, but we provide several examples. The complete database is publically available in electronic form together with a self explaining application program or as a flat file. The complete compilation including a full set of genetic map data and the E. coli protein index can be obtained in machine readable form from the EMBL data library as a part of the CD-ROM issue of the EMBL sequence database, released and updated every three months. After deletion of all detected overlaps a total of 2,878,364 individual bp is found to be determined till the end of June 1994. This corresponds to a total of 60.98% of the entire E. coli chromosome consisting of about 4,720 kbp. This number may actually be higher by 9161 bp derived from other strains of E. coli.
Affiliation(s)
- R Wahl
- Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Giessen, Germany
7
Kröger M, Wahl R, Rice P. Compilation of DNA sequences of Escherichia coli (update 1993). Nucleic Acids Res 1993; 21:2973-3000. [PMID: 8332520 PMCID: PMC309723 DOI: 10.1093/nar/21.13.2973] [Citation(s) in RCA: 27] [Impact Index Per Article: 0.9] [Indexed: 01/29/2023]
Abstract
We have compiled the DNA sequence data for E. coli available from the GENBANK and EMBL data libraries and over a period of several years independently from the literature. This is the fifth listing replacing and increasing the former listings substantially. However, in order to save space this printed version contains DNA sequence information only, if they are publically available in electronic form. The complete compilation including a full set of genetic map data and the E. coli protein index can be obtained in machine readable form from the EMBL data library (ECD release 15) as a part of the CD-ROM issue of the EMBL sequence database, released and updated every three months. After deletion of all detected overlaps a total of 2,353,635 individual bp is found to be determined till the end of April 1993. This corresponds to a total of 49.87% of the entire E. coli chromosome consisting of about 4,720 kbp. This number may actually be higher by 9161 bp derived from other strains of E. coli.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Justus-Liebig-Universität Giessen, Germany
8
Rowland GC, Lim PP, Glass RE. 'Stop-codon-specific' restriction endonucleases: their use in mapping and gene manipulation. Gene 1992; 116:21-6. [PMID: 1628840 DOI: 10.1016/0378-1119(92)90624-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.0] [Indexed: 12/28/2022]
Abstract
Certain restriction endonucleases recognise target sequences that contain the stop triplet TAG and are commonly either 4 or 6 bp in length. Interestingly, these restriction targets do not occur at the frequency expected on the basis of base composition and size. For example, the tetranucleotide MaeI recognition sequence (CTAG) occurs considerably less commonly (5-8-fold) in the genome of Escherichia coli (and many other eubacteria) than expected from mononucleotide frequencies. This surprising rarity is particularly evident in protein-encoding genes and is largely dictated by codon usage. Thus, amber (TAG) nonsense mutations frequently give rise to novel MaeI (CTAG) sites which are unique within a translated region. Such amber/MaeI sites, whether arising spontaneously or created in vitro by site-directed mutagenesis, act as a useful physical marker for the presence of the nonsense mutation and are a convenient startpoint for a range of diverse procedures. These features provide a useful supplement to protein engineering methods which use nonsense suppression to mediate amino acid replacements.
Affiliation(s)
- G C Rowland
- Department of Biochemistry, University of Nottingham Medical School, Queen's Medical Centre, UK
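The comparison described in the abstract above, observed CTAG occurrences against the number expected from mononucleotide frequencies, can be sketched as follows. This is an illustrative sketch only, not the authors' procedure: the function name is hypothetical, it scans a single strand, and it assumes independent bases when computing the expectation.

```python
from collections import Counter


def observed_expected_ratio(seq: str, motif: str) -> float:
    """Ratio of observed motif count to the count expected if bases
    were drawn independently at their observed frequencies."""
    seq = seq.upper()
    motif = motif.upper()
    n, k = len(seq), len(motif)
    windows = n - k + 1
    if k == 0 or windows <= 0:
        raise ValueError("sequence must be at least as long as a non-empty motif")

    # Count occurrences at every position, so overlapping matches are included.
    observed = sum(1 for i in range(windows) if seq[i:i + k] == motif)

    # Probability of the motif at one position under the mononucleotide model.
    base_counts = Counter(seq)
    probability = 1.0
    for base in motif:
        probability *= base_counts[base] / n

    expected = probability * windows
    if expected == 0:
        raise ValueError("motif contains a base absent from the sequence")
    return observed / expected


# Toy sequences, not real genomic data.
depleted = observed_expected_ratio("ACGT" * 25, "CTAG")
enriched = observed_expected_ratio("CTAG" * 25, "CTAG")
print(depleted, enriched)
```

A ratio well below 1 on a real genome would correspond to the kind of under-representation the abstract reports for CTAG; the toy inputs here only exercise the two extremes.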
9
Kröger M, Wahl R, Schachtel G, Rice P. Compilation of DNA sequences of Escherichia coli (update 1992). Nucleic Acids Res 1992; 20 Suppl:2119-44. [PMID: 1598239 PMCID: PMC333988 DOI: 10.1093/nar/20.suppl.2119] [Citation(s) in RCA: 22] [Impact Index Per Article: 0.7] [Indexed: 12/27/2022]
Abstract
We have compiled the DNA sequence data for E. coli available from the GENBANK and EMBL data libraries and over a period of several years independently from the literature. This is the fourth listing replacing and increasing the former listings substantially. However, in order to save space this printed version contains DNA sequence information only, if they are publically available in electronic form. The complete compilation including a full set of genetic map data and the E. coli protein index can be obtained in machine readable form from the EMBL data library (ECD release 10) or from the CD-ROM version of this supplement issue directly. After deletion of all detected overlaps a total of 1,820,237 individual bp is found to be determined till the beginning of 1992. This corresponds to a total of 38.56% of the entire E. coli chromosome consisting of about 4,720 kbp. This number may actually be higher by some extra 2.5% derived from lysogenic bacteriophage lambda and various DNA sequences already received for other strains of E. coli.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Fachbereich Biologie, Justus-Liebig-Universität Giessen, Germany
10
Meinnel T, Schmitt E, Mechulam Y, Blanquet S. Structural and biochemical characterization of the Escherichia coli argE gene product. J Bacteriol 1992; 174:2323-31. [PMID: 1551850 PMCID: PMC205854 DOI: 10.1128/jb.174.7.2323-2331.1992] [Citation(s) in RCA: 46] [Impact Index Per Article: 1.4] [Indexed: 12/27/2022]
Abstract
The DNA sequence of a 2,100-bp region containing the argE gene from Escherichia coli has been determined. The nucleotide sequence of the ppc-argE intergenic region was also solved and shown to contain six tandemly repeated REP sequences. Moreover, the oxyR gene has been mapped on the E. coli chromosome and shown to flank the arg operon. The codon responsible for the translation start of argE was determined by using site-directed mutants. This gene spans 1,400 bp and encodes a 42,350-Da polypeptide. The argE3 allele and a widely used argE amber gene have also been cloned and sequenced. N-Acetylornithinase, the argE product, has been overproduced and purified to homogeneity. Its main biochemical and catalytic properties are described. Moreover, we demonstrate that the protein is composed of two identical subunits. Finally, the amino acid sequence of N-acetylornithinase is shown to display a high degree of identity with those of the succinyldiaminopimelate desuccinylase from E. coli and carboxypeptidase G2 from a Pseudomonas sp. It is proposed that this carboxypeptidase might be responsible for the acetylornithinase-related activity found in the Pseudomonas sp.
Affiliation(s)
- T Meinnel
- Laboratoire de Biochimie, Unité de Recherche Associée no. 240, Centre National de la Recherche Scientifique, Palaiseau, France
11
van Heeswijk W, Kuppinger O, Merrick M, Kahn D. Localization of the glnD gene on a revised map of the 200-kilobase region of the Escherichia coli chromosome. J Bacteriol 1992; 174:1702-3. [PMID: 1537813 PMCID: PMC206572 DOI: 10.1128/jb.174.5.1702-1703.1992] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.4] [Indexed: 12/27/2022]
Affiliation(s)
- W van Heeswijk
- E.C. Slater Institute for Biochemical Research, University of Amsterdam, The Netherlands
12
Somerville R. The Trp repressor, a ligand-activated regulatory protein. Prog Nucleic Acid Res Mol Biol 1992; 42:1-38. [PMID: 1574585 DOI: 10.1016/s0079-6603(08)60572-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 0.8] [Indexed: 12/27/2022]
Affiliation(s)
- R Somerville
- Department of Biochemistry, Purdue University, West Lafayette, Indiana 47907
13
Abstract
The DNA sequence data for Escherichia coli deposited in the EMBL library (release 27), together with miscellaneous data obtained from several laboratories, have been localized on an updated and corrected version of the restriction map of the chromosome generated by Kohara et al. (1987) and modified by others. This second update adds a further 500 kbp, increasing the amount of the E. coli chromosome sequenced to about one third of the total: 1510 kbp of sequenced DNA is included in the present data base. The accuracy of the map is assessed, and allows us to propose a precise genetic map position for every sequenced gene. The location of rare-cutting sites such as AvrII, NotI and SfiI have also been included in the update in order to combine the data obtained from different sources into one single file. The distribution of palindromic sequences (to which most restriction sites belong) has been studied in coding sequences. There appears to be a significant counter-selection against several such sequences in E. coli coding sequences (but not in other organisms such as Saccharomyces cerevisiae), suggesting the existence of constraints on DNA structure in E. coli, perhaps indicative of a functional role for horizontal gene transfer, preserving coding sequences, in this type of bacteria.
Affiliation(s)
- C Médigue
- Section Physique-Chimie, Institut Curie, Paris, France
14
Rex JH, Aronson BD, Somerville RL. The tdh and serA operons of Escherichia coli: mutational analysis of the regulatory elements of leucine-responsive genes. J Bacteriol 1991; 173:5944-53. [PMID: 1917830 PMCID: PMC208338 DOI: 10.1128/jb.173.19.5944-5953.1991] [Citation(s) in RCA: 59] [Impact Index Per Article: 1.8] [Indexed: 12/29/2022]
Abstract
The tdh promoter of Escherichia coli is induced seven- to eightfold when cells are grown in the presence of exogenous leucine. A scheme was devised to select mutants that exhibited high constitutive expression of the tdh promoter. The mutations in these strains were shown to lie within a previously identified gene (lrp) that encodes Lrp (leucine-responsive regulatory protein). By deletion analysis, the site of action of Lrp was localized to a 25-bp region between coordinates -69 and -44 of the tdh promoter. Disruption of a 12-bp presumptive target sequence found in this region of tdh resulted in constitutively derepressed expression from the tdh promoter. Similar DNA segments (consensus, TTTATTCtNaAT) were also identified in a number of other promoters, including each of the Lrp-regulated promoters whose nucleotide sequence is known. The sequence of the promoter region of serA, an Lrp-regulated gene, was determined. No Lrp consensus target sequence was present upstream of serA, suggesting that Lrp acts indirectly on the serA promoter. A previously described mutation in a leucine-responsive trans-acting factor, LivR (J. J. Anderson, S. C. Quay, and D. L. Oxender, J. Bacteriol. 126:80-90, 1976), resulted in constitutively repressed expression from the tdh promoter and constitutively induced expression from the serA promoter. The possibility that LivR and Lrp are allelic is discussed.
Affiliation(s)
- J H Rex
- Department of Biochemistry, Purdue University, West Lafayette, Indiana 47907
15
Sharp PM. Determinants of DNA sequence divergence between Escherichia coli and Salmonella typhimurium: codon usage, map position, and concerted evolution. J Mol Evol 1991; 33:23-33. [PMID: 1909371 DOI: 10.1007/bf02100192] [Citation(s) in RCA: 171] [Impact Index Per Article: 5.2] [Indexed: 12/29/2022]
Abstract
The nature and extent of DNA sequence divergence between homologous protein-coding genes from Escherichia coli and Salmonella typhimurium have been examined. The degree of divergence varies greatly among genes at both synonymous (silent) and nonsynonymous sites. Much of the variation in silent substitution rates can be explained by natural selection on synonymous codon usage, varying in intensity with gene expression level. Silent substitution rates also vary significantly with chromosomal location, with genes near oriC having lower divergence. Certain genes have been examined in more detail. In particular, the duplicate genes encoding elongation factor Tu, tufA and tufB, from S. typhimurium have been compared to their E. coli homologues. As expected these very highly expressed genes have high codon usage bias and have diverged very little between the two species. Interestingly, these genes, which are widely spaced on the bacterial chromosome, also appear to be undergoing concerted evolution, i.e., there has been exchange between the loci subsequent to the divergence of the two species.
Affiliation(s)
- P M Sharp
- Department of Genetics, Trinity College, Dublin, Ireland
16
Kröger M, Wahl R, Rice P. Compilation of DNA sequences of Escherichia coli (update 1991). Nucleic Acids Res 1991; 19 Suppl:2023-43. [PMID: 2041799 PMCID: PMC331345 DOI: 10.1093/nar/19.suppl.2023] [Citation(s) in RCA: 28] [Impact Index Per Article: 0.8] [Indexed: 12/29/2022]
Abstract
We have compiled the DNA sequence data for E. coli available from the GENBANK and EMBL data libraries and over a period of several years independently from the literature. This is the third listing replacing and increasing the former listing roughly by one fifth. However, in order to save space this printed version contains DNA sequence information only. The complete compilation is now available in machine readable form from the EMBL data library (ECD release 6). After deletion of all detected overlaps a total of 1,492,282 individual bp is found to be determined till the beginning of 1991. This corresponds to a total of 31.62% of the entire E. coli chromosome consisting of about 4,720 kbp. This number may actually be higher by some extra 2.5% derived from lysogenic bacteriophage lambda and various DNA sequences already received for statistical purposes only.
Affiliation(s)
- M Kröger
- Institut für Mikrobiologie und Molekularbiologie, Justus-Liebig-Universität Giessen, FRG
17
Rudd KE, Miller W, Werner C, Ostell J, Tolstoshev C, Satterfield SG. Mapping sequenced E.coli genes by computer: software, strategies and examples. Nucleic Acids Res 1991; 19:637-47. [PMID: 2011534 PMCID: PMC333660 DOI: 10.1093/nar/19.3.637] [Citation(s) in RCA: 66] [Impact Index Per Article: 2.0] [Indexed: 12/29/2022]
Abstract
Methods are presented for organizing and integrating DNA sequence data, restriction maps, and genetic maps for the same organism but from a variety of sources (databases, publications, personal communications). Proper software tools are essential for successful organization of such diverse data into an ordered, cohesive body of information, and a suite of novel software to support this endeavor is described. Though these tools automate much of the task, a variety of strategies is needed to cope with recalcitrant cases. We describe such strategies and illustrate their application with numerous examples. These strategies have allowed us to order, analyze, and display over one megabase of E. coli DNA sequence information. The integration task often exposes inconsistencies in the available data, perhaps caused by strain polymorphisms or human oversight, necessitating the application of sound biological judgment. The examples illustrate both the level of expertise required of the database curator and the knowledge gained as apparent inconsistencies are resolved. The software and mapping methods are applicable to the study of any genome for which a high resolution restriction map is available. They were developed to support a weakly coordinated sequencing effort involving many laboratories, but would also be useful for highly orchestrated sequencing projects.
Affiliation(s)
- K E Rudd
- Laboratory of Bacterial Toxins, Food and Drug Administration, Bethesda, MD 20892
18
Gilson E, Saurin W, Perrin D, Bachellier S, Hofnung M. The BIME family of bacterial highly repetitive sequences. Res Microbiol 1991; 142:217-22. [PMID: 1656494 DOI: 10.1016/0923-2508(91)90033-7] [Citation(s) in RCA: 37] [Impact Index Per Article: 1.1] [Indexed: 12/28/2022]
Abstract
Palindromic units (PU or REP) were initially defined as a DNA sequence of 40 nucleotides which is highly repeated in the genome of several enterobacteria and found in clusters of up to six copies. It appears now that PU belong to a larger repeated DNA element, of up to 300 nucleotides, called BIME for bacterial interspersed mosaic element. BIME is a mosaic combination of ten small DNA motifs, including the PU sequence. A central question concerning BIME is to determine whether they play a critical role within the cell. BIME exhibit only limited effects on local gene expression; it seems unlikely that these weak effects alone can account for the high BIME sequence homogeneity. It has recently been shown that DNA gyrase and DNA polymerase I are able to specifically recognize BIME DNA in vitro. These findings suggest that BIME could play a role in the functional organization of the bacterial nucleoid. Hypotheses on their origin and evolution are discussed.
Affiliation(s)
- E Gilson
- Unité de Programmation Moléculaire et Toxicologie Génétique, CNRS UA271 INSERM U163, Institut Pasteur, Paris
19
Old IG, Phillips SE, Stockley PG, Saint Girons I. Regulation of methionine biosynthesis in the Enterobacteriaceae. Prog Biophys Mol Biol 1991; 56:145-85. [PMID: 1771231 DOI: 10.1016/0079-6107(91)90012-h] [Citation(s) in RCA: 50] [Impact Index Per Article: 1.5] [Indexed: 12/28/2022]
Affiliation(s)
- I G Old
- Département de Bactériologie et Mycologie, Institut Pasteur, Paris, France
20
Fuchs R, Cameron GN. Molecular biological databases: the challenge of the genome era. Prog Biophys Mol Biol 1991; 56:215-45. [PMID: 1771233 DOI: 10.1016/0079-6107(91)90014-j] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.2] [Indexed: 12/28/2022]
Affiliation(s)
- R Fuchs
- European Molecular Biology Laboratory, Heidelberg, Germany
21
Abstract
Recent progress in studies on the bacterial chromosome is summarized. Although the greatest amount of information comes from studies on Escherichia coli, reports on studies of many other bacteria are also included. A compilation of the sizes of chromosomal DNAs as determined by pulsed-field electrophoresis is given, as well as a discussion of factors that affect gene dosage, including redundancy of chromosomes on the one hand and inactivation of chromosomes on the other hand. The distinction between a large plasmid and a second chromosome is discussed. Recent information on repeated sequences and chromosomal rearrangements is presented. The growing understanding of limitations on the rearrangements that can be tolerated by bacteria and those that cannot is summarized, and the sensitive region flanking the terminator loci is described. Sources and types of genetic variation in bacteria are listed, from simple single nucleotide mutations to intragenic and intergenic recombinations. A model depicting the dynamics of the evolution and genetic activity of the bacterial chromosome is described which entails acquisition by recombination of clonal segments within the chromosome. The model is consistent with the existence of only a few genetic types of E. coli worldwide. Finally, there is a summary of recent reports on lateral genetic exchange across great taxonomic distances, yet another source of genetic variation and innovation.
Affiliation(s)
- S Krawiec
- Department of Biology, Lehigh University, Bethlehem, Pennsylvania 18015
22
Hirvas L, Koski P, Vaara M. Primary structure and expression of the Ssc-protein of Salmonella typhimurium. Biochem Biophys Res Commun 1990; 173:53-9. [PMID: 2256935 DOI: 10.1016/s0006-291x(05)81020-4] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.5] [Indexed: 12/31/2022]
Abstract
A 1020-bp open reading frame (ORF) was found immediately downstream of the ompH gene of Salmonella typhimurium. This ORF (ORF-36) encodes a moderately hydrophobic protein with 341 amino acid residues (calculated molecular mass, 35,928 Da). The ORF-36 product was detected in minicells. Downstream of ORF-36, another ORF was found. It is highly homologous to the E. coli ORF (ORF-17.4) which precedes the lpx-genes involved in lipid A biosynthesis. ORF-36 is probably analogous to the firA gene of E. coli, the sequence of which has not yet been published. Thus it appears that the enterobacterial ompH and lpx genes are separated only by the ORF-36 and ORF-17.4 genes. We also discuss the data on the function of the ORF-36 protein. On this basis, we suggest that the protein could be called the Ssc protein.
Affiliation(s)
- L Hirvas
- Department of Bacteriology and Immunology, University of Helsinki, Finland
23
Sharples GJ, Lloyd RG. A novel repeated DNA sequence located in the intergenic regions of bacterial chromosomes. Nucleic Acids Res 1990; 18:6503-8. [PMID: 2251112 PMCID: PMC332602 DOI: 10.1093/nar/18.22.6503] [Citation(s) in RCA: 111] [Impact Index Per Article: 3.3] [Indexed: 12/31/2022]
Abstract
We report the discovery of a novel group of highly conserved DNA sequences located within the intergenic regions of the chromosomes of Escherichia coli, Salmonella typhimurium and other bacteria. These intergenic repeat units (IRUs) are 124-127 nucleotides long and have the potential to form stable stem-loop structures. The location of these sequences within the intergenic regions is variable with respect to known or putative signals for transcription and translation of the flanking genes. Some of the IRU sequences are transcribed, others are probably not. The structure and possible functions of these sequences are discussed in relation to palindromic units and other repeated DNA sequences in bacteria.
Affiliation(s)
- G J Sharples
- Department of Genetics, University of Nottingham, Medical School, Queens Medical Centre, UK