1
|
In silico discovery of the myxosortases that process MYXO-CTERM and three novel prokaryotic C-terminal protein-sorting signals that share invariant Cys residues. J Bacteriol 2024; 206:e0017323. [PMID: 38084967 PMCID: PMC10810001 DOI: 10.1128/jb.00173-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 10/10/2023] [Indexed: 01/26/2024] Open
Abstract
The LPXTG protein-sorting signal, found in surface proteins of various Gram-positive pathogens, was the founding member of a growing panel of prokaryotic small C-terminal sorting domains. Sortase A cleaves LPXTG, exosortases (XrtA and XrtB) cleave the PEP-CTERM sorting signal, archaeosortase A cleaves PGF-CTERM, and rhombosortase cleaves GlyGly-CTERM domains. Four sorting signal domains without previously known processing proteases are the MYXO-CTERM, JDVT-CTERM, Synerg-CTERM, and CGP-CTERM domains. These exhibit the standard tripartite architecture of a short signature motif, a hydrophobic transmembrane segment, and an Arg-rich cluster. Each has an invariant cysteine in its signature motif. Computational evidence strongly suggests that each of these four Cys-containing sorting signals is processed, at least in part, by a cognate family of glutamic-type intramembrane endopeptidases related to the eukaryotic type II CAAX-processing protease Rce1. For the MYXO-CTERM sorting signals of different lineages, their sorting enzymes, called myxosortases, include MrtX (MXAN_2755 in Myxococcus xanthus), MrtC, and MrtP, all with radically different N-terminal domains but with a conserved core. Related predicted sorting enzymes were also identified for JDVT-CTERM (MrtJ), Synerg-CTERM (MrtS), and CGP-CTERM (MrtA). This work establishes a major new family of protein-sorting housekeeping endopeptidases contributing to the surface attachment of proteins in prokaryotes. IMPORTANCE Homologs of the eukaryotic type II CAAX-box protease Rce1, a membrane-embedded endopeptidase found in yeast and human ER and involved in sorting proteins to their proper cellular locations, are abundant in prokaryotes but not well understood there. This bioinformatics paper identifies several subgroups of the family as cognate endopeptidases for four protein-sorting signals processed by previously unknown machinery. Sorting signals with newly identified processing enzymes include three novel ones, but also MYXO-CTERM, which had been the focus of previous experimental work in the model fruiting and gliding bacterium Myxococcus xanthus. The new findings will substantially improve our understanding of Cys-containing C-terminal protein-sorting signals and of protein trafficking generally in bacteria and archaea.
Collapse
|
2
|
RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes. Nucleic Acids Res 2024; 52:D762-D769. [PMID: 37962425 PMCID: PMC10767926 DOI: 10.1093/nar/gkad988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 10/13/2023] [Accepted: 10/18/2023] [Indexed: 11/15/2023] Open
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase in the past three years ago. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.
Collapse
|
3
|
Abstract
Antimicrobial resistance (AMR) is a significant public health threat. Low-cost whole-genome sequencing, which is often used in surveillance programmes, provides an opportunity to assess AMR gene content in these genomes using in silico approaches. A variety of bioinformatic tools have been developed to identify these genomic elements. Most of those tools rely on reference databases of nucleotide or protein sequences and collections of models and rules for analysis. While the tools are critical for the identification of AMR genes, the databases themselves also provide significant utility for researchers, for applications ranging from sequence analysis to information about AMR phenotypes. Additionally, these databases can be evaluated by domain experts and others to ensure their accuracy. Here we describe how we curate the genes, point mutations and blast rules, and hidden Markov models used in NCBI’s AMRFinderPlus, along with the quality-control steps we take to ensure database quality. We also describe the web interfaces that display the full structure of the database and their newly developed cross-browser relationships. Then, using the Reference Gene Catalog as an example, we detail how the databases, rules and models are made publicly available, as well as how to access the software. In addition, as part of the Pathogen Detection system, we have analysed over 1 million publicly available genomes using AMRFinderPlus and its databases. We discuss how the computed analyses generated by those tools can be accessed through a web interface. Finally, we conclude with NCBI’s plans to make these databases accessible over the long-term.
Collapse
|
4
|
Folding the unfoldable: using AlphaFold to explore spurious proteins. BIOINFORMATICS ADVANCES 2022; 2:vbab043. [PMID: 36699409 PMCID: PMC9710616 DOI: 10.1093/bioadv/vbab043] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/22/2021] [Revised: 12/07/2021] [Accepted: 12/14/2021] [Indexed: 01/28/2023]
Abstract
Motivation The release of AlphaFold 2.0 has revolutionized our ability to determine protein structures from sequences. This tool also inadvertently opens up many unanticipated opportunities. In this article, we investigate the AntiFam resource, which contains 250 protein sequence families that we believe to be spurious protein translations. We would not expect proteins belonging to these families to fold into well-ordered globular structures. To test this hypothesis, we have attempted to computationally determine the structure of a representative sequence from all AntiFam 6.0 families. Results Although the large majority of families showed no evidence of globular structure, we have identified one example for which a globular structure is predicted. Proteins in this AntiFam entry indeed seem likely to be bona fide proteins, based on additional considerations, and thus AlphaFold provides a useful quality control for the AntiFam database. Conversely, known spurious proteins offer useful set of quality controls for AlphaFold. We have identified a trend that the mean structure prediction confidence score pLDDT is higher for shorter sequences. Of the 131 AntiFam representative sequences <100 amino acids in length, AlphaFold predicts a mean pLDDT of 80 or greater for six of them. Thus, particular care should be taken when applying AlphaFold to short protein sequences. Availability and implementation The AlphaFold predictions for representative sequences can be found at the following URL: https://drive.google.com/drive/folders/1u9OocRIAabGQn56GljoG1JTDAxjkY1ro. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
|
5
|
AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep 2021; 11:12728. [PMID: 34135355 PMCID: PMC8208984 DOI: 10.1038/s41598-021-91456-0] [Citation(s) in RCA: 296] [Impact Index Per Article: 98.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2021] [Accepted: 05/19/2021] [Indexed: 12/26/2022] Open
Abstract
Antimicrobial resistance (AMR) is a significant public health threat. With the rise of affordable whole genome sequencing, in silico approaches to assessing AMR gene content can be used to detect known resistance mechanisms and potentially identify novel mechanisms. To enable accurate assessment of AMR gene content, as part of a multi-agency collaboration, NCBI developed a comprehensive AMR gene database, the Bacterial Antimicrobial Resistance Reference Gene Database and the AMR gene detection tool AMRFinder. Here, we describe the expansion of the Reference Gene Database, now called the Reference Gene Catalog, to include putative acid, biocide, metal, stress resistance genes, in addition to virulence genes and species-specific point mutations. Genes and point mutations are classified by broad functions, as well as more detailed functions. As we have expanded both the functional repertoire of identified genes and functionality, NCBI released a new version of AMRFinder, known as AMRFinderPlus. This new tool allows users the option to utilize only the core set of AMR elements, or include stress response and virulence genes, too. AMRFinderPlus can detect acquired genes and point mutations in both protein and nucleotide sequence. In addition, the evidence used to identify the gene has been expanded to include whether nucleotide or protein sequence was used, its location in the contig, and presence of an internal stop codon. These database improvements and functional expansions will enable increased precision in identifying AMR genes, linking AMR genotypes and phenotypes, and determining possible relationships between AMR, virulence, and stress response.
Collapse
|
6
|
The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 2021; 49:D344-D354. [PMID: 33156333 PMCID: PMC7778928 DOI: 10.1093/nar/gkaa977] [Citation(s) in RCA: 1048] [Impact Index Per Article: 349.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/08/2020] [Accepted: 10/23/2020] [Indexed: 01/22/2023] Open
Abstract
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Collapse
|
7
|
RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation. Nucleic Acids Res 2021; 49:D1020-D1028. [PMID: 33270901 PMCID: PMC7779008 DOI: 10.1093/nar/gkaa1105] [Citation(s) in RCA: 433] [Impact Index Per Article: 144.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Revised: 10/19/2020] [Accepted: 11/02/2020] [Indexed: 11/14/2022] Open
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Collapse
|
8
|
Evolutionary classification of CRISPR–Cas systems: a burst of class 2 and derived variants. Nat Rev Microbiol 2019; 18:67-83. [DOI: 10.1038/s41579-019-0299-x] [Citation(s) in RCA: 797] [Impact Index Per Article: 159.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/04/2019] [Indexed: 12/16/2022]
|
9
|
RefSeq: an update on prokaryotic genome annotation and curation. Nucleic Acids Res 2019; 46:D851-D860. [PMID: 29112715 PMCID: PMC5753331 DOI: 10.1093/nar/gkx1068] [Citation(s) in RCA: 615] [Impact Index Per Article: 123.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2017] [Accepted: 10/25/2017] [Indexed: 12/31/2022] Open
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) provides annotation for over 95 000 prokaryotic genomes that meet standards for sequence quality, completeness, and freedom from contamination. Genomes are annotated by a single Prokaryotic Genome Annotation Pipeline (PGAP) to provide users with a resource that is as consistent and accurate as possible. Notable recent changes include the development of a hierarchical evidence scheme, a new focus on curating annotation evidence sources, the addition and curation of protein profile hidden Markov models (HMMs), release of an updated pipeline (PGAP-4), and comprehensive re-annotation of RefSeq prokaryotic genomes. Antimicrobial resistance proteins have been reannotated comprehensively, improved structural annotation of insertion sequence transposases and selenoproteins is provided, curated complex domain architectures have given upgraded names to millions of multidomain proteins, and we introduce a new kind of annotation rule—BlastRules. Continual curation of supporting evidence, and propagation of improved names onto RefSeq proteins ensures that the functional annotation of genomes is kept current. An increasing share of our annotation now derives from HMMs and other sets of annotation rules that are portable by nature, and available for download and for reuse by other investigators. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Collapse
|
10
|
Proposal for assignment of allele numbers for mobile colistin resistance (mcr) genes. J Antimicrob Chemother 2018; 73:2625-2630. [PMID: 30053115 PMCID: PMC6148208 DOI: 10.1093/jac/dky262] [Citation(s) in RCA: 81] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
Abstract
The initial report of the mcr-1 (mobile colistin resistance) gene has led to many reports of mcr-1 variants and other mcr genes from different bacterial species originating from human, animal and environmental samples in different geographical locations. Resistance gene nomenclature is complex and unfortunately problems such as different names being used for the same gene/protein or the same name being used for different genes/proteins are not uncommon. Registries exist for some families, such as bla (β-lactamase) genes, but there is as yet no agreed nomenclature scheme for mcr genes. The National Center for Biotechnology Information (NCBI) recently took over assigning bla allele numbers from the longstanding Lahey β-lactamase website and has agreed to do the same for mcr genes. Here, we propose a nomenclature scheme that we hope will be acceptable to researchers in this area and that will reduce future confusion.
Collapse
|
11
|
Both widespread PEP-CTERM proteins and exopolysaccharides are required for floc formation of Zoogloea resiniphila and other activated sludge bacteria. Environ Microbiol 2018; 20:1677-1692. [PMID: 29473278 DOI: 10.1111/1462-2920.14080] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2017] [Revised: 02/12/2018] [Accepted: 02/19/2018] [Indexed: 12/28/2022]
Abstract
Bacterial floc formation plays a central role in the activated sludge (AS) process, which has been widely utilized for sewage and wastewater treatment. The formation of AS flocs has long been known to require exopolysaccharide biosynthesis. This study demonstrates an additional requirement for a PEP-CTERM protein in Zoogloea resiniphila, a dominant AS bacterium harboring a large exopolysaccharide biosynthesis gene cluster. Two members of a wide-spread family of high copy number-per-genome PEP-CTERM genes, transcriptionally regulated by the RpoN sigma factor and accessory PrsK-PrsR two-component system and at least one of these, pepA, must be expressed for Zoogloea to build the floc structures that allow gravitational sludge settling and recycling. Without PrsK or PrsR, Zoogloea cells were planktonic rather than flocculated and secreted exopolysaccharides were released into the growth broth in soluble form. Overexpression of PepA could circumvent the requirement of rpoN, prsK and prsR for the floc-forming phenotype by fixing the exopolysaccharides to bacterial cells. However, overexpression of PepA, which underwent post-translational modifications, could not rescue the long-rod morphology of the rpoN mutant. Consistently, PEP-CTERM genes and exopolysaccharide biosynthesis gene cluster are present in the genome of the floc-forming Nitrospira comammox and Mitsuaria strain as well as many other AS bacteria.
Collapse
|
12
|
A comprehensive software suite for protein family construction and functional site prediction. PLoS One 2017; 12:e0171758. [PMID: 28182651 PMCID: PMC5300114 DOI: 10.1371/journal.pone.0171758] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2016] [Accepted: 01/25/2017] [Indexed: 11/18/2022] Open
Abstract
In functionally diverse protein families, conservation in short signature regions may outperform full-length sequence comparisons for identifying proteins that belong to a subgroup within which one specific aspect of their function is conserved. The SIMBAL workflow (Sites Inferred by Metabolic Background Assertion Labeling) is a data-mining procedure for finding such signature regions. It begins by using clues from genomic context, such as co-occurrence or conserved gene neighborhoods, to build a useful training set from a large number of uncharacterized but mutually homologous proteins. When training set construction is successful, the YES partition is enriched in proteins that share function with the user’s query sequence, while the NO partition is depleted. A selected query sequence is then mined for short signature regions whose closest matches overwhelmingly favor proteins from the YES partition. High-scoring signature regions typically contain key residues critical to functional specificity, so proteins with the highest sequence similarity across these regions tend to share the same function. The SIMBAL algorithm was described previously, but significant manual effort, expertise, and a supporting software infrastructure were required to prepare the requisite training sets. Here, we describe a new, distributable software suite that speeds up and simplifies the process for using SIMBAL, most notably by providing tools that automate training set construction. These tools have broad utility for comparative genomics, allowing for flexible collection of proteins or protein domains based on genomic context as well as homology, a capability that can greatly assist in protein family construction. Armed with this new software suite, SIMBAL can serve as a fast and powerful in silico alternative to direct experimentation for characterizing proteins and their functional interactions.
Collapse
|
13
|
Mycofactocin-associated mycobacterial dehydrogenases with non-exchangeable NAD cofactors. Sci Rep 2017; 7:41074. [PMID: 28120876 PMCID: PMC5264612 DOI: 10.1038/srep41074] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2016] [Accepted: 12/12/2016] [Indexed: 01/08/2023] Open
Abstract
During human infection, Mycobacterium tuberculosis (Mtb) survives the normally bacteriocidal phagosome of macrophages. Mtb and related species may be able to combat this harsh acidic environment which contains reactive oxygen species due to the mycobacterial genomes encoding a large number of dehydrogenases. Typically, dehydrogenase cofactor binding sites are open to solvent, which allows NAD/NADH exchange to support multiple turnover. Interestingly, mycobacterial short chain dehydrogenases/reductases (SDRs) within family TIGR03971 contain an insertion at the NAD binding site. Here we present crystal structures of 9 mycobacterial SDRs in which the insertion buries the NAD cofactor except for a small portion of the nicotinamide ring. Line broadening and STD-NMR experiments did not show NAD or NADH exchange on the NMR timescale. STD-NMR demonstrated binding of the potential substrate carveol, the potential product carvone, the inhibitor tricyclazol, and an external redox partner 2,6-dichloroindophenol (DCIP). Therefore, these SDRs appear to contain a non-exchangeable NAD cofactor and may rely on an external redox partner, rather than cofactor exchange, for multiple turnover. Incidentally, these genes always appear in conjunction with the mftA gene, which encodes the short peptide MftA, and with other genes proposed to convert MftA into the external redox partner mycofactocin.
Collapse
|
14
|
What Makes a Bacterial Species Pathogenic?:Comparative Genomic Analysis of the Genus Leptospira. PLoS Negl Trop Dis 2016; 10:e0004403. [PMID: 26890609 PMCID: PMC4758666 DOI: 10.1371/journal.pntd.0004403] [Citation(s) in RCA: 194] [Impact Index Per Article: 24.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2015] [Accepted: 01/03/2016] [Indexed: 12/20/2022] Open
Abstract
Leptospirosis, caused by spirochetes of the genus Leptospira, is a globally widespread, neglected and emerging zoonotic disease. While whole genome analysis of individual pathogenic, intermediately pathogenic and saprophytic Leptospira species has been reported, comprehensive cross-species genomic comparison of all known species of infectious and non-infectious Leptospira, with the goal of identifying genes related to pathogenesis and mammalian host adaptation, remains a key gap in the field. Infectious Leptospira, comprised of pathogenic and intermediately pathogenic Leptospira, evolutionarily diverged from non-infectious, saprophytic Leptospira, as demonstrated by the following computational biology analyses: 1) the definitive taxonomy and evolutionary relatedness among all known Leptospira species; 2) genomically-predicted metabolic reconstructions that indicate novel adaptation of infectious Leptospira to mammals, including sialic acid biosynthesis, pathogen-specific porphyrin metabolism and the first-time demonstration of cobalamin (B12) autotrophy as a bacterial virulence factor; 3) CRISPR/Cas systems demonstrated only to be present in pathogenic Leptospira, suggesting a potential mechanism for this clade's refractoriness to gene targeting; 4) finding Leptospira pathogen-specific specialized protein secretion systems; 5) novel virulence-related genes/gene families such as the Virulence Modifying (VM) (PF07598 paralogs) proteins and pathogen-specific adhesins; 6) discovery of novel, pathogen-specific protein modification and secretion mechanisms including unique lipoprotein signal peptide motifs, Sec-independent twin arginine protein secretion motifs, and the absence of certain canonical signal recognition particle proteins from all Leptospira; and 7) and demonstration of infectious Leptospira-specific signal-responsive gene expression, motility and chemotaxis systems. By identifying large scale changes in infectious (pathogenic and intermediately pathogenic) vs. non-infectious Leptospira, this work provides new insights into the evolution of a genus of bacterial pathogens. This work will be a comprehensive roadmap for understanding leptospirosis pathogenesis. More generally, it provides new insights into mechanisms by which bacterial pathogens adapt to mammalian hosts.
Collapse
|
15
|
Abstract
The evolution of CRISPR-cas loci, which encode adaptive immune systems in archaea and bacteria, involves rapid changes, in particular numerous rearrangements of the locus architecture and horizontal transfer of complete loci or individual modules. These dynamics complicate straightforward phylogenetic classification, but here we present an approach combining the analysis of signature protein families and features of the architecture of cas loci that unambiguously partitions most CRISPR-cas loci into distinct classes, types and subtypes. The new classification retains the overall structure of the previous version but is expanded to now encompass two classes, five types and 16 subtypes. The relative stability of the classification suggests that the most prevalent variants of CRISPR-Cas systems are already known. However, the existence of rare, currently unclassifiable variants implies that additional types and subtypes remain to be characterized.
Collapse
|
16
|
Creating a specialist protein resource network: a meeting report for the protein bioinformatics and community resources retreat. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2015; 2015:bav063. [PMID: 26284514 PMCID: PMC4499208 DOI: 10.1093/database/bav063] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/07/2015] [Revised: 05/14/2015] [Accepted: 05/18/2015] [Indexed: 11/14/2022]
Abstract
During 11–12 August 2014, a Protein Bioinformatics and Community Resources Retreat was held at the Wellcome Trust Genome Campus in Hinxton, UK. This meeting brought together the principal investigators of several specialized protein resources (such as CAZy, TCDB and MEROPS) as well as those from protein databases from the large Bioinformatics centres (including UniProt and RefSeq). The retreat was divided into five sessions: (1) key challenges, (2) the databases represented, (3) best practices for maintenance and curation, (4) information flow to and from large data centers and (5) communication and funding. An important outcome of this meeting was the creation of a Specialist Protein Resource Network that we believe will improve coordination of the activities of its member resources. We invite further protein database resources to join the network and continue the dialogue.
Collapse
|
17
|
Using comparative genomics to drive new discoveries in microbiology. Curr Opin Microbiol 2015; 23:189-96. [PMID: 25617609 DOI: 10.1016/j.mib.2014.11.017] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2014] [Revised: 11/19/2014] [Accepted: 11/20/2014] [Indexed: 01/17/2023]
Abstract
Bioinformatics looks to many microbiologists like a service industry. In this view, annotation starts with what is known from experiments in the lab, makes reasonable inferences of which genes match other genes in function, builds databases to make all that we know accessible, but creates nothing truly new. Experiments lead, then biocuration and computational biology follow. But the astounding success of genome sequencing is changing the annotation paradigm. Every genome sequenced is an intercepted coded message from the microbial world, and as all cryptographers know, it is easier to decode a thousand messages than a single message. Some biology is best discovered not by phenomenology, but by decoding genome content, forming hypotheses, and doing the first few rounds of validation computationally. Through such reasoning, a role and function may be assigned to a protein with no sequence similarity to any protein yet studied. Experimentation can follow after the discovery to cement and to extend the findings. Unfortunately, this approach remains so unfamiliar to most bench scientists that lab work and comparative genomics typically segregate to different teams working on unconnected projects. This review will discuss several themes in comparative genomics as a discovery method, including highly derived data, use of patterns of design to reason by analogy, and in silico testing of computationally generated hypotheses.
Collapse
|
18
|
Simultaneous non-contiguous deletions using large synthetic DNA and site-specific recombinases. Nucleic Acids Res 2014; 42:e111. [PMID: 24914053 PMCID: PMC4132700 DOI: 10.1093/nar/gku509] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Toward achieving rapid and large scale genome modification directly in a target organism, we have developed a new genome engineering strategy that uses a combination of bioinformatics aided design, large synthetic DNA and site-specific recombinases. Using Cre recombinase we swapped a target 126-kb segment of the Escherichia coli genome with a 72-kb synthetic DNA cassette, thereby effectively eliminating over 54 kb of genomic DNA from three non-contiguous regions in a single recombination event. We observed complete replacement of the native sequence with the modified synthetic sequence through the action of the Cre recombinase and no competition from homologous recombination. Because of the versatility and high-efficiency of the Cre-lox system, this method can be used in any organism where this system is functional as well as adapted to use with other highly precise genome engineering systems. Compared to present-day iterative approaches in genome engineering, we anticipate this method will greatly speed up the creation of reduced, modularized and optimized genomes through the integration of deletion analyses data, transcriptomics, synthetic biology and site-specific recombination.
Collapse
|
19
|
Pathogenomic inference of virulence-associated genes in Leptospira interrogans. PLoS Negl Trop Dis 2013; 7:e2468. [PMID: 24098822 PMCID: PMC3789758 DOI: 10.1371/journal.pntd.0002468] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2013] [Accepted: 08/23/2013] [Indexed: 11/18/2022] Open
Abstract
Leptospirosis is a globally important, neglected zoonotic infection caused by spirochetes of the genus Leptospira. Since genetic transformation remains technically limited for pathogenic Leptospira, a systems biology pathogenomic approach was used to infer leptospiral virulence genes by whole genome comparison of culture-attenuated Leptospira interrogans serovar Lai with its virulent, isogenic parent. Among the 11 pathogen-specific protein-coding genes in which non-synonymous mutations were found, a putative soluble adenylate cyclase with host cell cAMP-elevating activity, and two members of a previously unstudied ∼15 member paralogous gene family of unknown function were identified. This gene family was also uniquely found in the alpha-proteobacteria Bartonella bacilliformis and Bartonella australis that are geographically restricted to the Andes and Australia, respectively. How the pathogenic Leptospira and these two Bartonella species came to share this expanded gene family remains an evolutionary mystery. In vivo expression analyses demonstrated up-regulation of 10/11 Leptospira genes identified in the attenuation screen, and profound in vivo, tissue-specific up-regulation by members of the paralogous gene family, suggesting a direct role in virulence and host-pathogen interactions. The pathogenomic experimental design here is generalizable as a functional systems biology approach to studying bacterial pathogenesis and virulence and should encourage similar experimental studies of other pathogens.
Collapse
|
20
|
Methanobactin and MmoD work in concert to act as the 'copper-switch' in methanotrophs. Environ Microbiol 2013; 15:3077-86. [PMID: 23682956 DOI: 10.1111/1462-2920.12150] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2013] [Accepted: 04/21/2013] [Indexed: 10/26/2022]
Abstract
Biological oxidation of methane to methanol by aerobic bacteria is catalysed by two different enzymes, the cytoplasmic or soluble methane monooxygenase (sMMO) and the membrane-bound or particulate methane monooxygenase (pMMO). Expression of MMOs is controlled by a 'copper-switch', i.e. sMMO is only expressed at very low copper : biomass ratios, while pMMO expression increases as this ratio increases. Methanotrophs synthesize a chalkophore, methanobactin, for the binding and import of copper. Previous work suggested that methanobactin was formed from a polypeptide precursor. Here we report that deletion of the gene suspected to encode for this precursor, mbnA, in Methylosinus trichosporium OB3b, abolishes methanobactin production. Further, gene expression assays indicate that methanobactin, together with another polypeptide of previously unknown function, MmoD, play key roles in regulating expression of MMOs. Based on these data, we propose a general model explaining how expression of the MMO operons is regulated by copper, methanobactin and MmoD. The basis of the 'copper-switch' is MmoD, and methanobactin amplifies the magnitude of the switch. Bioinformatic analysis of bacterial genomes indicates that the production of methanobactin-like compounds is not confined to methanotrophs, suggesting that its use as a metal-binding agent and/or role in gene regulation may be widespread in nature.
Collapse
|
21
|
Ribosomally synthesized and post-translationally modified peptide natural products: overview and recommendations for a universal nomenclature. Nat Prod Rep 2013; 30:108-60. [PMID: 23165928 DOI: 10.1039/c2np20085f] [Citation(s) in RCA: 1423] [Impact Index Per Article: 129.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
Abstract
This review presents recommended nomenclature for the biosynthesis of ribosomally synthesized and post-translationally modified peptides (RiPPs), a rapidly growing class of natural products. The current knowledge regarding the biosynthesis of the >20 distinct compound classes is also reviewed, and commonalities are discussed.
Collapse
|
22
|
Abstract
TIGRFAMs, available online at http://www.jcvi.org/tigrfams is a database of protein family definitions. Each entry features a seed alignment of trusted representative sequences, a hidden Markov model (HMM) built from that alignment, cutoff scores that let automated annotation pipelines decide which proteins are members, and annotations for transfer onto member proteins. Most TIGRFAMs models are designated equivalog, meaning they assign a specific name to proteins conserved in function from a common ancestral sequence. Models describing more functionally heterogeneous families are designated subfamily or domain, and assign less specific but more widely applicable annotations. The Genome Properties database, available at http://www.jcvi.org/genome-properties, specifies how computed evidence, including TIGRFAMs HMM results, should be used to judge whether an enzymatic pathway, a protein complex or another type of molecular subsystem is encoded in a genome. TIGRFAMs and Genome Properties content are developed in concert because subsystems reconstruction for large numbers of genomes guides selection of seed alignment sequences and cutoff values during protein family construction. Both databases specialize heavily in bacterial and archaeal subsystems. At present, 4284 models appear in TIGRFAMs, while 628 systems are described by Genome Properties. Content derives both from subsystem discovery work and from biocuration of the scientific literature.
Collapse
|
23
|
|
24
|
Cell contact-dependent outer membrane exchange in myxobacteria: genetic determinants and mechanism. PLoS Genet 2012; 8:e1002626. [PMID: 22511878 PMCID: PMC3325183 DOI: 10.1371/journal.pgen.1002626] [Citation(s) in RCA: 88] [Impact Index Per Article: 7.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2011] [Accepted: 02/14/2012] [Indexed: 01/15/2023] Open
Abstract
Biofilms are dense microbial communities. Although widely distributed and medically important, how biofilm cells interact with one another is poorly understood. Recently, we described a novel process whereby myxobacterial biofilm cells exchange their outer membrane (OM) lipoproteins. For the first time we report here the identification of two host proteins, TraAB, required for transfer. These proteins are predicted to localize in the cell envelope; and TraA encodes a distant PA14 lectin-like domain, a cysteine-rich tandem repeat region, and a putative C-terminal protein sorting tag named MYXO-CTERM, while TraB encodes an OmpA-like domain. Importantly, TraAB are required in donors and recipients, suggesting bidirectional transfer. By use of a lipophilic fluorescent dye, we also discovered that OM lipids are exchanged. Similar to lipoproteins, dye transfer requires TraAB function, gliding motility and a structured biofilm. Importantly, OM exchange was found to regulate swarming and development behaviors, suggesting a new role in cell–cell communication. A working model proposes TraA is a cell surface receptor that mediates cell–cell adhesion for OM fusion, in which lipoproteins/lipids are transferred by lateral diffusion. We further hypothesize that cell contact–dependent exchange helps myxobacteria to coordinate their social behaviors. All cells interact with their environment, including other cells, to elicit cellular responses. Cell–cell interactions between eukaryotic cells are widely appreciated as large multicellular organisms coordinate cell behaviors for tissue and organ functions. In bacteria cell–cell interactions are not widely appreciated, as these organisms are relatively simple and are often depicted as single-cell entities. However, over the past decade, the concept of bacteria living in microbial communities or biofilms has received broad acceptance as a major lifestyle. As biofilm cells are packed in tight physical contact, there is an opportunity for cell–cell signaling to provide spatial and physiological clues of neighboring cells to elicit cellular responses. Although much has been learned about diffusible signals through quorum sensing, little is known about cell contact–dependent signaling in bacteria. In this report we describe a new mechanism where bacterial cells within structured biofilms form contacts that allow cellular material to be exchanged. This exchange elicits phenotypic changes, including in cell movements and development. We hypothesize that OM exchange involves kin recognition that bestows social benefits to myxobacterial populations.
Collapse
|
25
|
AntiFam: a tool to help identify spurious ORFs in protein annotation. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas003. [PMID: 22434837 PMCID: PMC3308159 DOI: 10.1093/database/bas003] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
As the deluge of genomic DNA sequence grows the fraction of protein sequences that have been manually curated falls. In turn, as the number of laboratories with the ability to sequence genomes in a high-throughput manner grows, the informatics capability of those labs to accurately identify and annotate all genes within a genome may often be lacking. These issues have led to fears about transitive annotation errors making sequence databases less reliable. During the lifetime of the Pfam protein families database a number of protein families have been built, which were later identified as composed solely of spurious open reading frames (ORFs) either on the opposite strand or in a different, overlapping reading frame with respect to the true protein-coding or non-coding RNA gene. These families were deleted and are no longer available in Pfam. However, we realized that these may perform a useful function to identify new spurious ORFs. We have collected these families together in AntiFam along with additional custom-made families of spurious ORFs. This resource currently contains 23 families that identified 1310 spurious proteins in UniProtKB and a further 4119 spurious proteins in a collection of metagenomic sequences. UniProt has adopted AntiFam as a part of the UniProtKB quality control process and will investigate these spurious proteins for exclusion.
Collapse
|
26
|
GlyGly-CTERM and rhombosortase: a C-terminal protein processing signal in a many-to-one pairing with a rhomboid family intramembrane serine protease. PLoS One 2011; 6:e28886. [PMID: 22194940 PMCID: PMC3237569 DOI: 10.1371/journal.pone.0028886] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2011] [Accepted: 11/16/2011] [Indexed: 11/30/2022] Open
Abstract
The rhomboid family of serine proteases occurs in all domains of life. Its members contain at least six hydrophobic membrane-spanning helices, with an active site serine located deep within the hydrophobic interior of the plasma membrane. The model member GlpG from Escherichia coli is heavily studied through engineered mutant forms, varied model substrates, and multiple X-ray crystal studies, yet its relationship to endogenous substrates is not well understood. Here we describe an apparent membrane anchoring C-terminal homology domain that appears in numerous genera including Shewanella, Vibrio, Acinetobacter, and Ralstonia, but excluding Escherichia and Haemophilus. Individual genomes encode up to thirteen members, usually homologous to each other only in this C-terminal region. The domain's tripartite architecture consists of motif, transmembrane helix, and cluster of basic residues at the protein C-terminus, as also seen with the LPXTG recognition sequence for sortase A and the PEP-CTERM recognition sequence for exosortase. Partial Phylogenetic Profiling identifies a distinctive rhomboid-like protease subfamily almost perfectly co-distributed with this recognition sequence. This protease subfamily and its putative target domain are hereby renamed rhombosortase and GlyGly-CTERM, respectively. The protease and target are encoded by consecutive genes in most genomes with just a single target, but far apart otherwise. The signature motif of the Rhombo-CTERM domain, often SGGS, only partially resembles known cleavage sites of rhomboid protease family model substrates. Some protein families that have several members with C-terminal GlyGly-CTERM domains also have additional members with LPXTG or PEP-CTERM domains instead, suggesting there may be common themes to the post-translational processing of these proteins by three different membrane protein superfamilies.
Collapse
|
27
|
ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process. BMC Bioinformatics 2011; 12:434. [PMID: 22070167 PMCID: PMC3226654 DOI: 10.1186/1471-2105-12-434] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2011] [Accepted: 11/09/2011] [Indexed: 12/02/2022] Open
Abstract
Background Phylogenetic profiling is a technique of scoring co-occurrence between a protein family and some other trait, usually another protein family, across a set of taxonomic groups. In spite of several refinements in recent years, the technique still invites significant improvement. To be its most effective, a phylogenetic profiling algorithm must be able to examine co-occurrences among protein families whose boundaries are uncertain within large homologous protein superfamilies. Results Partial Phylogenetic Profiling (PPP) is an iterative algorithm that scores a given taxonomic profile against the taxonomic distribution of families for all proteins in a genome. The method works through optimizing the boundary of each protein family, rather than by relying on prebuilt protein families or fixed sequence similarity thresholds. Double Partial Phylogenetic Profiling (DPPP) is a related procedure that begins with a single sequence and searches for optimal granularities for its surrounding protein family in order to generate the best query profiles for PPP. We present ProPhylo, a high-performance software package for phylogenetic profiling studies through creating individually optimized protein family boundaries. ProPhylo provides precomputed databases for immediate use and tools for manipulating the taxonomic profiles used as queries. Conclusion ProPhylo results show universal markers of methanogenesis, a new DNA phosphorothioation-dependent restriction enzyme, and efficacy in guiding protein family construction. The software and the associated databases are freely available under the open source Perl Artistic License from ftp://ftp.jcvi.org/pub/data/ppp/.
Collapse
|
28
|
Bioinformatic evidence for a widely distributed, ribosomally produced electron carrier precursor, its maturation proteins, and its nicotinoprotein redox partners. BMC Genomics 2011; 12:21. [PMID: 21223593 PMCID: PMC3023750 DOI: 10.1186/1471-2164-12-21] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/22/2010] [Accepted: 01/11/2011] [Indexed: 11/10/2022] Open
Abstract
Background Enzymes in the radical SAM (rSAM) domain family serve in a wide variety of biological processes, including RNA modification, enzyme activation, bacteriocin core peptide maturation, and cofactor biosynthesis. Evolutionary pressures and relationships to other cellular constituents impose recognizable grammars on each class of rSAM-containing system, shaping patterns in results obtained through various comparative genomics analyses. Results An uncharacterized gene cluster found in many Actinobacteria and sporadically in Firmicutes, Chloroflexi, Deltaproteobacteria, and one Archaeal plasmid contains a PqqE-like rSAM protein family that includes Rv0693 from Mycobacterium tuberculosis. Members occur clustered with a strikingly well-conserved small polypeptide we designate "mycofactocin," similar in size to bacteriocins and PqqA, precursor of pyrroloquinoline quinone (PQQ). Partial Phylogenetic Profiling (PPP) based on the distribution of these markers identifies the mycofactocin cluster, but also a second tier of high-scoring proteins. This tier, strikingly, is filled with up to thirty-one members per genome from three variant subfamilies that occur, one each, in three unrelated classes of nicotinoproteins. The pattern suggests these variant enzymes require not only NAD(P), but also the novel gene cluster. Further study was conducted using SIMBAL, a PPP-like tool, to search these nicotinoproteins for subsequences best correlated across multiple genomes to the presence of mycofactocin. For both the short chain dehydrogenase/reductase (SDR) and iron-containing dehydrogenase families, aligning SIMBAL's top-scoring sequences to homologous solved crystal structures shows signals centered over NAD(P)-binding sites rather than over substrate-binding or active site residues. Previous studies on some of these proteins have revealed a non-exchangeable NAD cofactor, such that enzymatic activity in vitro requires an artificial electron acceptor such as N,N-dimethyl-4-nitrosoaniline (NDMA) for the enzyme to cycle. Conclusions Taken together, these findings suggest that the mycofactocin precursor is modified by the Rv0693 family rSAM protein and other enzymes in its cluster. It becomes an electron carrier molecule that serves in vivo as NDMA and other artificial electron acceptors do in vitro. Subclasses from three different nicotinoprotein families show "only-if" relationships to mycofactocin because they require its presence. This framework suggests a segregated redox pool in which mycofactocin mediates communication among enzymes with non-exchangeable cofactors.
Collapse
|
29
|
A comparison of methanobactins from Methylosinus trichosporium OB3b and Methylocystis strain Sb2 predicts methanobactins are synthesized from diverse peptide precursors modified to create a common core for binding and reducing copper ions. Biochemistry 2010; 49:10117-30. [PMID: 20961038 DOI: 10.1021/bi1014375] [Citation(s) in RCA: 76] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Methanobactins (mb) are low-molecular mass, copper-binding molecules secreted by most methanotrophic bacteria. These molecules have been identified for a number of methanotrophs, but only the one produced by Methylosinus trichosporium OB3b (mb-OB3b) has to date been chemically characterized. Here we report the chemical characterization and copper binding properties of a second methanobactin, which is produced by Methylocystis strain SB2 (mb-SB2). mb-SB2 shows some significant similarities to mb-OB3b, including its spectral and metal binding properties, and its ability to bind and reduce Cu(II) to Cu(I). Like mb-OB3b, mb-SB2 contains two five-member heterocyclic rings with associated enethiol groups, which together form the copper ion binding site. mb-SB2 also displays some significant differences compared to mb-OB3b, including the number and types of amino acids used to complete the structure of the molecule, the presence of an imidazolone ring in place of one of the oxazolone rings found in mb-OB3b, and the presence of a sulfate group not found in mb-OB3b. The sulfate is bonded to a threonine-like side chain that is associated with one of the heterocyclic rings and may represent the first example of this type of sulfate group found in a bacterially derived peptide. Acid-catalyzed hydrolysis and decarboxylation of the oxazolone rings found in mb-OB3b and mb-SB2 produce pairs of amino acid residues and suggest that both mb-OB3b and mb-SB2 are derived from peptides. In support of this, the gene for a ribosomally produced peptide precursor for mb-OB3b has been identified in the genome of M. trichosporium OB3b. The gene sequence indicates that the oxazolone rings in mb-OB3b are derived from the combination of a cysteine residue and the carbonyl from the preceding residue in the peptide sequence. Taken together, the results suggest methanobactins make up a structurally diverse group of ribosomally produced, peptide-derived molecules, which share a common pair of five-member rings with associated enethiol groups that are able to bind, reduce, and stabilize copper ions in an aqueous environment.
Collapse
|
30
|
Expansion of ribosomally produced natural products: a nitrile hydratase- and Nif11-related precursor family. BMC Biol 2010; 8:70. [PMID: 20500830 PMCID: PMC2887384 DOI: 10.1186/1741-7007-8-70] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2010] [Accepted: 05/25/2010] [Indexed: 11/18/2022] Open
Abstract
Background A new family of natural products has been described in which cysteine, serine and threonine from ribosomally-produced peptides are converted to thiazoles, oxazoles and methyloxazoles, respectively. These metabolites and their biosynthetic gene clusters are now referred to as thiazole/oxazole-modified microcins (TOMM). As exemplified by microcin B17 and streptolysin S, TOMM precursors contain an N-terminal leader sequence and C-terminal core peptide. The leader sequence contains binding sites for the posttranslational modifying enzymes which subsequently act upon the core peptide. TOMM peptides are small and highly variable, frequently missed by gene-finders and occasionally situated far from the thiazole/oxazole forming genes. Thus, locating a substrate for a particular TOMM pathway can be a challenging endeavor. Results Examination of candidate TOMM precursors has revealed a subclass with an uncharacteristically long leader sequence closely related to the enzyme nitrile hydratase. Members of this nitrile hydratase leader peptide (NHLP) family lack the metal-binding residues required for catalysis. Instead, NHLP sequences display the classic Gly-Gly cleavage motif and have C-terminal regions rich in heterocyclizable residues. The NHLP family exhibits a correlated species distribution and local clustering with an ABC transport system. This study also provides evidence that a separate family, annotated as Nif11 nitrogen-fixing proteins, can serve as natural product precursors (N11P), but not always of the TOMM variety. Indeed, a number of cyanobacterial genomes show extensive N11P paralogous expansion, such as Nostoc, Prochlorococcus and Cyanothece, which replace the TOMM cluster with lanthionine biosynthetic machinery. Conclusions This study has united numerous TOMM gene clusters with their cognate substrates. These results suggest that two large protein families, the nitrile hydratases and Nif11, have been retailored for secondary metabolism. Precursors for TOMMs and lanthionine-containing peptides derived from larger proteins to which other functions are attributed, may be widespread. The functions of these natural products have yet to be elucidated, but it is probable that some will display valuable industrial or medical activities.
Collapse
|
31
|
Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL): adapting the Partial Phylogenetic Profiling algorithm to scan sequences for signatures that predict protein function. BMC Bioinformatics 2010; 11:52. [PMID: 20102603 PMCID: PMC3098086 DOI: 10.1186/1471-2105-11-52] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2009] [Accepted: 01/26/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets. RESULTS Here we illustrate and validate the ability of the SIMBAL method to find functionally relevant short sequence signatures by application to two well-characterized protein families. In the first example, we partitioned a family of ABC permeases using a metabolic background property (urea utilization). Thus, the TRUE set for this family comprised members whose genome of origin encoded a urea utilization system. By moving a sliding window across the sequence of a permease, and searching each subsequence in turn against the full set of partitioned proteins, the method found which local sequence signatures best correlated with the urea utilization trait. Mapping of SIMBAL "hot spots" onto crystal structures of homologous permeases reveals that the significant sites are gating determinants on the cytosolic face rather than, say, docking sites for the substrate-binding protein on the extracellular face. In the second example, we partitioned a protein methyltransferase family using gene proximity as a criterion. In this case, the TRUE set comprised those methyltransferases encoded near the gene for the substrate RF-1. SIMBAL identifies sequence regions that map onto the substrate-binding interface while ignoring regions involved in the methyltransferase reaction mechanism in general. Neither method for training set construction requires any prior experimental characterization. CONCLUSIONS SIMBAL shows that, in functionally divergent protein families, selected short sequences often significantly outperform their full-length parent sequence for making functional predictions by sequence similarity, suggesting avenues for improved functional classifiers. When combined with structural data, SIMBAL affords the ability to localize and model functional sites.
Collapse
|
32
|
A strain-variable bacteriocin in Bacillus anthracis and Bacillus cereus with repeated Cys-Xaa-Xaa motifs. Biol Direct 2009; 4:15. [PMID: 19383135 PMCID: PMC2674419 DOI: 10.1186/1745-6150-4-15] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2009] [Accepted: 04/21/2009] [Indexed: 03/29/2023] Open
Abstract
Bacteriocins are peptide antibiotics from ribosomally translated precursors, produced by bacteria often through extensive post-translational modification. Minimal sequence conservation, short gene lengths, and low complexity sequence can hinder bacteriocin identification, even during gene calling, so they are often discovered by proximity to accessory genes encoding maturation, immunity, and export functions. This work reports a new subfamily of putative thiazole-containing heterocyclic bacteriocins. It appears universal in all strains of Bacillus anthracis and B. cereus, but has gone unrecognized because it is always encoded far from its maturation protein operon. Patterns of insertions and deletions among twenty-four variants suggest a repeating functional unit of Cys-Xaa-Xaa. This article was reviewed by Andrei Osterman and Lakshminarayan Iyer.
Collapse
|
33
|
Orphan SelD proteins and selenium-dependent molybdenum hydroxylases. Biol Direct 2008; 3:4. [PMID: 18289380 PMCID: PMC2276186 DOI: 10.1186/1745-6150-3-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2008] [Accepted: 02/20/2008] [Indexed: 11/24/2022] Open
Abstract
Bacterial and Archaeal cells use selenium structurally in selenouridine-modified tRNAs, in proteins translated with selenocysteine, and in the selenium-dependent molybdenum hydroxylases (SDMH). The first two uses both require the selenophosphate synthetase gene, selD. Examining over 500 complete prokaryotic genomes finds selD in exactly two species lacking both the selenocysteine and selenouridine systems, Enterococcus faecalis and Haloarcula marismortui. Surrounding these orphan selD genes, forming bidirectional best hits between species, and detectable by Partial Phylogenetic Profiling vs. selD, are several candidate molybdenum hydroxylase subunits and accessory proteins. We propose that certain accessory proteins, and orphan selD itself, are markers through which new selenium-dependent molybdenum hydroxylases can be found.
Collapse
|
34
|
TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res 2006; 35:D260-4. [PMID: 17151080 PMCID: PMC1781115 DOI: 10.1093/nar/gkl1043] [Citation(s) in RCA: 229] [Impact Index Per Article: 12.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
TIGRFAMs is a collection of protein family definitions built to aid in high-throughput annotation of specific protein functions. Each family is based on a hidden Markov model (HMM), where both cutoff scores and membership in the seed alignment are chosen so that the HMMs can classify numerous proteins according to their specific molecular functions. Most TIGRFAMs models describe ‘equivalog’ families, where both orthology and lateral gene transfer may be part of the evolutionary history, but where a single molecular function has been conserved. The Genome Properties system contains a queriable set of metabolic reconstructions, genome metrics and extractions of information from the scientific literature. Its genome-by-genome assertions of whether or not specific structures, pathways or systems are present provide high-level conceptual descriptions of genomic content. These assertions enable comparative genomics, provide a meaningful biological context to aid in manual annotation, support assignments of Gene Ontology (GO) biological process terms and help validate HMM-based predictions of protein function. The Genome Properties system is particularly useful as a generator of phylogenetic profiles, through which new protein family functions may be discovered. The TIGRFAMs and Genome Properties systems can be accessed at and .
Collapse
|
35
|
Comparative genomic evidence for a close relationship between the dimorphic prosthecate bacteria Hyphomonas neptunium and Caulobacter crescentus. J Bacteriol 2006; 188:6841-50. [PMID: 16980487 PMCID: PMC1595504 DOI: 10.1128/jb.00111-06] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
The dimorphic prosthecate bacteria (DPB) are alpha-proteobacteria that reproduce in an asymmetric manner rather than by binary fission and are of interest as simple models of development. Prior to this work, the only member of this group for which genome sequence was available was the model freshwater organism Caulobacter crescentus. Here we describe the genome sequence of Hyphomonas neptunium, a marine member of the DPB that differs from C. crescentus in that H. neptunium uses its stalk as a reproductive structure. Genome analysis indicates that this organism shares more genes with C. crescentus than it does with Silicibacter pomeroyi (a closer relative according to 16S rRNA phylogeny), that it relies upon a heterotrophic strategy utilizing a wide range of substrates, that its cell cycle is likely to be regulated in a similar manner to that of C. crescentus, and that the outer membrane complements of H. neptunium and C. crescentus are remarkably similar. H. neptunium swarmer cells are highly motile via a single polar flagellum. With the exception of cheY and cheR, genes required for chemotaxis were absent in the H. neptunium genome. Consistent with this observation, H. neptunium swarmer cells did not respond to any chemotactic stimuli that were tested, which suggests that H. neptunium motility is a random dispersal mechanism for swarmer cells rather than a stimulus-controlled navigation system for locating specific environments. In addition to providing insights into bacterial development, the H. neptunium genome will provide an important resource for the study of other interesting biological processes including chromosome segregation, polar growth, and cell aging.
Collapse
|
36
|
Exopolysaccharide-associated protein sorting in environmental organisms: the PEP-CTERM/EpsH system. Application of a novel phylogenetic profiling heuristic. BMC Biol 2006; 4:29. [PMID: 16930487 PMCID: PMC1569441 DOI: 10.1186/1741-7007-4-29] [Citation(s) in RCA: 68] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2006] [Accepted: 08/24/2006] [Indexed: 11/13/2022] Open
Abstract
Background Protein translocation to the proper cellular destination may be guided by various classes of sorting signals recognizable in the primary sequence. Detection in some genomes, but not others, may reveal sorting system components by comparison of the phylogenetic profile of the class of sorting signal to that of various protein families. Results We describe a short C-terminal homology domain, sporadically distributed in bacteria, with several key characteristics of protein sorting signals. The domain includes a near-invariant motif Pro-Glu-Pro (PEP). This possible recognition or processing site is followed by a predicted transmembrane helix and a cluster rich in basic amino acids. We designate this domain PEP-CTERM. It tends to occur multiple times in a genome if it occurs at all, with a median count of eight instances; Verrucomicrobium spinosum has sixty-five. PEP-CTERM-containing proteins generally contain an N-terminal signal peptide and exhibit high diversity and little homology to known proteins. All bacteria with PEP-CTERM have both an outer membrane and exopolysaccharide (EPS) production genes. By a simple heuristic for screening phylogenetic profiles in the absence of pre-formed protein families, we discovered that a homolog of the membrane protein EpsH (exopolysaccharide locus protein H) occurs in a species when PEP-CTERM domains are found. The EpsH family contains invariant residues consistent with a transpeptidase function. Most PEP-CTERM proteins are encoded by single-gene operons preceded by large intergenic regions. In the Proteobacteria, most of these upstream regions share a DNA sequence, a probable cis-regulatory site that contains a sigma-54 binding motif. The phylogenetic profile for this DNA sequence exactly matches that of three proteins: a sigma-54-interacting response regulator (PrsR), a transmembrane histidine kinase (PrsK), and a TPR protein (PrsT). Conclusion These findings are consistent with the hypothesis that PEP-CTERM and EpsH form a protein export sorting system, analogous to the LPXTG/sortase system of Gram-positive bacteria, and correlated to EPS expression. It occurs preferentially in bacteria from sediments, soils, and biofilms. The novel method that led to these findings, partial phylogenetic profiling, requires neither global sequence clustering nor arbitrary similarity cutoffs and appears to be a rapid, effective alternative to other profiling methods.
Collapse
|
37
|
Skewed genomic variability in strains of the toxigenic bacterial pathogen, Clostridium perfringens. Genome Res 2006; 16:1031-40. [PMID: 16825665 PMCID: PMC1524862 DOI: 10.1101/gr.5238106] [Citation(s) in RCA: 240] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
Clostridium perfringens is a Gram-positive, anaerobic spore-forming bacterium commonly found in soil, sediments, and the human gastrointestinal tract. C. perfringens is responsible for a wide spectrum of disease, including food poisoning, gas gangrene (clostridial myonecrosis), enteritis necroticans, and non-foodborne gastrointestinal infections. The complete genome sequences of Clostridium perfringens strain ATCC 13124, a gas gangrene isolate and the species type strain, and the enterotoxin-producing food poisoning strain SM101, were determined and compared with the published C. perfringens strain 13 genome. Comparison of the three genomes revealed considerable genomic diversity with >300 unique "genomic islands" identified, with the majority of these islands unusually clustered on one replichore. PCR-based analysis indicated that the large genomic islands are widely variable across a large collection of C. perfringens strains. These islands encode genes that correlate to differences in virulence and phenotypic characteristics of these strains. Significant differences between the strains include numerous novel mobile elements and genes encoding metabolic capabilities, strain-specific extracellular polysaccharide capsule, sporulation factors, toxins, and other secreted enzymes, providing substantial insight into this medically important bacterial pathogen.
Collapse
|
38
|
Abstract
Anaplasma (formerly Ehrlichia) phagocytophilum, Ehrlichia chaffeensis, and Neorickettsia (formerly Ehrlichia) sennetsu are intracellular vector-borne pathogens that cause human ehrlichiosis, an emerging infectious disease. We present the complete genome sequences of these organisms along with comparisons to other organisms in the Rickettsiales order. Ehrlichia spp. and Anaplasma spp. display a unique large expansion of immunodominant outer membrane proteins facilitating antigenic variation. All Rickettsiales have a diminished ability to synthesize amino acids compared to their closest free-living relatives. Unlike members of the Rickettsiaceae family, these pathogenic Anaplasmataceae are capable of making all major vitamins, cofactors, and nucleotides, which could confer a beneficial role in the invertebrate vector or the vertebrate host. Further analysis identified proteins potentially involved in vacuole confinement of the Anaplasmataceae, a life cycle involving a hematophagous vector, vertebrate pathogenesis, human pathogenesis, and lack of transovarial transmission. These discoveries provide significant insights into the biology of these obligate intracellular pathogens. Ehrlichiosis is an acute disease that triggers flu-like symptoms in both humans and animals. It is caused by a range of bacteria transmitted by ticks or flukes. Because these bacteria are difficult to culture, however, the organisms are poorly understood. The genomes of three emerging human pathogens causing ehrlichiosis were sequenced. A database was designed to allow the comparison of these three genomes to sixteen other bacteria with similar lifestyles. Analysis from this database reveals new species-specific and disease-specific genes indicating niche adaptations, pathogenic traits, and other features. In particular, one of the organisms contains more than 100 copies of a single gene involved in interactions with the host(s). These comparisons also enabled a reconstruction of the metabolic potential of five representative genomes from these bacteria and their close relatives. With this work, scientists can study these emerging pathogens in earnest.
Collapse
|
39
|
A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 2005; 1:e60. [PMID: 16292354 PMCID: PMC1282333 DOI: 10.1371/journal.pcbi.0010060] [Citation(s) in RCA: 683] [Impact Index Per Article: 35.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2005] [Accepted: 10/05/2005] [Indexed: 11/19/2022] Open
Abstract
Clustered regularly interspaced short palindromic repeats (CRISPRs) are a family of DNA direct repeats found in many prokaryotic genomes. Repeats of 21-37 bp typically show weak dyad symmetry and are separated by regularly sized, nonrepetitive spacer sequences. Four CRISPR-associated (Cas) protein families, designated Cas1 to Cas4, are strictly associated with CRISPR elements and always occur near a repeat cluster. Some spacers originate from mobile genetic elements and are thought to confer "immunity" against the elements that harbor these sequences. In the present study, we have systematically investigated uncharacterized proteins encoded in the vicinity of these CRISPRs and found many additional protein families that are strictly associated with CRISPR loci across multiple prokaryotic species. Multiple sequence alignments and hidden Markov models have been built for 45 Cas protein families. These models identify family members with high sensitivity and selectivity and classify key regulators of development, DevR and DevS, in Myxococcus xanthus as Cas proteins. These identifications show that CRISPR/cas gene regions can be quite large, with up to 20 different, tandem-arranged cas genes next to a repeat cluster or filling the region between two repeat clusters. Distinctive subsets of the collection of Cas proteins recur in phylogenetically distant species and correlate with characteristic repeat periodicity. The analyses presented here support initial proposals of mobility of these units, along with the likelihood that loci of different subtypes interact with one another as well as with host cell defensive, replicative, and regulatory systems. It is evident from this analysis that CRISPR/cas loci are larger, more complex, and more heterogeneous than previously appreciated.
Collapse
|
40
|
Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sci U S A 2005; 102:13950-5. [PMID: 16172379 PMCID: PMC1216834 DOI: 10.1073/pnas.0506758102] [Citation(s) in RCA: 1526] [Impact Index Per Article: 80.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
The development of efficient and inexpensive genome sequencing methods has revolutionized the study of human bacterial pathogens and improved vaccine design. Unfortunately, the sequence of a single genome does not reflect how genetic variability drives pathogenesis within a bacterial species and also limits genome-wide screens for vaccine candidates or for antimicrobial targets. We have generated the genomic sequence of six strains representing the five major disease-causing serotypes of Streptococcus agalactiae, the main cause of neonatal infection in humans. Analysis of these genomes and those available in databases showed that the S. agalactiae species can be described by a pan-genome consisting of a core genome shared by all isolates, accounting for approximately 80% of any single genome, plus a dispensable genome consisting of partially shared and strain-specific genes. Mathematical extrapolation of the data suggests that the gene reservoir available for inclusion in the S. agalactiae pan-genome is vast and that unique genes will continue to be identified even after sequencing hundreds of genomes.
Collapse
|
41
|
Insights on evolution of virulence and resistance from the complete genome analysis of an early methicillin-resistant Staphylococcus aureus strain and a biofilm-producing methicillin-resistant Staphylococcus epidermidis strain. J Bacteriol 2005; 187:2426-38. [PMID: 15774886 PMCID: PMC1065214 DOI: 10.1128/jb.187.7.2426-2438.2005] [Citation(s) in RCA: 752] [Impact Index Per Article: 39.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2004] [Accepted: 12/13/2004] [Indexed: 01/27/2023] Open
Abstract
Staphylococcus aureus is an opportunistic pathogen and the major causative agent of numerous hospital- and community-acquired infections. Staphylococcus epidermidis has emerged as a causative agent of infections often associated with implanted medical devices. We have sequenced the approximately 2.8-Mb genome of S. aureus COL, an early methicillin-resistant isolate, and the approximately 2.6-Mb genome of S. epidermidis RP62a, a methicillin-resistant biofilm isolate. Comparative analysis of these and other staphylococcal genomes was used to explore the evolution of virulence and resistance between these two species. The S. aureus and S. epidermidis genomes are syntenic throughout their lengths and share a core set of 1,681 open reading frames. Genome islands in nonsyntenic regions are the primary source of variations in pathogenicity and resistance. Gene transfer between staphylococci and low-GC-content gram-positive bacteria appears to have shaped their virulence and resistance profiles. Integrated plasmids in S. epidermidis carry genes encoding resistance to cadmium and species-specific LPXTG surface proteins. A novel genome island encodes multiple phenol-soluble modulins, a potential S. epidermidis virulence factor. S. epidermidis contains the cap operon, encoding the polyglutamate capsule, a major virulence factor in Bacillus anthracis. Additional phenotypic differences are likely the result of single nucleotide polymorphisms, which are most numerous in cell envelope proteins. Overall differences in pathogenicity can be attributed to genome islands in S. aureus which encode enterotoxins, exotoxins, leukocidins, and leukotoxins not found in S. epidermidis.
Collapse
|
42
|
Genome sequence of Silicibacter pomeroyi reveals adaptations to the marine environment. Nature 2004; 432:910-3. [PMID: 15602564 DOI: 10.1038/nature03170] [Citation(s) in RCA: 316] [Impact Index Per Article: 15.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2004] [Accepted: 11/01/2004] [Indexed: 11/08/2022]
Abstract
Since the recognition of prokaryotes as essential components of the oceanic food web, bacterioplankton have been acknowledged as catalysts of most major biogeochemical processes in the sea. Studying heterotrophic bacterioplankton has been challenging, however, as most major clades have never been cultured or have only been grown to low densities in sea water. Here we describe the genome sequence of Silicibacter pomeroyi, a member of the marine Roseobacter clade (Fig. 1), the relatives of which comprise approximately 10-20% of coastal and oceanic mixed-layer bacterioplankton. This first genome sequence from any major heterotrophic clade consists of a chromosome (4,109,442 base pairs) and megaplasmid (491,611 base pairs). Genome analysis indicates that this organism relies upon a lithoheterotrophic strategy that uses inorganic compounds (carbon monoxide and sulphide) to supplement heterotrophy. Silicibacter pomeroyi also has genes advantageous for associations with plankton and suspended particles, including genes for uptake of algal-derived compounds, use of metabolites from reducing microzones, rapid growth and cell-density-dependent regulation. This bacterium has a physiology distinct from that of marine oligotrophs, adding a new strategy to the recognized repertoire for coping with a nutrient-poor ocean.
Collapse
|
43
|
Abstract
The complete genome sequence of Burkholderia mallei ATCC 23344 provides insight into this highly infectious bacterium's pathogenicity and evolutionary history. B. mallei, the etiologic agent of glanders, has come under renewed scientific investigation as a result of recent concerns about its past and potential future use as a biological weapon. Genome analysis identified a number of putative virulence factors whose function was supported by comparative genome hybridization and expression profiling of the bacterium in hamster liver in vivo. The genome contains numerous insertion sequence elements that have mediated extensive deletions and rearrangements of the genome relative to Burkholderia pseudomallei. The genome also contains a vast number (>12,000) of simple sequence repeats. Variation in simple sequence repeats in key genes can provide a mechanism for generating antigenic variation that may account for the mammalian host's inability to mount a durable adaptive immune response to a B. mallei infection.
Collapse
|
44
|
Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics 2004; 21:293-306. [PMID: 15347579 DOI: 10.1093/bioinformatics/bti015] [Citation(s) in RCA: 67] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The presence or absence of metabolic pathways and structures provide a context that makes protein annotation far more reliable. Compiling such information across microbial genomes improves the functional classification of proteins and provides a valuable resource for comparative genomics. RESULTS We have created a Genome Properties system to present key aspects of prokaryotic biology using standardized computational methods and controlled vocabularies. Properties reflect gene content, phenotype, phylogeny and computational analyses. The results of searches using hidden Markov models allow many properties to be deduced automatically, especially for families of proteins (equivalogs) conserved in function since their last common ancestor. Additional properties are derived from curation, published reports and other forms of evidence. Genome Properties system was applied to 156 complete prokaryotic genomes, and is easily mined to find differences between species, correlations between metabolic features and families of uncharacterized proteins, or relationships among properties. AVAILABILITY Genome Properties can be found at http://www.tigr.org/Genome_Properties SUPPLEMENTARY INFORMATION http://www.tigr.org/tigr-scripts/CMR2/genome_properties_references.spl.
Collapse
|
45
|
Whole genome comparisons of serotype 4b and 1/2a strains of the food-borne pathogen Listeria monocytogenes reveal new insights into the core genome components of this species. Nucleic Acids Res 2004; 32:2386-95. [PMID: 15115801 PMCID: PMC419451 DOI: 10.1093/nar/gkh562] [Citation(s) in RCA: 375] [Impact Index Per Article: 18.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2003] [Revised: 03/21/2004] [Accepted: 04/01/2004] [Indexed: 11/14/2022] Open
Abstract
The genomes of three strains of Listeria monocytogenes that have been associated with food-borne illness in the USA were subjected to whole genome comparative analysis. A total of 51, 97 and 69 strain-specific genes were identified in L.monocytogenes strains F2365 (serotype 4b, cheese isolate), F6854 (serotype 1/2a, frankfurter isolate) and H7858 (serotype 4b, meat isolate), respectively. Eighty-three genes were restricted to serotype 1/2a and 51 to serotype 4b strains. These strain- and serotype-specific genes probably contribute to observed differences in pathogenicity, and the ability of the organisms to survive and grow in their respective environmental niches. The serotype 1/2a-specific genes include an operon that encodes the rhamnose biosynthetic pathway that is associated with teichoic acid biosynthesis, as well as operons for five glycosyl transferases and an adenine-specific DNA methyltransferase. A total of 8603 and 105 050 high quality single nucleotide polymorphisms (SNPs) were found on the draft genome sequences of strain H7858 and strain F6854, respectively, when compared with strain F2365. Whole genome comparative analyses revealed that the L.monocytogenes genomes are essentially syntenic, with the majority of genomic differences consisting of phage insertions, transposable elements and SNPs.
Collapse
|
46
|
The genome sequence of the anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough. Nat Biotechnol 2004; 22:554-9. [PMID: 15077118 DOI: 10.1038/nbt959] [Citation(s) in RCA: 407] [Impact Index Per Article: 20.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2003] [Accepted: 03/03/2004] [Indexed: 11/09/2022]
Abstract
Desulfovibrio vulgaris Hildenborough is a model organism for studying the energy metabolism of sulfate-reducing bacteria (SRB) and for understanding the economic impacts of SRB, including biocorrosion of metal infrastructure and bioremediation of toxic metal ions. The 3,570,858 base pair (bp) genome sequence reveals a network of novel c-type cytochromes, connecting multiple periplasmic hydrogenases and formate dehydrogenases, as a key feature of its energy metabolism. The relative arrangement of genes encoding enzymes for energy transduction, together with inferred cellular location of the enzymes, provides a basis for proposing an expansion to the 'hydrogen-cycling' model for increasing energy efficiency in this bacterium. Plasmid-encoded functions include modification of cell surface components, nitrogen fixation and a type-III protein secretion system. This genome sequence represents a substantial step toward the elucidation of pathways for reduction (and bioremediation) of pollutants such as uranium and chromium and offers a new starting point for defining this organism's complex anaerobic respiration.
Collapse
|
47
|
Abstract
The complete genome sequence of Geobacter sulfurreducens, a delta-proteobacterium, reveals unsuspected capabilities, including evidence of aerobic metabolism, one-carbon and complex carbon metabolism, motility, and chemotactic behavior. These characteristics, coupled with the possession of many two-component sensors and many c-type cytochromes, reveal an ability to create alternative, redundant, electron transport networks and offer insights into the process of metal ion reduction in subsurface environments. As well as playing roles in the global cycling of metals and carbon, this organism clearly has the potential for use in bioremediation of radioactive metals and in the generation of electricity.
Collapse
|
48
|
The complete genome sequence of the Arabidopsis and tomato pathogen Pseudomonas syringae pv. tomato DC3000. Proc Natl Acad Sci U S A 2003; 100:10181-6. [PMID: 12928499 PMCID: PMC193536 DOI: 10.1073/pnas.1731982100] [Citation(s) in RCA: 607] [Impact Index Per Article: 28.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2003] [Indexed: 12/24/2022] Open
Abstract
We report the complete genome sequence of the model bacterial pathogen Pseudomonas syringae pathovar tomato DC3000 (DC3000), which is pathogenic on tomato and Arabidopsis thaliana. The DC3000 genome (6.5 megabases) contains a circular chromosome and two plasmids, which collectively encode 5,763 ORFs. We identified 298 established and putative virulence genes, including several clusters of genes encoding 31 confirmed and 19 predicted type III secretion system effector proteins. Many of the virulence genes were members of paralogous families and also were proximal to mobile elements, which collectively comprise 7% of the DC3000 genome. The bacterium possesses a large repertoire of transporters for the acquisition of nutrients, particularly sugars, as well as genes implicated in attachment to plant surfaces. Over 12% of the genes are dedicated to regulation, which may reflect the need for rapid adaptation to the diverse environments encountered during epiphytic growth and pathogenesis. Comparative analyses confirmed a high degree of similarity with two sequenced pseudomonads, Pseudomonas putida and Pseudomonas aeruginosa, yet revealed 1,159 genes unique to DC3000, of which 811 lack a known function.
Collapse
|
49
|
Abstract
The complete 2,343,479-bp genome sequence of the gram-negative, pathogenic oral bacterium Porphyromonas gingivalis strain W83, a major contributor to periodontal disease, was determined. Whole-genome comparative analysis with other available complete genome sequences confirms the close relationship between the Cytophaga-Flavobacteria-Bacteroides (CFB) phylum and the green-sulfur bacteria. Within the CFB phyla, the genomes most similar to that of P. gingivalis are those of Bacteroides thetaiotaomicron and B. fragilis. Outside of the CFB phyla the most similar genome to P. gingivalis is that of Chlorobium tepidum, supporting the previous phylogenetic studies that indicated that the Chlorobia and CFB phyla are related, albeit distantly. Genome analysis of strain W83 reveals a range of pathways and virulence determinants that relate to the novel biology of this oral pathogen. Among these determinants are at least six putative hemagglutinin-like genes and 36 previously unidentified peptidases. Genome analysis also reveals that P. gingivalis can metabolize a range of amino acids and generate a number of metabolic end products that are toxic to the human host or human gingival tissue and contribute to the development of periodontal disease.
Collapse
|
50
|
The genome sequence of Bacillus anthracis Ames and comparison to closely related bacteria. Nature 2003; 423:81-6. [PMID: 12721629 DOI: 10.1038/nature01586] [Citation(s) in RCA: 663] [Impact Index Per Article: 31.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2002] [Accepted: 03/28/2003] [Indexed: 11/09/2022]
Abstract
Bacillus anthracis is an endospore-forming bacterium that causes inhalational anthrax. Key virulence genes are found on plasmids (extra-chromosomal, circular, double-stranded DNA molecules) pXO1 (ref. 2) and pXO2 (ref. 3). To identify additional genes that might contribute to virulence, we analysed the complete sequence of the chromosome of B. anthracis Ames (about 5.23 megabases). We found several chromosomally encoded proteins that may contribute to pathogenicity--including haemolysins, phospholipases and iron acquisition functions--and identified numerous surface proteins that might be important targets for vaccines and drugs. Almost all these putative chromosomal virulence and surface proteins have homologues in Bacillus cereus, highlighting the similarity of B. anthracis to near-neighbours that are not associated with anthrax. By performing a comparative genome hybridization of 19 B. cereus and Bacillus thuringiensis strains against a B. anthracis DNA microarray, we confirmed the general similarity of chromosomal genes among this group of close relatives. However, we found that the gene sequences of pXO1 and pXO2 were more variable between strains, suggesting plasmid mobility in the group. The complete sequence of B. anthracis is a step towards a better understanding of anthrax pathogenesis.
Collapse
|