1. VanOeffelen M, Nguyen M, Aytan-Aktug D, Brettin T, Dietrich EM, Kenyon RW, Machi D, Mao C, Olson R, Pusch GD, Shukla M, Stevens R, Vonstein V, Warren AS, Wattam AR, Yoo H, Davis JJ. A genomic data resource for predicting antimicrobial resistance from laboratory-derived antimicrobial susceptibility phenotypes. Brief Bioinform 2021; 22:bbab313. [PMID: 34379107 PMCID: PMC8575023 DOI: 10.1093/bib/bbab313] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 04/27/2021] [Revised: 06/18/2021] [Accepted: 07/20/2021] [Indexed: 11/14/2022]
Abstract
Antimicrobial resistance (AMR) is a major global health threat that affects millions of people each year. Funding agencies worldwide and the global research community have expended considerable capital and effort tracking the evolution and spread of AMR by isolating and sequencing bacterial strains and performing antimicrobial susceptibility testing (AST). For the last several years, we have been capturing these efforts by curating data from the literature and data resources and building a set of assembled bacterial genome sequences that are paired with laboratory-derived AST data. This collection currently contains AST data for over 67 000 genomes encompassing approximately 40 genera and over 100 species. In this paper, we describe the characteristics of this collection, highlighting areas where sampling is comparatively deep or shallow, and showing areas where attention is needed from the research community to improve sampling and tracking efforts. In addition to using the data to track the evolution and spread of AMR, it also serves as a useful starting point for building machine learning models for predicting AMR phenotypes. We demonstrate this by describing two machine learning models that are built from the entire dataset to show where the predictive power is comparatively high or low. This AMR metadata collection is freely available and maintained on the Bacterial and Viral Bioinformatics Center (BV-BRC) FTP site ftp://ftp.bvbrc.org/RELEASE_NOTES/PATRIC_genomes_AMR.txt.
Affiliation(s)
- Marcus Nguyen
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
- Derya Aytan-Aktug
  - National Food Institute, Technical University of Denmark, Kgs. Lyngby, Denmark
- Thomas Brettin
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
- Emily M Dietrich
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
- Ronald W Kenyon
  - Biocomplexity Institute and Initiative, University of Virginia, Virginia, USA
- Dustin Machi
  - Biocomplexity Institute and Initiative, University of Virginia, Virginia, USA
- Chunhong Mao
  - Biocomplexity Institute and Initiative, University of Virginia, Virginia, USA
- Robert Olson
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
- Gordon D Pusch
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
- Maulik Shukla
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
- Rick Stevens
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
  - Department of Computer Science, University of Chicago, Chicago, IL, USA
- Andrew S Warren
  - Biocomplexity Institute and Initiative, University of Virginia, Virginia, USA
- Alice R Wattam
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
  - Biocomplexity Institute and Initiative, University of Virginia, Virginia, USA
- Hyunseung Yoo
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
- James J Davis
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL, USA
  - Data Science and Learning Division, Argonne National Laboratory, Argonne, IL, USA
  - Northwestern Argonne Institute for Science and Engineering, Evanston, IL, USA
2. Antonopoulos DA, Assaf R, Aziz RK, Brettin T, Bun C, Conrad N, Davis JJ, Dietrich EM, Disz T, Gerdes S, Kenyon RW, Machi D, Mao C, Murphy-Olson DE, Nordberg EK, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Santerre J, Shukla M, Stevens RL, VanOeffelen M, Vonstein V, Warren AS, Wattam AR, Xia F, Yoo H. PATRIC as a unique resource for studying antimicrobial resistance. Brief Bioinform 2020; 20:1094-1102. [PMID: 28968762 PMCID: PMC6781570 DOI: 10.1093/bib/bbx083] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Received: 04/30/2017] [Revised: 06/13/2017] [Indexed: 02/07/2023]
Abstract
The Pathosystems Resource Integration Center (PATRIC, www.patricbrc.org) is designed to provide researchers with the tools and services that they need to perform genomic and other ‘omic’ data analyses. In response to mounting concern over antimicrobial resistance (AMR), the PATRIC team has been developing new tools that help researchers understand AMR and its genetic determinants. To support comparative analyses, we have added AMR phenotype data to over 15 000 genomes in the PATRIC database, often assembling genomes from reads in public archives and collecting their associated AMR panel data from the literature to augment the collection. We have also been using this collection of AMR metadata to build machine learning-based classifiers that can predict the AMR phenotypes and the genomic regions associated with resistance for genomes being submitted to the annotation service. Likewise, we have undertaken a large AMR protein annotation effort by manually curating data from the literature and public repositories. This collection of 7370 AMR reference proteins, which contains many protein annotations (functional roles) that are unique to PATRIC and RAST, has been manually curated so that it projects stably across genomes. The collection currently projects to 1 610 744 proteins in the PATRIC database. Finally, the PATRIC Web site has been expanded to enable AMR-based custom page views so that researchers can easily explore AMR data and design experiments based on whole genomes or individual genes.
Affiliation(s)
- Alice R Wattam
  - Corresponding author: Alice R. Wattam, Biocomplexity Institute of Virginia Tech, 1015 Life Science Circle, Blacksburg, VA 24061 USA. Tel.: 540-231-1263; Fax: 540-231-2606; E-mail:
3. Davis JJ, Wattam AR, Aziz RK, Brettin T, Butler R, Butler RM, Chlenski P, Conrad N, Dickerman A, Dietrich EM, Gabbard JL, Gerdes S, Guard A, Kenyon RW, Machi D, Mao C, Murphy-Olson D, Nguyen M, Nordberg EK, Olsen GJ, Olson RD, Overbeek JC, Overbeek R, Parrello B, Pusch GD, Shukla M, Thomas C, VanOeffelen M, Vonstein V, Warren AS, Xia F, Xie D, Yoo H, Stevens R. The PATRIC Bioinformatics Resource Center: expanding data and analysis capabilities. Nucleic Acids Res 2020; 48:D606-D612. [PMID: 31667520 PMCID: PMC7145515 DOI: 10.1093/nar/gkz943] [Citation(s) in RCA: 401] [Impact Index Per Article: 100.3] [Received: 09/12/2019] [Revised: 10/07/2019] [Accepted: 10/11/2019] [Indexed: 12/24/2022]
Abstract
The PathoSystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center funded by the National Institute of Allergy and Infectious Diseases (https://www.patricbrc.org). PATRIC supports bioinformatic analyses of all bacteria with a special emphasis on pathogens, offering a rich comparative analysis environment that provides users with access to over 250 000 uniformly annotated and publicly available genomes with curated metadata. PATRIC offers web-based visualization and comparative analysis tools, a private workspace in which users can analyze their own data in the context of the public collections, services that streamline complex bioinformatic workflows and command-line tools for bulk data analysis. Over the past several years, as genomic and other omics-related experiments have become more cost-effective and widespread, we have observed considerable growth in the usage of and demand for easy-to-use, publicly available bioinformatic tools and services. Here we report the recent updates to the PATRIC resource, including new web-based comparative analysis tools, eight new services and the release of a command-line interface to access, query and analyze data.
Affiliation(s)
- James J Davis
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Alice R Wattam
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Ramy K Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, 11562 Cairo, Egypt
  - Center for Genome and Microbiome Research, Cairo University, 11562 Cairo, Egypt
- Thomas Brettin
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Ralph Butler
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
  - Middle Tennessee State University, Murfreesboro, TN 37132, USA
- Rory M Butler
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Neal Conrad
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Allan Dickerman
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Emily M Dietrich
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Svetlana Gerdes
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Andrew Guard
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
- Ronald W Kenyon
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Dustin Machi
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Chunhong Mao
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Dan Murphy-Olson
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Marcus Nguyen
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Eric K Nordberg
  - Transportation Institute, Virginia Tech University, Blacksburg, VA 24061, USA
- Gary J Olsen
  - Department of Microbiology, University of Illinois, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
- Robert D Olson
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Jamie C Overbeek
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Ross Overbeek
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
- Bruce Parrello
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Gordon D Pusch
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Maulik Shukla
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Chris Thomas
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
- Andrew S Warren
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Fangfang Xia
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Dawen Xie
  - Biocomplexity Institute and Initiative, University of Virginia, Charlottesville, VA 22904, USA
- Hyunseung Yoo
  - University of Chicago Consortium for Advanced Science and Engineering, University of Chicago, Chicago, IL 60637, USA
  - Division of Data Science and Learning, Argonne National Laboratory, Argonne, IL 60439, USA
- Rick Stevens
  - Computing Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
  - University of Chicago, Department of Computer Science, Chicago, IL 60637, USA
4. Parrello B, Butler R, Chlenski P, Olson R, Overbeek J, Pusch GD, Vonstein V, Overbeek R. A machine learning-based service for estimating quality of genomes using PATRIC. BMC Bioinformatics 2019; 20:486. [PMID: 31581946 PMCID: PMC6775668 DOI: 10.1186/s12859-019-3068-y] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 05/01/2019] [Accepted: 09/03/2019] [Indexed: 11/10/2022]
Abstract
Background: Recent advances in high-volume sequencing technology and mining of genomes from metagenomic samples call for rapid and reliable genome quality evaluation. The current release of the PATRIC database contains over 220,000 genomes, and current metagenomic technology supports assemblies of many draft-quality genomes from a single sample, most of which will be novel. Description: We have added two quality assessment tools to the PATRIC annotation pipeline. EvalCon uses supervised machine learning to calculate an annotation consistency score. EvalG implements a variant of the CheckM algorithm to estimate contamination and completeness of an annotated genome. We report on the performance of these tools and the potential utility of the consistency score. Additionally, we provide contamination, completeness, and consistency measures for all genomes in PATRIC and in a recent set of metagenomic assemblies. Conclusion: EvalG and EvalCon facilitate the rapid quality control and exploration of PATRIC-annotated draft genomes.
Affiliation(s)
- Bruce Parrello
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
  - University of Chicago, Chicago, 60637, IL, USA
- Rory Butler
  - Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory, 4200 S. Cass Avenue, Lemont, 60439, IL, USA
- Philippe Chlenski
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
- Robert Olson
  - Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory, 4200 S. Cass Avenue, Lemont, 60439, IL, USA
- Jamie Overbeek
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
  - Computing, Environment, and Life Sciences Directorate, Argonne National Laboratory, 4200 S. Cass Avenue, Lemont, 60439, IL, USA
- Gordon D Pusch
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
- Veronika Vonstein
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
- Ross Overbeek
  - Fellowship for Interpretation of Genomes, Burr Ridge, 60527, IL, USA
  - University of Chicago, Chicago, 60637, IL, USA
5.
Abstract
Phages are complex biomolecular machineries that have to survive in a bacterial world. Phage genomes show many adaptations to their lifestyle such as shorter genes, reduced capacity for redundant DNA sequences, and the inclusion of tRNAs in their genomes. In addition, phages are not free-living, they require a host for replication and survival. These unique adaptations provide challenges for the bioinformatics analysis of phage genomes. In particular, ORF calling, genome annotation, noncoding RNA (ncRNA) identification, and the identification of transposons and insertions are all complicated in phage genome analysis. We provide a road map through the phage genome annotation pipeline, and discuss the challenges and solutions for phage genome annotation as we have implemented in the rapid annotation using subsystems (RAST) pipeline.
Affiliation(s)
- Katelyn McNair
  - Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA
- Ramy Karam Aziz
  - Department of Microbiology and Immunology, Faculty of Pharmacy, Cairo University, Cairo, 11562, Egypt
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Gordon D Pusch
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Ross Overbeek
  - Argonne National Laboratory, 9700 S. Cass Ave, Argonne, IL, 60439, USA
- Bas E Dutilh
  - Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584, Utrecht, The Netherlands
  - Centre for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboud University Medical Centre, Geert Grooteplein 28, 6525, Nijmegen, The Netherlands
- Robert Edwards
  - Computational Sciences Research Center, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA
  - Departments of Biology and Computer Science, San Diego State University, 5500 Campanile Drive, San Diego, CA, 92182, USA
6. Wattam AR, Davis JJ, Assaf R, Boisvert S, Brettin T, Bun C, Conrad N, Dietrich EM, Disz T, Gabbard JL, Gerdes S, Henry CS, Kenyon RW, Machi D, Mao C, Nordberg EK, Olsen GJ, Murphy-Olson DE, Olson R, Overbeek R, Parrello B, Pusch GD, Shukla M, Vonstein V, Warren A, Xia F, Yoo H, Stevens RL. Improvements to PATRIC, the all-bacterial Bioinformatics Database and Analysis Resource Center. Nucleic Acids Res 2016; 45:D535-D542. [PMID: 27899627 PMCID: PMC5210524 DOI: 10.1093/nar/gkw1017] [Citation(s) in RCA: 1036] [Impact Index Per Article: 129.5] [Received: 09/02/2016] [Revised: 10/14/2016] [Accepted: 11/09/2016] [Indexed: 12/14/2022]
Abstract
The Pathosystems Resource Integration Center (PATRIC) is the bacterial Bioinformatics Resource Center (https://www.patricbrc.org). Recent changes to PATRIC include a redesign of the web interface and some new services that provide users with a platform that takes them from raw reads to an integrated analysis experience. The redesigned interface allows researchers direct access to tools and data, and the emphasis has changed to user-created genome-groups, with detailed summaries and views of the data that researchers have selected. Perhaps the biggest change has been the enhanced capability for researchers to analyze their private data and compare it to the available public data. Researchers can assemble their raw sequence reads and annotate the contigs using RASTtk. PATRIC also provides services for RNA-Seq, variation, model reconstruction and differential expression analysis, all delivered through an updated private workspace. Private data can be compared by ‘virtual integration’ to any of PATRIC's public data. The number of genomes available for comparison in PATRIC has expanded to over 80 000, with a special emphasis on genomes with antimicrobial resistance data. PATRIC uses this data to improve both subsystem annotation and k-mer classification, and tags new genomes as having signatures that indicate susceptibility or resistance to specific antibiotics.
Affiliation(s)
- Alice R Wattam
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- James J Davis
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Rida Assaf
  - Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
- Thomas Brettin
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Christopher Bun
  - Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
- Neal Conrad
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Emily M Dietrich
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Terry Disz
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Joseph L Gabbard
  - Grado Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24060, USA
- Svetlana Gerdes
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Christopher S Henry
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Ronald W Kenyon
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- Dustin Machi
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- Chunhong Mao
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- Eric K Nordberg
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- Gary J Olsen
  - Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Daniel E Murphy-Olson
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Robert Olson
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Ross Overbeek
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Bruce Parrello
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Gordon D Pusch
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA
- Maulik Shukla
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Andrew Warren
  - Biocomplexity Institute, Virginia Tech University, Blacksburg, VA 24060, USA
- Fangfang Xia
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Hyunseung Yoo
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
- Rick L Stevens
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA
  - Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
7. Davis JJ, Gerdes S, Olsen GJ, Olson R, Pusch GD, Shukla M, Vonstein V, Wattam AR, Yoo H. PATtyFams: Protein Families for the Microbial Genomes in the PATRIC Database. Front Microbiol 2016; 7:118. [PMID: 26903996 PMCID: PMC4744870 DOI: 10.3389/fmicb.2016.00118] [Citation(s) in RCA: 110] [Impact Index Per Article: 13.8] [Received: 11/30/2015] [Accepted: 01/22/2016] [Indexed: 01/12/2023]
Abstract
The ability to build accurate protein families is a fundamental operation in bioinformatics that influences comparative analyses, genome annotation, and metabolic modeling. For several years we have been maintaining protein families for all microbial genomes in the PATRIC database (Pathosystems Resource Integration Center, patricbrc.org) in order to drive many of the comparative analysis tools that are available through the PATRIC website. However, due to the burgeoning number of genomes, traditional approaches for generating protein families are becoming prohibitive. In this report, we describe a new approach for generating protein families, which we call PATtyFams. This method uses the k-mer-based function assignments available through RAST (Rapid Annotation using Subsystem Technology) to rapidly guide family formation, and then differentiates the function-based groups into families using a Markov Cluster algorithm (MCL). This new approach for generating protein families is rapid, scalable and has properties that are consistent with alignment-based methods.
Affiliation(s)
- James J Davis
  - Computation Institute, University of Chicago, Chicago, IL, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
- Svetlana Gerdes
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
- Gary J Olsen
  - Department of Microbiology and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Robert Olson
  - Computation Institute, University of Chicago, Chicago, IL, USA
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Gordon D Pusch
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
- Maulik Shukla
  - Computation Institute, University of Chicago, Chicago, IL, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
- Veronika Vonstein
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
  - Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
- Alice R Wattam
  - Virginia Bioinformatics Institute, Virginia Tech University, Blacksburg, VA, USA
- Hyunseung Yoo
  - Computation Institute, University of Chicago, Chicago, IL, USA
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL, USA
8. Brettin T, Davis JJ, Disz T, Edwards RA, Gerdes S, Olsen GJ, Olson R, Overbeek R, Parrello B, Pusch GD, Shukla M, Thomason JA, Stevens R, Vonstein V, Wattam AR, Xia F. RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes. Sci Rep 2015; 5:8365. [PMID: 25666585 PMCID: PMC4322359 DOI: 10.1038/srep08365] [Citation(s) in RCA: 1645] [Impact Index Per Article: 182.8] [Received: 11/12/2014] [Accepted: 01/02/2015] [Indexed: 12/31/2022]
Abstract
The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.
Affiliation(s)
- Thomas Brettin
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Computation Institute, University of Chicago, Chicago, Illinois, 60637, USA
| | - James J. Davis
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Computation Institute, University of Chicago, Chicago, Illinois, 60637, USA
| | - Terry Disz
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Robert A. Edwards
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 60439, USA
- Department of Computer Science, San Diego State University, San Diego, California, 92182, USA
| | - Svetlana Gerdes
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Gary J. Olsen
- Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
| | - Robert Olson
- Computation Institute, University of Chicago, Chicago, Illinois, 60637, USA
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 60439, USA
| | - Ross Overbeek
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Bruce Parrello
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Gordon D. Pusch
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Maulik Shukla
- Virginia Bioinformatics Institute, Virginia Tech University, Blacksburg, VA, 24060, USA
| | - James A. Thomason
- USDA-ARS Laboratory at Cold Spring Harbor Laboratory, Cold Spring Harbor NY, 11724, USA
| | - Rick Stevens
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Computation Institute, University of Chicago, Chicago, Illinois, 60637, USA
- Department of Computer Science, University of Chicago, Chicago, Illinois, 60637, USA
| | - Veronika Vonstein
- Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne IL, 60439, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, 60527, USA
| | - Alice R. Wattam
- Virginia Bioinformatics Institute, Virginia Tech University, Blacksburg, VA, 24060, USA
| | - Fangfang Xia
- Computation Institute, University of Chicago, Chicago, Illinois, 60637, USA
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 60439, USA
|
9
|
Faria JP, Edirisinghe JN, Davis JJ, Disz T, Hausmann A, Henry CS, Olson R, Overbeek RA, Pusch GD, Shukla M, Vonstein V, Wattam AR. Enabling comparative modeling of closely related genomes: example genus Brucella. 3 Biotech 2015; 5:101-105. [PMID: 28324362 PMCID: PMC4327756 DOI: 10.1007/s13205-014-0202-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/12/2013] [Accepted: 02/17/2014] [Indexed: 12/22/2022] Open
Abstract
For many scientific applications, it is highly desirable to be able to compare metabolic models of closely related genomes. In this short report, we attempt to raise awareness of the fact that taking annotated genomes from public repositories and using them for metabolic model reconstructions is far from trivial due to annotation inconsistencies. We propose a protocol for comparative analysis of metabolic models on closely related genomes, using fifteen strains of genus Brucella, which contains pathogens of both humans and livestock. This study led to the identification and subsequent correction of inconsistent annotations in the SEED database, as well as the identification of 31 biochemical reactions that are common to Brucella but were not originally identified by automated metabolic reconstructions. We are currently implementing this protocol for improving automated annotations within the SEED database, and these improvements have been propagated into PATRIC, Model-SEED, KBase and RAST. This method is an enabling step for the future creation of consistent annotation systems and high-quality model reconstructions that will support the accurate prediction of phenotypes such as pathogenicity, media requirements or type of respiration.
Affiliation(s)
- José P Faria
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- IBB-Institute for Biotechnology and Bioengineering, Centre of Biological Engineering, University of Minho, Campus de Gualtar, 4710-057, Braga, Portugal
| | - Janaka N Edirisinghe
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Computation Institute, University of Chicago, Chicago, IL, USA
| | - James J Davis
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA.
- Computation Institute, University of Chicago, Chicago, IL, USA.
| | - Terrence Disz
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
| | - Anna Hausmann
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
| | - Christopher S Henry
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Computation Institute, University of Chicago, Chicago, IL, USA
| | - Robert Olson
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
| | - Ross A Overbeek
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
| | - Gordon D Pusch
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
| | - Maulik Shukla
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
| | - Veronika Vonstein
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL, USA
| | - Alice R Wattam
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA
|
10
|
Overbeek R, Olson R, Pusch GD, Olsen GJ, Davis JJ, Disz T, Edwards RA, Gerdes S, Parrello B, Shukla M, Vonstein V, Wattam AR, Xia F, Stevens R. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res 2014; 42:D206-14. [PMID: 24293654 PMCID: PMC3965101 DOI: 10.1093/nar/gkt1226] [Citation(s) in RCA: 3096] [Impact Index Per Article: 309.6] [Received: 10/03/2013] [Revised: 11/04/2013] [Accepted: 11/05/2013] [Indexed: 01/12/2023] Open
Abstract
In 2004, the SEED (http://pubseed.theseed.org/) was created to provide consistent and accurate genome annotations across thousands of genomes and as a platform for discovering and developing de novo annotations. The SEED is a constantly updated integration of genomic data with a genome database, web front end, API and server scripts. It is used by many scientists for predicting gene functions and discovering new pathways. In addition to being a powerful database for bioinformatics research, the SEED also houses subsystems (collections of functionally related protein families) and their derived FIGfams (protein families), which represent the core of the RAST annotation engine (http://rast.nmpdr.org/). When a new genome is submitted to RAST, genes are called and their annotations are made by comparison to the FIGfam collection. If the genome is made public, it is then housed within the SEED and its proteins populate the FIGfam collection. This annotation cycle has proven to be a robust and scalable solution to the problem of annotating the exponentially increasing number of genomes. To date, >12 000 users worldwide have annotated >60 000 distinct genomes using RAST. Here we describe the interconnectedness of the SEED database and RAST, the RAST annotation pipeline and updates to both resources.
Affiliation(s)
- Ross Overbeek, Robert Olson, Gordon D. Pusch, Gary J. Olsen, James J. Davis, Terry Disz, Robert A. Edwards, Svetlana Gerdes, Bruce Parrello, Maulik Shukla, Veronika Vonstein, Alice R. Wattam, Fangfang Xia, Rick Stevens
- Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA, Computation Institute, University of Chicago, Chicago, IL 60637, USA, Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA, Department of Computer Science, San Diego State University, San Diego, CA 92182, USA, Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24060, USA, Computing, Environment and Life Sciences, Argonne National Laboratory, Argonne, IL 60439, USA and Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
|
11
|
Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, Gillespie JJ, Gough R, Hix D, Kenyon R, Machi D, Mao C, Nordberg EK, Olson R, Overbeek R, Pusch GD, Shukla M, Schulman J, Stevens RL, Sullivan DE, Vonstein V, Warren A, Will R, Wilson MJC, Yoo HS, Zhang C, Zhang Y, Sobral BW. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res 2013; 42:D581-91. [PMID: 24225323 PMCID: PMC3965095 DOI: 10.1093/nar/gkt1099] [Citation(s) in RCA: 873] [Impact Index Per Article: 79.4] [Indexed: 12/21/2022] Open
Abstract
The Pathosystems Resource Integration Center (PATRIC) is the all-bacterial Bioinformatics Resource Center (BRC) (http://www.patricbrc.org). A joint effort by two of the original National Institute of Allergy and Infectious Diseases-funded BRCs, PATRIC provides researchers with an online resource that stores and integrates a variety of data types [e.g. genomics, transcriptomics, protein-protein interactions (PPIs), three-dimensional protein structures and sequence typing data] and associated metadata. Datatypes are summarized for individual genomes and across taxonomic levels. All genomes in PATRIC, currently more than 10,000, are consistently annotated using RAST, the Rapid Annotations using Subsystems Technology. Summaries of different data types are also provided for individual genes, where comparisons of different annotations are available, and also include available transcriptomic data. PATRIC provides a variety of ways for researchers to find data of interest and a private workspace where they can store both genomic and gene associations, and their own private data. Both private and public data can be analyzed together using a suite of tools to perform comparative genomic or transcriptomic analysis. PATRIC also includes integrated information related to disease and PPIs. All the data and integrated analysis and visualization tools are freely available. This manuscript describes updates to the PATRIC since its initial report in the 2007 NAR Database Issue.
Affiliation(s)
- Alice R Wattam
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24060, USA, Computation Institute, University of Chicago, Chicago, IL 60637, USA, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60637, USA, Grado Department of Industrial & Systems Engineering, Virginia Tech, Blacksburg, VA 24060, USA, Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD 21201, USA, Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA, Computing, Environment, and Life Sciences, Argonne National Laboratory, Argonne, IL 60637, USA and Nestlé Institute of Health Sciences SA, Campus EPFL, Quartier de L'innovation, Lausanne, Switzerland
|
12
|
Abstract
Summary: Annotation of metagenomes involves comparing the individual sequence reads with a database of known sequences and assigning a unique function to each read. This is a time-consuming task that is computationally intensive (though not computationally complex). Here we present a novel approach to annotate metagenomes using unique k-mer oligopeptide sequences from 7 to 12 amino acids long. We demonstrate that k-mer-based annotations are faster and approach the sensitivity and precision of blastx-based annotations without losing accuracy. A last-common ancestor approach was also developed to describe the members of the community. Availability and implementation: This open-source application was implemented in Perl and can be accessed via a user-friendly website at http://edwards.sdsu.edu/rtmg. In addition, code to access the annotation servers is available for download from http://www.theseed.org/. FIGfams and k-mers are available for download from ftp://ftp.theseed.org/FIGfams/. Contact: redwards@mail.sdsu.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Robert A Edwards
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA.
|
13
|
Binter E, Binter S, Disz T, Kalmanek E, Powers A, Pusch GD, Turgeon J. Grounding annotations in published literature with an emphasis on the functional roles used in metabolic models. 3 Biotech 2011. [PMCID: PMC3376863 DOI: 10.1007/s13205-011-0039-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022] Open
Abstract
Accurate genome annotations in databases are a critical resource available to the scientific community for analysis and research. Inaccurate and inconsistent annotations exist as a result of errors generated from mass automated annotation, and currently act as a barrier to the application of bioinformatics. The purpose of this effort was to improve the SEED by improving the connection of functional roles to literature references. Direct literature references (DLits), found through searches of PubMed and other online databases such as SwissProt, were attached to protein sequences within the PubSEED to provide literature support for the roughly 2,500 distinct functional roles used to construct metabolic models within the Model SEED. Only DLits in which a researcher asserted the function of a protein were attached to sequences. Starting from a list of 1,072 functional roles that did not previously have DLit support, we were able to connect sequences to literature for 655 functional roles, at least 484 of which were in the original list of unsupported roles. When added to the existing set of sequences having DLits, the resulting set of DLit-sequence pairs (the foundation set) now connects approximately 4,300 DLits to approximately 5,600 distinct protein sequences obtained from approximately 16,000 genes (some of these genes have identical protein sequences). From the foundation set, we construct projection sets such that each set contains one member of the foundation set and projections of its functional role onto similar genes. The projection sets revealed 120 inconsistent annotations within the SEED. Two types of inconsistencies were corrected through manual annotation in the PubSEED: instances in which two identical protein sequences had been annotated with different functions, and instances when projected functions contradicted previous annotations. 
26,785 changes to gene function assignment, 219 of which were to previously uncharacterized proteins, resulted in a more consistent and accurate set of input data from which to construct revised metabolic models within the Model SEED.
Affiliation(s)
- Erik Binter
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
| | - Scott Binter
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
| | - Terry Disz
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
| | - Elizabeth Kalmanek
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
| | - Alexander Powers
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
| | - Gordon D. Pusch
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
- Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527 USA
| | - Julie Turgeon
- Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Avenue, Argonne, IL 60439 USA
|
14
|
Boissy R, Ahmed A, Janto B, Earl J, Hall BG, Hogg JS, Pusch GD, Hiller LN, Powell E, Hayes J, Yu S, Kathju S, Stoodley P, Post JC, Ehrlich GD, Hu FZ. Comparative supragenomic analyses among the pathogens Staphylococcus aureus, Streptococcus pneumoniae, and Haemophilus influenzae using a modification of the finite supragenome model. BMC Genomics 2011; 12:187. [PMID: 21489287 PMCID: PMC3094309 DOI: 10.1186/1471-2164-12-187] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Received: 02/02/2011] [Accepted: 04/13/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Staphylococcus aureus is associated with a spectrum of symbiotic relationships with its human host from carriage to sepsis and is frequently associated with nosocomial and community-acquired infections, thus the differential gene content among strains is of interest. RESULTS: We sequenced three clinical strains and combined these data with 13 publicly available human isolates and one bovine strain for comparative genomic analyses. All genomes were annotated using RAST, and then their gene similarities and differences were delineated. Gene clustering yielded 3,155 orthologous gene clusters, of which 2,266 were core, 755 were distributed, and 134 were unique. Individual genomes contained between 2,524 and 2,648 genes. Gene-content comparisons among all possible S. aureus strain pairs (n = 136) revealed a mean difference of 296 genes and a maximum difference of 476 genes. We developed a revised version of our finite supragenome model to estimate the size of the S. aureus supragenome (3,221 genes, with 2,245 core genes), and compared it with those of Haemophilus influenzae and Streptococcus pneumoniae. There was excellent agreement between RAST's annotations and our CDS clustering procedure, providing for high-fidelity metabolomic subsystem analyses to extend our comparative genomic characterization of these strains. CONCLUSIONS: Using a multi-species comparative supragenomic analysis enabled by an improved version of our finite supragenome model, we provide data and an interpretation explaining the relatively larger core genome of S. aureus compared to other opportunistic nasopharyngeal pathogens. In addition, we provide independent validation for the efficiency and effectiveness of our orthologous gene clustering algorithm.
Affiliation(s)
- Robert Boissy
- Center for Genomic Sciences, Allegheny-Singer Research Institute, Pittsburgh, PA 15212, USA
|
15
|
Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O. The RAST Server: rapid annotations using subsystems technology. BMC Genomics 2008; 9:75. [PMID: 18261238 PMCID: PMC2265698 DOI: 10.1186/1471-2164-9-75] [Citation(s) in RCA: 8489] [Impact Index Per Article: 530.6] [Received: 09/12/2007] [Accepted: 02/08/2008] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND: The number of prokaryotic genome sequences becoming available is growing steadily and is growing faster than our ability to accurately annotate them. DESCRIPTION: We describe a fully automated service for annotating bacterial and archaeal genomes. The service identifies protein-encoding, rRNA and tRNA genes, assigns functions to the genes, predicts which subsystems are represented in the genome, uses this information to reconstruct the metabolic network and makes the output easily downloadable for the user. In addition, the annotated genome can be browsed in an environment that supports comparative analysis with the annotated genomes maintained in the SEED environment. The service normally makes the annotated genome available within 12-24 hours of submission, but ultimately the quality of such a service will be judged in terms of accuracy, consistency, and completeness of the produced annotations. We summarize our attempts to address these issues and discuss plans for incrementally enhancing the service. CONCLUSION: By providing accurate, rapid annotation freely to the community we have created an important community resource. The service has now been utilized by over 120 external users annotating over 350 distinct genomes.
Affiliation(s)
- Ramy K Aziz
- Fellowship for Interpretation of Genomes, Burr Ridge, IL 60527, USA.
|
16
|
Chen Y, Johnson JA, Pusch GD, Morris JG, Stine OC. The genome of non-O1 Vibrio cholerae NRT36S demonstrates the presence of pathogenic mechanisms that are distinct from those of O1 Vibrio cholerae. Infect Immun 2007; 75:2645-7. [PMID: 17283087 PMCID: PMC1865779 DOI: 10.1128/iai.01317-06] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.6] [Indexed: 11/20/2022] Open
Abstract
Vibrio cholerae NRT36S is a non-cholera toxin-producing, non-O1 strain that causes diarrhea in volunteers. The genome of NRT36S was sequenced to create a draft containing 174 contigs plus the superintegron region. Our analysis of the draft genome revealed several putative toxin genes and colonization factors. Besides confirming the existence of nonagglutinable heat-stable toxin, we also identified the genes for a type three secretion system, a putative exotoxin, two different RTX toxins, and four pilus systems.
Affiliation(s)
- Yuansha Chen
- University of Maryland, Howard Hall 585, 660 W. Redwood Street, Baltimore, MD 21201, USA
|
17
|
McNeil LK, Reich C, Aziz RK, Bartels D, Cohoon M, Disz T, Edwards RA, Gerdes S, Hwang K, Kubal M, Margaryan GR, Meyer F, Mihalo W, Olsen GJ, Olson R, Osterman A, Paarmann D, Paczian T, Parrello B, Pusch GD, Rodionov DA, Shi X, Vassieva O, Vonstein V, Zagnitko O, Xia F, Zinner J, Overbeek R, Stevens R. The National Microbial Pathogen Database Resource (NMPDR): a genomics platform based on subsystem annotation. Nucleic Acids Res 2006; 35:D347-53. [PMID: 17145713 PMCID: PMC1751540 DOI: 10.1093/nar/gkl947] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.1] [Indexed: 11/14/2022] Open
Abstract
The National Microbial Pathogen Data Resource (NMPDR) is a National Institute of Allergy and Infectious Diseases (NIAID)-funded Bioinformatics Resource Center that supports research in selected Category B pathogens. NMPDR contains the complete genomes of ∼50 strains of pathogenic bacteria that are the focus of our curators, as well as >400 other genomes that provide a broad context for comparative analysis across the three phylogenetic Domains. NMPDR integrates complete, public genomes with expertly curated biological subsystems to provide the most consistent genome annotations. Subsystems are sets of functional roles related by a biologically meaningful organizing principle, which are built over large collections of genomes; they provide researchers with consistent functional assignments in a biologically structured context. Investigators can browse subsystems and reactions to develop accurate reconstructions of the metabolic networks of any sequenced organism. NMPDR provides a comprehensive bioinformatics platform, with tools and viewers for genome analysis. Results of precomputed gene clustering analyses can be retrieved in tabular or graphic format with one-click tools. NMPDR tools include Signature Genes, which finds the set of genes in common or that differentiates two groups of organisms. Essentiality data collated from genome-wide studies have been curated. Drug target identification and high-throughput, in silico, compound screening are in development.
|
18
|
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crécy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Rückert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005; 33:5691-702. [PMID: 16214803 PMCID: PMC1251668 DOI: 10.1093/nar/gki866] [Citation(s) in RCA: 1424] [Impact Index Per Article: 74.9] [Indexed: 11/12/2022] Open
Abstract
The release of the 1000th complete microbial genome will occur in the next two to three years. In anticipation of this milestone, the Fellowship for Interpretation of Genomes (FIG) launched the Project to Annotate 1000 Genomes. The project is built around the principle that the key to improved accuracy in high-throughput annotation technology is to have experts annotate single subsystems over the complete collection of genomes, rather than having an annotation expert attempt to annotate all of the genes in a single genome. Using the subsystems approach, all of the genes implementing the subsystem are analyzed by an expert in that subsystem. An annotation environment was created where populated subsystems are curated and projected to new genomes. A portable notion of a populated subsystem was defined, and tools developed for exchanging and curating these objects. Tools were also developed to resolve conflicts between populated subsystems. The SEED is the first annotation environment that supports this model of annotation. Here, we describe the subsystem approach, and offer the first release of our growing library of populated subsystems. The initial release of data includes 180 177 distinct proteins with 2133 distinct functional roles. This data comes from 173 subsystems and 383 different organisms.
Affiliation(s)
- Ross Overbeek
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Tadhg Begley
  - Department of Chemistry and Chemical Biology, Cornell University, Ithaca, NY 14853, USA
- Ralph M. Butler
  - Computer Science Dept, Middle Tennessee State University, Murfreesboro, TN 37132, USA
- Jomuna V. Choudhuri
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Matthew Cohoon
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
- Valérie de Crécy-Lagard
  - Departments of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Naryttza Diaz
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Terry Disz
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Robert Edwards
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
  - Center for Microbial Sciences, San Diego State University, San Diego, CA 92813, USA
  - The Burnham Institute, San Diego, CA 92037, USA
- Michael Fonstein
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
  - Cleveland BioLabs, Inc., Cleveland, OH 44106, USA
- Ed D. Frank
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Svetlana Gerdes
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Elizabeth M. Glass
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Alexander Goesmann
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Andrew Hanson
  - Department of Horticultural Science, University of Florida, Gainesville, FL 32611, USA
- Dirk Iwata-Reuyl
  - Department of Chemistry, Portland State University, Portland, OR 97207, USA
- Roy Jensen
  - Emerson Hall, University of Florida, PO Box 14425, Gainesville, FL 32604, USA
- Lutz Krause
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Michael Kubal
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Niels Larsen
  - Danish Genome Institute, Gustav Wieds vej 10 C, DK-8000 Aarhus C, Denmark
- Burkhard Linke
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Alice C. McHardy
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Folker Meyer
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Heiko Neuweger
  - Center for Biotechnology, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Gary Olsen
  - Department of Microbiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801
- Robert Olson
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
- Andrei Osterman
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
  - The Burnham Institute, San Diego, CA 92037, USA
- Gordon D. Pusch
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Dmitry A. Rodionov
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Christian Rückert
  - International NRW Graduate School in Bioinformatics & Genome Research, Institute for Genome Research, Bielefeld University, 33594 Bielefeld, Germany
- Rick Stevens
  - Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
  - Computation Institute, University of Chicago, Chicago, IL 60637, USA
- Ines Thiele
  - University of California, San Diego, CA 92093, USA
- Olga Vassieva
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Yuzhen Ye
  - The Burnham Institute, San Diego, CA 92037, USA
- Olga Zagnitko
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
- Veronika Vonstein
  - Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, IL 60527, USA
  - To whom correspondence should be addressed. Tel: +1 630 325 4178; Fax: +1 630 325 4179;
|
19
|
Bolotin A, Quinquis B, Renault P, Sorokin A, Ehrlich SD, Kulakauskas S, Lapidus A, Goltsman E, Mazur M, Pusch GD, Fonstein M, Overbeek R, Kyprides N, Purnelle B, Prozzi D, Ngui K, Masuy D, Hancy F, Burteau S, Boutry M, Delcour J, Goffeau A, Hols P. Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus. Nat Biotechnol 2004; 22:1554-8. [PMID: 15543133 PMCID: PMC7416660 DOI: 10.1038/nbt1034] [Citation(s) in RCA: 357] [Impact Index Per Article: 17.9] [Received: 07/08/2004] [Accepted: 09/21/2004] [Indexed: 02/06/2023]
Abstract
The lactic acid bacterium Streptococcus thermophilus is widely used for the manufacture of yogurt and cheese. This dairy species of major economic importance is phylogenetically close to pathogenic streptococci, raising the possibility that it has a potential for virulence. Here we report the genome sequences of two yogurt strains of S. thermophilus. We found a striking level of gene decay (10% pseudogenes) in both microorganisms. Many genes involved in carbon utilization are nonfunctional, in line with the paucity of carbon sources in milk. Notably, most streptococcal virulence-related genes that are not involved in basic cellular processes are either inactivated or absent in the dairy streptococcus. Adaptation to the constant milk environment appears to have resulted in the stabilization of the genome structure. We conclude that S. thermophilus has evolved mainly through loss-of-function events that remarkably mirror the environment of the dairy niche resulting in a severely diminished pathogenic potential.
Affiliation(s)
- Alexander Bolotin
  - Génétique Microbienne, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- Benoît Quinquis
  - Génétique Microbienne, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- Pierre Renault
  - Génétique Microbienne, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- Alexei Sorokin
  - Génétique Microbienne, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- S Dusko Ehrlich
  - Génétique Microbienne, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- Saulius Kulakauskas
  - Unité de Recherche Laitière et Génétique Appliquée, Centre de Recherche de Jouy en Josas, Institut National de la Recherche Agronomique, 78352 Jouy en Josas Cedex, France
- Alla Lapidus
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Microbial Genomics, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, B400, Walnut Creek, California 94598, USA
- Eugene Goltsman
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Microbial Genomics, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, B400, Walnut Creek, California 94598, USA
- Gordon D Pusch
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, Illinois 60527, USA
- Michael Fonstein
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Cleveland BioLabs, Inc., 10265 Carnegie Ave., Cleveland, Ohio 44106
- Ross Overbeek
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Fellowship for Interpretation of Genomes, 15W155 81st Street, Burr Ridge, Illinois 60527, USA
- Nikos Kyprides
  - Integrated Genomics, Chicago, Illinois 60612, USA
  - Present Address: Microbial Genomics, Department of Energy Joint Genome Institute, 2800 Mitchell Drive, B400, Walnut Creek, California 94598, USA
- Bénédicte Purnelle
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Deborah Prozzi
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Katrina Ngui
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
  - Present Address: Department of Anatomy and Cell Biology, University of Melbourne, Victoria 3010, Australia
- David Masuy
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Frédéric Hancy
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Sophie Burteau
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
  - Present Address: Unité de Recherche en Biologie Cellulaire, Facultés Universitaires Notre-Dame de la Paix, 61 Rue de Bruxelles, 5000 Namur, Belgium
- Marc Boutry
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Jean Delcour
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- André Goffeau
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
- Pascal Hols
  - Institut des Sciences de la Vie, Université Catholique de Louvain, 1348 Louvain-la-Neuve, Belgium
|
20
|
Farahi K, Pusch GD, Overbeek R, Whitman WB. Detection of Lateral Gene Transfer Events in the Prokaryotic tRNA Synthetases by the Ratios of Evolutionary Distances Method. J Mol Evol 2004; 58:615-31. [PMID: 15170264 DOI: 10.1007/s00239-004-2582-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 12/23/2002] [Accepted: 12/11/2003] [Indexed: 10/26/2022]
Abstract
The availability of large numbers of genomic sequences has demonstrated the importance of lateral gene transfer (LGT) in prokaryotic evolution. However, considerable uncertainty remains concerning the frequency of LGT compared to other evolutionary processes. To examine LGTs in ancient lineages of prokaryotes a method was developed that utilizes the ratios of evolutionary distances (RED) to distinguish between alternative evolutionary histories. The advantages of this approach are that the variability inherent in comparing protein sequences is transparent, the direction of LGT and the relative rates of evolution are readily identified, and it is possible to detect other types of evolutionary events. This method was standardized using 35 genes encoding ribosomal proteins that were believed to share a vertical evolution. Using RED-T, an original computer program designed to implement the RED method, the evolution of the genes encoding the 20 aminoacyl-tRNA synthetases was examined. Although LGTs were common in the evolution of the aminoacyl-tRNA synthetases, they were not sufficient to obscure the organismal phylogeny. Moreover, much of the apparent complexity of the gene tree was consistent with the formation of the paralogs in the ancestors to the modern lineages followed by more recent loss of one paralog or the other.
Affiliation(s)
- Kamyar Farahi
  - Department of Microbiology, 527 Biological Sciences, University of Georgia, Athens, GA 30602-2605, USA
|
21
|
Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N. Use of contiguity on the chromosome to predict functional coupling. In Silico Biol 2001; 1:93-108. [PMID: 11471247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2023]
Abstract
The availability of a growing number of completely sequenced genomes opens new opportunities for understanding of complex biological systems. Success of genome-based biology will, to a large extent, depend on the development of new approaches and tools for efficient comparative analysis of the genomes and their organization. We have developed a technique for detecting possible functional coupling between genes based on detection of potential operons. The approach involves computation of "pairs of close bidirectional best hits", which are pairs of genes that apparently occur within operons in multiple genomes. Using these pairs, one can compose evidence (based on the number of distinct genomes and the phylogenetic distance between the orthologous pairs) that a pair of genes is potentially functionally coupled. The technique has revealed a surprisingly rich and apparently accurate set of functionally coupled genes. The approach depends on the use of a relatively large number of genomes, and the amount of detected coupling grows dramatically as the number of genomes increases.
Affiliation(s)
- R Overbeek
  - Mathematics and Computer Science Division, Argonne National Laboratory, IL 60439, USA.
|
22
|
Overbeek R, Larsen N, Pusch GD, D'Souza M, Selkov E, Kyrpides N, Fonstein M, Maltsev N, Selkov E. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res 2000; 28:123-5. [PMID: 10592199 PMCID: PMC102471 DOI: 10.1093/nar/28.1.123] [Citation(s) in RCA: 261] [Impact Index Per Article: 10.9] [Received: 09/01/1999] [Revised: 10/13/1999] [Accepted: 10/13/1999] [Indexed: 11/12/2022] Open
Abstract
The WIT (What Is There) (http://wit.mcs.anl.gov/WIT2/) system has been designed to support comparative analysis of sequenced genomes and to generate metabolic reconstructions based on chromosomal sequences and metabolic modules from the EMP/MPW family of databases. This system contains data derived from about 40 completed or nearly completed genomes. Sequence homologies, various ORF-clustering algorithms, relative gene positions on the chromosome and placement of gene products in metabolic pathways (metabolic reconstruction) can be used for the assignment of gene functions and for development of overviews of genomes within WIT. The integration of a large number of phylogenetically diverse genomes in WIT facilitates the understanding of the physiology of different organisms.
Affiliation(s)
- R Overbeek
  - Integrated Genomics Inc., 2201 W. Campbell Park Drive, Chicago, IL 60612, USA.
|
23
|
Abstract
Previously, we presented evidence that it is possible to predict functional coupling between genes based on conservation of gene clusters between genomes. With the rapid increase in the availability of prokaryotic sequence data, it has become possible to verify and apply the technique. In this paper, we extend our characterization of the parameters that determine the utility of the approach, and we generalize the approach in a way that supports detection of common classes of functionally coupled genes (e.g., transport and signal transduction clusters). Now that the analysis includes over 30 complete or nearly complete genomes, it has become clear that this approach will play a significant role in supporting efforts to assign functionality to the remaining uncharacterized genes in sequenced genomes.
Affiliation(s)
- R Overbeek
  - Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439-4844, USA.
|
24
|
|