1
Du G, Wu J, Zhang C, Cao X, Li L, He J, Zhang Y, Shang Y. The whole genomic analysis of the Orf virus strains ORFV-SC and ORFV-SC1 from the Sichuan province and their weak pathological response in rabbits. Funct Integr Genomics 2023; 23:163. [PMID: 37188892 DOI: 10.1007/s10142-023-01079-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 04/27/2023] [Accepted: 04/28/2023] [Indexed: 05/17/2023]
Abstract
The Orf virus (ORFV) is a member of the Parapoxvirus genus of the Poxviridae family and can cause contagious diseases in sheep, goats, and wild ungulates. In the present study, two ORFV isolates (ORFV-SC isolated from Sichuan province and ORFV-SC1 produced by 60 passages of ORFV-SC in cells) were sequenced and compared to multiple ORFVs. The two ORFV sequences had entire genome sizes of 140,707 bp and 141,154 bp, respectively, containing 130 and 131 genes, with a G + C content of 63% for the ORFV-SC sequence and 63.9% for the ORFV-SC1 sequence. Alignment of ORFV-SC and ORFV-SC1 with five other ORFV isolates revealed that ORFV-SC, ORFV-SC1, and NA1/11 shared >95% nucleotide identity in 109 genes. Five genes (ORF007, ORF020, ORF080, ORF112, ORF116) have low amino acid identity between ORFV-SC and ORFV-SC1. Amino acid mutations result in changes in the secondary and tertiary structure of the ORF007, ORF020, and ORF112 proteins. The phylogenetic tree based on the complete genome sequence and 37 single genes revealed that the two ORFV isolates originated from sheep. Finally, animal experiments demonstrated that ORFV-SC1 is less harmful to rabbits than ORFV-SC. The exploration of two full-length viral genome sequences provides valuable information for ORFV biology and epidemiology research. Furthermore, ORFV-SC1 demonstrated an acceptable safety profile following animal vaccination, indicating its potential as a live ORFV vaccine.
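The G + C content quoted in this abstract is the share of guanine and cytosine bases in a genome sequence. As a minimal sketch (not the authors' pipeline), it could be computed from a plain sequence string as follows; the function name `gc_content` and the toy sequence are illustrative assumptions, and ignoring ambiguous bases such as N is a choice made here, not something the abstract specifies.

```python
def gc_content(sequence: str) -> float:
    """Return the fraction of G and C among the unambiguous bases A, C, G, T.

    Ambiguous symbols (for example N) are skipped; this is an assumption
    of the sketch, not a rule taken from the cited study.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


# Toy sequence for illustration only (not ORFV data).
toy_sequence = "GGCCGATCGCAT"
print(f"G + C content: {gc_content(toy_sequence):.1%}")
```

For a real genome the string would typically be read from a FASTA file with the header lines removed before calling the function.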
Affiliation(s)
- Guoyu Du
  - College of Veterinary Medicine, Gansu Agricultural University, Lanzhou, 730046, China
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Jinyan Wu
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Cheng Zhang
  - State Key Laboratory of Veterinary Etiological Biology, Lanzhou Institute of Veterinary Research (CAAS) Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Xiaoan Cao
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Lingxia Li
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Jijun He
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
- Yong Zhang
  - College of Veterinary Medicine, Gansu Agricultural University, Lanzhou, 730046, China
- Youjun Shang
  - State Key Laboratory for Animal Disease Control and Prevention, Lanzhou Veterinary Research Institute, Chinese Academy of Agricultural Sciences, Lanzhou, 730046, China
2
Poudel S, Cope AL, O'Dell KB, Guss AM, Seo H, Trinh CT, Hettich RL. Identification and characterization of proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 strains as potential genetic engineering targets. Biotechnology for Biofuels 2021; 14:116. [PMID: 33971924 PMCID: PMC8112048 DOI: 10.1186/s13068-021-01964-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/29/2020] [Accepted: 04/26/2021] [Indexed: 05/13/2023]
Abstract
BACKGROUND Mass spectrometry-based proteomics can identify and quantify thousands of proteins from individual microbial species, but a significant percentage of these proteins are unannotated and hence classified as proteins of unknown function (PUFs). Due to the difficulty in extracting meaningful metabolic information, PUFs are often overlooked or discarded during data analysis, even though they might be critically important in functional activities, in particular for metabolic engineering research. RESULTS We optimized and employed a pipeline integrating various "guilt-by-association" (GBA) metrics, including differential expression and co-expression analyses of high-throughput mass spectrometry proteome data and phylogenetic coevolution analysis, and sequence homology-based approaches to determine putative functions for PUFs in Clostridium thermocellum. Our various analyses provided putative functional information for over 95% of the PUFs detected by mass spectrometry in a wild-type and/or an engineered strain of C. thermocellum. In particular, we validated a predicted acyltransferase PUF (WP_003519433.1) with functional activity towards 2-phenylethyl alcohol, consistent with our GBA and sequence homology-based predictions. CONCLUSIONS This work demonstrates the value of leveraging sequence homology-based annotations with empirical evidence based on the concept of GBA to broadly predict putative functions for PUFs, opening avenues to further interrogation via targeted experiments.
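One of the "guilt-by-association" signals named in this abstract is co-expression: an unannotated protein whose abundance tracks that of annotated proteins across conditions may share their function. The sketch below illustrates only that idea with a plain Pearson correlation; it is not the authors' pipeline, and the function names, protein labels, and abundance values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length abundance profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def rank_coexpressed(target_profile, annotated_profiles):
    """Rank annotated proteins by correlation with the target profile."""
    scores = [
        (name, pearson(target_profile, profile))
        for name, profile in annotated_profiles.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical abundances across four conditions (illustration only).
puf_profile = [1.0, 2.0, 3.0, 4.0]
annotated = {
    "protein_A": [2.0, 4.0, 6.0, 8.0],   # rises with the PUF
    "protein_B": [4.0, 3.0, 2.0, 1.0],   # falls as the PUF rises
    "protein_C": [1.0, 3.0, 2.0, 4.0],   # partly related
}
for name, score in rank_coexpressed(puf_profile, annotated):
    print(f"{name}: r = {score:.2f}")
```

The top-ranked annotated proteins would then serve as candidates whose annotations hint at the function of the unannotated protein, to be weighed together with the other evidence types the abstract lists.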
Affiliation(s)
- Suresh Poudel
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
  - The Center for Bioenergy Innovation at Oak Ridge National Laboratory, Oak Ridge, TN, USA
  - The Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN, USA
- Alexander L Cope
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
  - The Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN, USA
- Kaela B O'Dell
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
  - The Center for Bioenergy Innovation at Oak Ridge National Laboratory, Oak Ridge, TN, USA
  - The Bredesen Center, University of Tennessee, Knoxville, TN, USA
- Adam M Guss
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
  - The Bredesen Center, University of Tennessee, Knoxville, TN, USA
- Hyeongmin Seo
  - The Center for Bioenergy Innovation at Oak Ridge National Laboratory, Oak Ridge, TN, USA
  - Department of Chemical and Biomolecular Engineering, University of Tennessee, Knoxville, TN, USA
- Cong T Trinh
  - The Center for Bioenergy Innovation at Oak Ridge National Laboratory, Oak Ridge, TN, USA
  - The Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, TN, USA
  - The Bredesen Center, University of Tennessee, Knoxville, TN, USA
  - Department of Chemical and Biomolecular Engineering, University of Tennessee, Knoxville, TN, USA
- Robert L Hettich
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, 37831, USA
3
Frederick J, Hennessy F, Horn U, de la Torre Cortés P, van den Broek M, Strych U, Willson R, Hefer CA, Daran JMG, Sewell T, Otten LG, Brady D. The complete genome sequence of the nitrile biocatalyst Rhodocccus rhodochrous ATCC BAA-870. BMC Genomics 2020; 21:3. [PMID: 31898479 PMCID: PMC6941271 DOI: 10.1186/s12864-019-6405-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/16/2019] [Accepted: 12/16/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Rhodococci are industrially important soil-dwelling Gram-positive bacteria that are well known for both nitrile hydrolysis and oxidative metabolism of aromatics. Rhodococcus rhodochrous ATCC BAA-870 is capable of metabolising a wide range of aliphatic and aromatic nitriles and amides. The genome of the organism was sequenced and analysed in order to better understand this whole cell biocatalyst. RESULTS The genome of R. rhodochrous ATCC BAA-870 is the first Rhodococcus genome fully sequenced using Nanopore sequencing. The circular genome contains 5.9 megabase pairs (Mbp) and includes a 0.53 Mbp linear plasmid, that together encode 7548 predicted protein sequences according to BASys annotation, and 5535 predicted protein sequences according to RAST annotation. The genome contains numerous oxidoreductases, 15 identified antibiotic and secondary metabolite gene clusters, several terpene and nonribosomal peptide synthetase clusters, as well as 6 putative clusters of unknown type. The 0.53 Mbp plasmid encodes 677 predicted genes and contains the nitrile converting gene cluster, including a nitrilase, a low molecular weight nitrile hydratase, and an enantioselective amidase. Although there are fewer biotechnologically relevant enzymes compared to those found in rhodococci with larger genomes, such as the well-known Rhodococcus jostii RHA1, the abundance of transporters in combination with the myriad of enzymes found in strain BAA-870 might make it more suitable for use in industrially relevant processes than other rhodococci. CONCLUSIONS The sequence and comprehensive description of the R. rhodochrous ATCC BAA-870 genome will facilitate the additional exploitation of rhodococci for biotechnological applications, as well as enable further characterisation of this model organism. 
The genome encodes a wide range of enzymes, many with unknown substrate specificities supporting potential applications in biotechnology, including nitrilases, nitrile hydratase, monooxygenases, cytochrome P450s, reductases, proteases, lipases, and transaminases.
Affiliation(s)
- Joni Frederick
  - Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa
  - Electron Microscope Unit, University of Cape Town, Rondebosch, 7701 South Africa
  - Present Address: LadHyx, UMR CNRS 7646, École Polytechnique, 91128 Palaiseau, France
- Fritha Hennessy
  - Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa
- Uli Horn
  - Meraka, CSIR, Meiring Naude Road, Brummeria, 0091 South Africa
- Pilar de la Torre Cortés
  - Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
- Marcel van den Broek
  - Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
- Ulrich Strych
  - Biology and Biochemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA
  - Present Address: Department of Pediatrics, Section of Tropical Medicine, Baylor College of Medicine, 1102 Bates Avenue, Houston, TX 77030 USA
- Richard Willson
  - Biology and Biochemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA
  - Chemical and Biomolecular Engineering, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA
- Charles A. Hefer
  - Bioinformatics and Computational Biology Unit, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, 0002 South Africa
  - Present Address: AgResearch Limited, Lincoln Research Centre, Private Bag 4749, Christchurch, 8140 New Zealand
- Jean-Marc G. Daran
  - Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
- Trevor Sewell
  - Electron Microscope Unit, University of Cape Town, Rondebosch, 7701 South Africa
- Linda G. Otten
  - Biocatalysis, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
- Dean Brady
  - Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa
  - Molecular Sciences Institute, School of Chemistry, University of the Witwatersrand, PO, Wits, 2050 South Africa
4
Komárek J, Ivanov Kavková E, Houser J, Horáčková A, Ždánská J, Demo G, Wimmerová M. Structure and properties of AB21, a novel Agaricus bisporus protein with structural relation to bacterial pore-forming toxins. Proteins 2018; 86:897-911. [DOI: 10.1002/prot.25522] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/15/2018] [Revised: 04/23/2018] [Accepted: 04/26/2018] [Indexed: 12/13/2022]
Affiliation(s)
- Jan Komárek
  - Central European Institute of Technology, Masaryk University, Kamenice 5; Brno 62500 Czech Republic
  - National Centre for Biomolecular Research; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
- Eva Ivanov Kavková
  - Department of Biochemistry; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
- Josef Houser
  - Central European Institute of Technology, Masaryk University, Kamenice 5; Brno 62500 Czech Republic
  - National Centre for Biomolecular Research; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
- Aneta Horáčková
  - Department of Biochemistry; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
- Jitka Ždánská
  - Central European Institute of Technology, Masaryk University, Kamenice 5; Brno 62500 Czech Republic
- Gabriel Demo
  - Central European Institute of Technology, Masaryk University, Kamenice 5; Brno 62500 Czech Republic
  - National Centre for Biomolecular Research; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
- Michaela Wimmerová
  - Central European Institute of Technology, Masaryk University, Kamenice 5; Brno 62500 Czech Republic
  - National Centre for Biomolecular Research; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
  - Department of Biochemistry; Faculty of Science, Masaryk University, Kotlarska 2; Brno 61137 Czech Republic
5
Bouadjenek MR, Verspoor K, Zobel J. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. Database: The Journal of Biological Databases and Curation 2017; 2017:3074790. [PMID: 28365737 PMCID: PMC5467556 DOI: 10.1093/database/bax021] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/01/2016] [Accepted: 02/20/2017] [Indexed: 11/18/2022]
Abstract
Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records. Database URL: https://github.com/rbouadjenek/DQBioinformatics
6
Abstract
Catalysts are a vital part of synthetic chemistry. However, there are still many important reactions for which catalysts have not been developed. The use of enzymes as biocatalysts for synthetic chemistry is growing in importance due to the drive towards sustainable methods for producing both bulk chemicals and high value compounds such as pharmaceuticals, and due to the ability of enzymes to catalyse chemical reactions with excellent stereoselectivity and regioselectivity. Such challenging transformations are a common feature of natural product biosynthetic pathways. In this mini-review, we discuss the potential to use biosynthetic pathways as a starting point for biocatalyst discovery. We introduce the reader to natural product assembly and tailoring, then focus on four classes of enzyme that catalyse C─H bond activation reactions to functionalize biosynthetic precursors. Finally, we briefly discuss the challenges involved in novel enzyme discovery.
7
Kumar G, Johnson JL, Frantom PA. Improving Functional Annotation in the DRE-TIM Metallolyase Superfamily through Identification of Active Site Fingerprints. Biochemistry 2016; 55:1863-72. [PMID: 26935545 DOI: 10.1021/acs.biochem.5b01193] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
Within the DRE-TIM metallolyase superfamily, members of the Claisen-like condensation (CC-like) subgroup catalyze C-C bond-forming reactions between various α-ketoacids and acetyl-coenzyme A. These reactions are important in the metabolic pathways of many bacterial pathogens and serve as engineering scaffolds for the production of long-chain alcohol biofuels. To improve functional annotation and identify sequences that might use novel substrates in the CC-like subgroup, a combination of structural modeling and multiple-sequence alignments identified active site residues on the third, fourth, and fifth β-strands of the TIM-barrel catalytic domain that are differentially conserved within the substrate-diverse enzyme families. Using α-isopropylmalate synthase and citramalate synthase from Methanococcus jannaschii (MjIPMS and MjCMS), site-directed mutagenesis was used to test the role of each identified position in substrate selectivity. Kinetic data suggest that residues at the β3-5 and β4-7 positions play a significant role in the selection of α-ketoisovalerate over pyruvate in MjIPMS. However, complementary substitutions in MjCMS fail to alter substrate specificity, suggesting residues in these positions do not contribute to substrate selectivity in this enzyme. Analysis of the kinetic data with respect to a protein similarity network for the CC-like subgroup suggests that evolutionarily distinct forms of IPMS utilize residues at the β3-5 and β4-7 positions to affect substrate selectivity while the different versions of CMS use unique architectures. Importantly, mapping the identities of residues at the β3-5 and β4-7 positions onto the protein similarity network allows for rapid annotation of probable IPMS enzymes as well as several outlier sequences that may represent novel functions in the subgroup.
Affiliation(s)
- Garima Kumar
  - Department of Chemistry, The University of Alabama, 250 Hackberry Lane, Tuscaloosa, Alabama 35487, United States
- Jordyn L Johnson
  - Department of Chemistry, The University of Alabama, 250 Hackberry Lane, Tuscaloosa, Alabama 35487, United States
- Patrick A Frantom
  - Department of Chemistry, The University of Alabama, 250 Hackberry Lane, Tuscaloosa, Alabama 35487, United States
8
Neuhaus K, Landstorfer R, Fellner L, Simon S, Schafferhans A, Goldberg T, Marx H, Ozoline ON, Rost B, Kuster B, Keim DA, Scherer S. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC). BMC Genomics 2016; 17:133. [PMID: 26911138 PMCID: PMC4765031 DOI: 10.1186/s12864-016-2456-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/27/2015] [Accepted: 02/09/2016] [Indexed: 12/30/2022]
Abstract
Background Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome). Results Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other Enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions, suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization. Conclusions These findings demonstrate that ribosomal footprinting can be used to detect novel protein-coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.
Affiliation(s)
- Klaus Neuhaus
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
- Richard Landstorfer
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
- Lea Fellner
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
- Svenja Simon
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany
- Andrea Schafferhans
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany
- Tatyana Goldberg
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany
- Harald Marx
  - Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany
- Olga N Ozoline
  - Institute of Cell Biophysics, Russian Academy of Sciences, Moscow Region, 142290, Pushchino, Russia
- Burkhard Rost
  - Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany
- Bernhard Kuster
  - Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany
  - Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technische Universität München, Gregor-Mendel-Str. 4, 85354, Freising, Germany
- Daniel A Keim
  - Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany
- Siegfried Scherer
  - Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany
9
Andrews FH, Horton JD, Shin D, Yoon HJ, Logsdon MG, Malik AM, Rogers MP, Kneen MM, Suh SW, McLeish MJ. The kinetic characterization and X-ray structure of a putative benzoylformate decarboxylase from M. smegmatis highlights the difficulties in the functional annotation of ThDP-dependent enzymes. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1854:1001-9. [DOI: 10.1016/j.bbapap.2015.04.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/30/2014] [Revised: 04/05/2015] [Accepted: 04/23/2015] [Indexed: 10/23/2022]
10
Kuznetsova E, Nocek B, Brown G, Makarova KS, Flick R, Wolf YI, Khusnutdinova A, Evdokimova E, Jin K, Tan K, Hanson AD, Hasnain G, Zallot R, de Crécy-Lagard V, Babu M, Savchenko A, Joachimiak A, Edwards AM, Koonin EV, Yakunin AF. Functional Diversity of Haloacid Dehalogenase Superfamily Phosphatases from Saccharomyces cerevisiae: Biochemical, Structural, and Evolutionary Insights. J Biol Chem 2015; 290:18678-98. [PMID: 26071590 DOI: 10.1074/jbc.m115.657916] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 04/10/2015] [Indexed: 12/15/2022]
Abstract
The haloacid dehalogenase (HAD)-like enzymes comprise a large superfamily of phosphohydrolases present in all organisms. The Saccharomyces cerevisiae genome encodes at least 19 soluble HADs, including 10 uncharacterized proteins. Here, we biochemically characterized 13 yeast phosphatases from the HAD superfamily, which includes both specific and promiscuous enzymes active against various phosphorylated metabolites and peptides with several HADs implicated in detoxification of phosphorylated compounds and pseudouridine. The crystal structures of four yeast HADs provided insight into their active sites, whereas the structure of the YKR070W dimer in complex with substrate revealed a composite substrate-binding site. Although the S. cerevisiae and Escherichia coli HADs share low sequence similarities, the comparison of their substrate profiles revealed seven phosphatases with common preferred substrates. The cluster of secondary substrates supporting significant activity of both S. cerevisiae and E. coli HADs includes 28 common metabolites that appear to represent the pool of potential activities for the evolution of novel HAD phosphatases. Evolution of novel substrate specificities of HAD phosphatases shows no strict correlation with sequence divergence. Thus, evolution of the HAD superfamily combines the conservation of the overall substrate pool and the substrate profiles of some enzymes with remarkable biochemical and structural flexibility of other superfamily members.
Affiliation(s)
- Ekaterina Kuznetsova
  - Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada
- Boguslaw Nocek
  - Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
- Greg Brown
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Kira S Makarova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Robert Flick
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Yuri I Wolf
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Anna Khusnutdinova
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Elena Evdokimova
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Ke Jin
  - Department of Biochemistry, Research and Innovation Centre, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
- Kemin Tan
  - Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
- Andrew D Hanson
  - Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
- Ghulam Hasnain
  - Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
- Rémi Zallot
  - Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
- Valérie de Crécy-Lagard
  - Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
- Mohan Babu
  - Department of Biochemistry, Research and Innovation Centre, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
- Alexei Savchenko
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
- Andrzej Joachimiak
  - Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
- Aled M Edwards
  - Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada
  - Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
- Eugene V Koonin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
- Alexander F Yakunin
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
11
Stanberry L, Rekepalli B, Liu Y, Giblock P, Higdon R, Montague E, Broomall W, Kolker N, Kolker E. Optimizing high performance computing workflow for protein functional annotation. Concurrency and Computation: Practice & Experience 2014; 26:2112-2121. [PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]
Abstract
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
Affiliation(s)
- Larissa Stanberry
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Bhanu Rekepalli
- Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA
| | - Yuan Liu
- Joint Institute for Computational Sciences, University of Tennessee - Oak Ridge National Laboratory (JICS UT - ORNL), DELSA Global, Oak Ridge, TN, USA
| | | | - Roger Higdon
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Elizabeth Montague
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - William Broomall
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Natali Kolker
- Bioinformatics & High-Throughput Analysis Laboratory and High-Throughput Analysis Core, Seattle Children's Research Institute (SCRI), DELSA Global, Seattle, WA 98101, USA
| | - Eugene Kolker
- Bioinformatics & High-throughput Analysis Laboratory, SCRI, High-throughput Analysis Core, SCRI, Predictive Analytics, Seattle Children's Hospital, Departments of Pediatrics and Biomedical Informatics & Medical Education, University of Washington, DELSA Global
|
12
|
van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones D, Kim PM, Kriwacki R, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright P, Babu MM. Classification of intrinsically disordered regions and proteins. Chem Rev 2014; 114:6589-631. [PMID: 24773235 PMCID: PMC4095912 DOI: 10.1021/cr400525m] [Citation(s) in RCA: 1401] [Impact Index Per Article: 140.1] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2013] [Indexed: 12/11/2022]
Affiliation(s)
- Robin van der Lee
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, 6500 HB Nijmegen, The Netherlands
| | - Marija Buljan
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Benjamin Lang
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Robert J. Weatheritt
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
| | - Gary W. Daughdrill
- Department of Cell Biology, Microbiology, and Molecular Biology, University of South Florida, 3720 Spectrum Boulevard, Suite 321, Tampa, Florida 33612, United States
| | - A. Keith Dunker
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Monika Fuxreiter
- MTA-DE Momentum Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, H-4032 Debrecen, Nagyerdei krt 98, Hungary
| | - Julian Gough
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, United Kingdom
| | - Joerg Gsponer
- Department of Biochemistry and Molecular Biology, Centre for High-Throughput Biology, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
| | - David T. Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
| | - Philip M. Kim
- Terrence Donnelly Centre for Cellular and Biomolecular Research, Department of Molecular Genetics, and Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3E1, Canada
| | - Richard W. Kriwacki
- Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
| | - Christopher J. Oldfield
- Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
| | - Rohit V. Pappu
- Department of Biomedical Engineering and Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, United States
| | - Peter Tompa
- VIB Department of Structural Biology, Vrije Universiteit Brussel, Brussels, Belgium
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
| | - Vladimir N. Uversky
- Department of Molecular Medicine and USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida 33612, United States
- Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow Region, Russia
| | - Peter E. Wright
- Department of Integrative Structural and Computational Biology and Skaggs Institute of Chemical Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, United States
| | - M. Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
|
13
|
Carrera J, Estrela R, Luo J, Rai N, Tsoukalas A, Tagkopoulos I. An integrative, multi-scale, genome-wide model reveals the phenotypic landscape of Escherichia coli. Mol Syst Biol 2014; 10:735. [PMID: 24987114 PMCID: PMC4299492 DOI: 10.15252/msb.20145108] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Given the vast behavioral repertoire and biological complexity of even the simplest organisms, accurately predicting phenotypes in novel environments and unveiling their biological organization is a challenging endeavor. Here, we present an integrative modeling methodology that unifies under a common framework the various biological processes and their interactions across multiple layers. We trained this methodology on an extensive normalized compendium for the gram-negative bacterium Escherichia coli, which incorporates gene expression data for genetic and environmental perturbations, transcriptional regulation, signal transduction, and metabolic pathways, as well as growth measurements. Comparison with measured growth and high-throughput data demonstrates the enhanced ability of the integrative model to predict phenotypic outcomes in various environmental and genetic conditions, even in cases where their underlying functions are under-represented in the training set. This work paves the way toward integrative techniques that extract knowledge from a variety of biological data to achieve more than the sum of their parts in the context of prediction, analysis, and redesign of biological systems.
Affiliation(s)
- Javier Carrera
- UC Davis Genome Center, University of California, Davis, CA, USA
| | - Raissa Estrela
- Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA
| | - Jing Luo
- UC Davis Genome Center, University of California, Davis, CA, USA
| | - Navneet Rai
- UC Davis Genome Center, University of California, Davis, CA, USA
| | - Athanasios Tsoukalas
- UC Davis Genome Center, University of California, Davis, CA, USA; Department of Computer Science, University of California, Davis, CA, USA
| | - Ilias Tagkopoulos
- UC Davis Genome Center, University of California, Davis, CA, USA; Department of Computer Science, University of California, Davis, CA, USA
|
14
|
de Crécy-Lagard V. Variations in metabolic pathways create challenges for automated metabolic reconstructions: Examples from the tetrahydrofolate synthesis pathway. Comput Struct Biotechnol J 2014; 10:41-50. [PMID: 25210598 PMCID: PMC4151868 DOI: 10.1016/j.csbj.2014.05.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/03/2022] Open
Abstract
The availability of thousands of sequenced genomes has revealed the diversity of biochemical solutions to similar chemical problems. Even for molecules at the heart of metabolism, such as cofactors, the pathway enzymes first discovered in model organisms like Escherichia coli or Saccharomyces cerevisiae are often not universally conserved. Tetrahydrofolate (THF) (or its close relative tetrahydromethanopterin) is a universal and essential C1-carrier that most microbes and plants synthesize de novo. The THF biosynthesis pathway and enzymes are, however, not universal and alternate solutions are found for most steps, making this pathway a challenge to annotate automatically in many genomes. Comparing THF pathway reconstructions and functional annotations of a chosen set of folate synthesis genes in specific prokaryotes revealed the strengths and weaknesses of different microbial annotation platforms. This analysis revealed that most current platforms fail in metabolic reconstruction of variant pathways. However, all the pieces are in place to quickly correct these deficiencies if the different databases were built on each other's strengths.
Affiliation(s)
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science and Genetics Institute, University of Florida, Gainesville, FL, United States
|
15
|
Rapid identification of sequences for orphan enzymes to power accurate protein annotation. PLoS One 2013; 8:e84508. [PMID: 24386392 PMCID: PMC3875567 DOI: 10.1371/journal.pone.0084508] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2013] [Accepted: 11/21/2013] [Indexed: 11/19/2022] Open
Abstract
The power of genome sequencing depends on the ability to understand what those genes and their protein products actually do. The automated methods used to assign functions to putative proteins in newly sequenced organisms are limited by the size of our library of proteins with both known function and sequence. Unfortunately, this library grows slowly, lagging well behind the rapid increase in novel protein sequences produced by modern genome sequencing methods. One potential source for rapidly expanding this functional library is the "back catalog" of enzymology: "orphan enzymes," those enzymes that have been characterized and yet lack any associated sequence. There are hundreds of orphan enzymes in the Enzyme Commission (EC) database alone. In this study, we demonstrate how this orphan enzyme "back catalog" is a fertile source for rapidly advancing the state of protein annotation. Starting from three orphan enzyme samples, we applied mass spectrometry-based analysis and computational methods (including sequence similarity networks, sequence and structural alignments, and operon context analysis) to rapidly identify the specific sequence for each orphan while avoiding the most time- and labor-intensive aspects of typical sequence identifications. We then used these three new sequences to more accurately predict the catalytic function of 385 previously uncharacterized or misannotated proteins. We expect that this kind of rapid sequence identification could be efficiently applied on a larger scale to make enzymology's "back catalog" another powerful tool to drive accurate genome annotation.
|
16
|
Liberal R, Pinney JW. Simple topological properties predict functional misannotations in a metabolic network. Bioinformatics 2013; 29:i154-61. [PMID: 23812979 PMCID: PMC3694667 DOI: 10.1093/bioinformatics/btt236] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/21/2022] Open
Abstract
Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead end or disconnected reactions, can, therefore, be strong indications of misannotation. Results: We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of in most cases when validated against recent genome-scale metabolic models. We also observe that, when the model is applied to draft metabolic networks for multiple species, there is a clear negative correlation between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes). Contact: j.pinney@imperial.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Rodrigo Liberal
- Department of Life Sciences and Centre for Integrative Systems Biology and Bioinformatics, Imperial College London, London SW7 2AZ, UK
|
17
|
Comparative genomics approaches to understanding and manipulating plant metabolism. Curr Opin Biotechnol 2013; 24:278-84. [DOI: 10.1016/j.copbio.2012.07.005] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2012] [Revised: 07/29/2012] [Accepted: 07/30/2012] [Indexed: 12/11/2022]
|
18
|
Blais EM, Chavali AK, Papin JA. Linking genome-scale metabolic modeling and genome annotation. Methods Mol Biol 2013; 985:61-83. [PMID: 23417799 DOI: 10.1007/978-1-62703-299-5_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Genome-scale metabolic network reconstructions, assembled from annotated genomes, serve as a platform for integrating data from heterogeneous sources and generating hypotheses for further experimental validation. Implementing constraint-based modeling techniques such as flux balance analysis (FBA) on network reconstructions allows for interrogating metabolism at a systems level, which aids in identifying and rectifying gaps in knowledge. With genome sequences for various organisms from prokaryotes to eukaryotes becoming increasingly available, a significant bottleneck lies in the structural and functional annotation of these sequences. Using topologically based and biologically inspired metabolic network refinement, we can better characterize enzymatic functions present in an organism and link annotation of these functions to candidate transcripts; both steps can be experimentally validated.
Affiliation(s)
- Edik M Blais
- Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
|
19
|
Structural analysis of hypothetical proteins from Helicobacter pylori: an approach to estimate functions of unknown or hypothetical proteins. Int J Mol Sci 2012; 13:7109-7137. [PMID: 22837682 PMCID: PMC3397514 DOI: 10.3390/ijms13067109] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2012] [Revised: 05/29/2012] [Accepted: 06/01/2012] [Indexed: 12/12/2022] Open
Abstract
Helicobacter pylori (H. pylori) has a unique ability to survive in extremely acidic environments and to colonize the gastric mucosa. It can cause diverse gastric diseases such as peptic ulcers, chronic gastritis, mucosa-associated lymphoid tissue (MALT) lymphoma, and gastric cancer. Based on genomic research of H. pylori, over 1600 genes have been functionally identified so far. However, H. pylori possesses some genes that are uncharacterized since: (i) the gene sequences are quite new; (ii) the functions of the genes have not been characterized in any other bacterial systems; and (iii) sometimes, a protein that is classified as a known protein based on sequence homology shows some functional ambiguity, which raises questions about the function of the protein produced in H. pylori. Thus, there are still many genes to be biologically or biochemically characterized to understand the whole picture of gene functions in the bacterium. In this regard, knowledge of the 3D structure of a protein, especially an unknown or hypothetical protein, is frequently useful to elucidate the structure-function relationship of the uncharacterized gene product. That is, a structural comparison with known proteins provides valuable information to help predict the cellular functions of hypothetical proteins. Here, we show the 3D structures of some hypothetical proteins determined by NMR spectroscopy and X-ray crystallography as a part of the structural genomics of H. pylori. In addition, we show some successful approaches to elucidating the function of unknown proteins based on their structural information.
|
20
|
Jaeger S, Aloy P. From protein interaction networks to novel therapeutic strategies. IUBMB Life 2012; 64:529-37. [DOI: 10.1002/iub.1040] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2012] [Accepted: 03/14/2012] [Indexed: 01/18/2023]
|
21
|
Seaver SMD, Henry CS, Hanson AD. Frontiers in metabolic reconstruction and modeling of plant genomes. JOURNAL OF EXPERIMENTAL BOTANY 2012; 63:2247-58. [PMID: 22238452 DOI: 10.1093/jxb/err371] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
A major goal of post-genomic biology is to reconstruct and model in silico the metabolic networks of entire organisms. Work on bacteria is well advanced, and is now under way for plants and other eukaryotes. Genome-scale modelling in plants is much more challenging than in bacteria. The challenges come from features characteristic of higher organisms (subcellular compartmentation, tissue differentiation) and also from the particular severity in plants of a general problem: genome content whose functions remain undiscovered. This problem results in thousands of genes for which no function is known ('undiscovered genome content') and hundreds of enzymatic and transport functions for which no gene is yet identified. The severity of the undiscovered genome content problem in plants reflects their genome size and complexity. To bring the challenges of plant genome-scale modelling into focus, we first summarize the current status of plant genome-scale models. We then highlight the challenges - and ways to address them - in three areas: identifying genes for missing processes, modelling tissues as opposed to single cells, and finding metabolic functions encoded by undiscovered genome content. We also discuss the emerging view that a significant fraction of undiscovered genome content encodes functions that counter damage to metabolites inflicted by spontaneous chemical reactions or enzymatic mistakes.
Affiliation(s)
- Samuel M D Seaver
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
|
22
|
Acebo P, Martin-Galiano AJ, Navarro S, Zaballos Á, Amblar M. Identification of 88 regulatory small RNAs in the TIGR4 strain of the human pathogen Streptococcus pneumoniae. RNA (NEW YORK, N.Y.) 2012; 18:530-546. [PMID: 22274957 PMCID: PMC3285940 DOI: 10.1261/rna.027359.111] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/18/2011] [Accepted: 12/02/2011] [Indexed: 05/31/2023]
Abstract
Streptococcus pneumoniae is the main etiological agent of community-acquired pneumonia and a major cause of mortality and morbidity among children and the elderly. Genome sequencing of several pneumococcal strains revealed valuable information about the potential proteins and genetic diversity of this prevalent human pathogen. However, little is known about its transcriptional regulation and its small regulatory noncoding RNAs. In this study, we performed deep sequencing of the S. pneumoniae TIGR4 strain RNome to identify small regulatory RNA candidates expressed in this pathogen. We discovered 1047 potential small RNAs including intragenic, 5'- and/or 3'-overlapping RNAs and 88 small RNAs encoded in intergenic regions. With this approach, we recovered many of the previously identified intergenic small RNAs and identified 68 novel candidates, most of which are conserved in both sequence and genomic context in other S. pneumoniae strains. We confirmed the independent expression of 17 intergenic small RNAs and predicted putative mRNA targets for six of them using bioinformatics tools. Preliminary results suggest that one of these six is a key player in the regulation of competence development. This study is the biggest catalog of small noncoding RNAs reported to date in S. pneumoniae and provides a highly complete view of the small RNA network in this pathogen.
Affiliation(s)
- Paloma Acebo
- Unidad de Patología Molecular del Neumococo, Centro Nacional de Microbiología, Instituto de Salud Carlos III, 28220 Majadahonda, Madrid, Spain
| | - Antonio J. Martin-Galiano
- Unidad de Genética Bacteriana, Centro Nacional de Microbiología, Instituto de Salud Carlos III, 28220 Majadahonda, Madrid, Spain
| | - Sara Navarro
- Unidad de Patología Molecular del Neumococo, Centro Nacional de Microbiología, Instituto de Salud Carlos III, 28220 Majadahonda, Madrid, Spain
- CIBER Enfermedades Respiratorias, 07110 Bunyola, Mallorca, Spain
| | - Ángel Zaballos
- Unidad de Genómica, Centro Nacional de Microbiología, Instituto de Salud Carlos III, 28220 Majadahonda, Madrid, Spain
| | - Mónica Amblar
- Unidad de Patología Molecular del Neumococo, Centro Nacional de Microbiología, Instituto de Salud Carlos III, 28220 Majadahonda, Madrid, Spain
- CIBER Enfermedades Respiratorias, 07110 Bunyola, Mallorca, Spain
|
23
|
Gerlt JA, Babbitt PC, Jacobson MP, Almo SC. Divergent evolution in enolase superfamily: strategies for assigning functions. J Biol Chem 2011; 287:29-34. [PMID: 22069326 DOI: 10.1074/jbc.r111.240945] [Citation(s) in RCA: 111] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022] Open
Abstract
Nature's strategies for evolving catalytic functions can be deciphered from the information contained in the rapidly expanding protein sequence databases. However, the functions of many proteins in the protein sequence and structure databases are either uncertain (too divergent to assign function based on homology) or unknown (no homologs), thereby limiting the utility of the databases. The mechanistically diverse enolase superfamily is a paradigm for understanding the structural bases for evolution of enzymatic function. We describe strategies for assigning functions to members of the enolase superfamily that should be applicable to other superfamilies.
Affiliation(s)
- John A Gerlt
- Departments of Biochemistry and Chemistry and The Institute for Genomic Biology, University of Illinois, Urbana, Illinois 61801.
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94143
| | - Matthew P Jacobson
- Pharmaceutical Chemistry, School of Pharmacy, University of California, San Francisco, California 94143
| | - Steven C Almo
- Department of Biochemistry, Albert Einstein College of Medicine, Yeshiva University, Bronx, New York 10461
|
24
|
Brown SD, Babbitt PC. Inference of functional properties from large-scale analysis of enzyme superfamilies. J Biol Chem 2011; 287:35-42. [PMID: 22069325 DOI: 10.1074/jbc.r111.283408] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
As increasingly large amounts of data from genome and other sequencing projects become available, new approaches are needed to determine the functions of the proteins these genes encode. We show how large-scale computational analysis can help to address this challenge by linking functional information to sequence and structural similarities using protein similarity networks. Network analyses using three functionally diverse enzyme superfamilies illustrate the use of these approaches for facile updating and comparison of available structures for a large superfamily, for creation of functional hypotheses for metagenomic sequences, and to summarize the limits of our functional knowledge about even well studied superfamilies.
Affiliation(s)
- Shoshana D Brown
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158-2330
| | - Patricia C Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, 94158-2330; Pharmaceutical Chemistry, School of Pharmacy; California Institute for Quantitative Biosciences, University of California, San Francisco, California 94158-2330.
|
25
|
Pribat A, Blaby IK, Lara-Núñez A, Jeanguenin L, Fouquet R, Frelin O, Gregory JF, Philmus B, Begley TP, de Crécy-Lagard V, Hanson AD. A 5-formyltetrahydrofolate cycloligase paralog from all domains of life: comparative genomic and experimental evidence for a cryptic role in thiamin metabolism. Funct Integr Genomics 2011; 11:467-78. [PMID: 21538139 PMCID: PMC6078417 DOI: 10.1007/s10142-011-0224-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2011] [Revised: 03/19/2011] [Accepted: 04/03/2011] [Indexed: 12/18/2022]
Abstract
A paralog (here termed COG0212) of the ATP-dependent folate salvage enzyme 5-formyltetrahydrofolate cycloligase (5-FCL) occurs in all domains of life and, although typically annotated as 5-FCL in pro- and eukaryotic genomes, is of unknown function. COG0212 is similar in overall structure to 5-FCL, particularly in the substrate binding region, and has distant similarity to other kinases. The Arabidopsis thaliana COG0212 protein was shown to be targeted to chloroplasts and to be required for embryo viability. Comparative genomic analysis revealed that a high proportion (19%) of archaeal and bacterial COG0212 genes are clustered on the chromosome with various genes implicated in thiamin metabolism or transport but showed no such association between COG0212 and folate metabolism. Consistent with the bioinformatic evidence for a role in thiamin metabolism, ablating COG0212 in the archaeon Haloferax volcanii caused accumulation of thiamin monophosphate. Biochemical and functional complementation tests of several known and hypothetical thiamin-related activities (involving thiamin, its breakdown products, and their phosphates) were, however, negative. Also consistent with the bioinformatic evidence, the COG0212 proteins from A. thaliana and prokaryote sources lacked 5-FCL activity in vitro and did not complement the growth defect or the characteristic 5-formyltetrahydrofolate accumulation of a 5-FCL-deficient (ΔygfA) Escherichia coli strain. We therefore propose (a) that COG0212 has an unrecognized yet sometimes crucial role in thiamin metabolism, most probably in salvage or detoxification, and (b) that it is not a 5-FCL and should no longer be so annotated.
Affiliation(s)
- Anne Pribat
- Horticultural Sciences Department, University of Florida, Gainesville, 32611, USA
|
26
|
Shortridge MD, Triplet T, Revesz P, Griep MA, Powers R. Bacterial protein structures reveal phylum dependent divergence. Comput Biol Chem 2011; 35:24-33. [PMID: 21315656 DOI: 10.1016/j.compbiolchem.2010.12.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2010] [Revised: 12/28/2010] [Accepted: 12/29/2010] [Indexed: 01/26/2023]
Abstract
Protein sequence space is vast compared to protein fold space. This raises important questions about how structures adapt to evolutionary changes in protein sequences. A growing trend is to regard protein fold space as a continuum rather than a series of discrete structures. From this perspective, homologous protein structures within the same functional classification should reveal a constant rate of structural drift relative to sequence changes. The clusters of orthologous groups (COG) classification system was used to annotate homologous bacterial protein structures in the Protein Data Bank (PDB). The structures and sequences of proteins within each COG were compared against each other to establish their relatedness. As expected, the analysis demonstrates a sharp structural divergence between the bacterial phyla Firmicutes and Proteobacteria. Additionally, each COG had a distinct sequence/structure relationship, indicating that different evolutionary pressures affect the degree of structural divergence. However, our analysis also shows the relative drift rate between sequence identity and structure divergence remains constant.
Affiliation(s)
- Matthew D Shortridge
- Department of Chemistry, University of Nebraska-Lincoln, 68588-0304, United States
|
27
|
Renuse S, Chaerkady R, Pandey A. Proteogenomics. Proteomics 2011; 11:620-30. [DOI: 10.1002/pmic.201000615] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Received: 09/29/2010] [Revised: 11/14/2010] [Accepted: 11/16/2010] [Indexed: 12/13/2022]
|
28
|
Jaeger S, Sers CT, Leser U. Combining modularity, conservation, and interactions of proteins significantly increases precision and coverage of protein function prediction. BMC Genomics 2010; 11:717. [PMID: 21171995 PMCID: PMC3017542 DOI: 10.1186/1471-2164-11-717] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 04/06/2010] [Accepted: 12/20/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND While the number of newly sequenced genomes and genes is constantly increasing, elucidation of their function still is a laborious and time-consuming task. This has led to the development of a wide range of methods for predicting protein functions in silico. We report on a new method that predicts function based on a combination of information about protein interactions, orthology, and the conservation of protein networks in different species. RESULTS We show that aggregation of these independent sources of evidence leads to a drastic increase in number and quality of predictions when compared to baselines and other methods reported in the literature. For instance, our method generates more than 12,000 novel protein functions for human with an estimated precision of ~76%, among which are 7,500 new functional annotations for 1,973 human proteins that previously had zero or only one function annotated. We also verified our predictions on a set of genes that play an important role in colorectal cancer (MLH1, PMS2, EPHB4) and could confirm more than 73% of them based on evidence in the literature. CONCLUSIONS The combination of different methods into a single, comprehensive prediction method infers thousands of protein functions for every species included in the analysis at varying, yet always high levels of precision and very good coverage.
Affiliation(s)
- Samira Jaeger
- Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.
|
29
|
Warren AS, Archuleta J, Feng WC, Setubal JC. Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics 2010; 11:131. [PMID: 20230630 PMCID: PMC3098052 DOI: 10.1186/1471-2105-11-131] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Received: 09/01/2009] [Accepted: 03/15/2010] [Indexed: 12/04/2022] Open
Abstract
Background Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore the question arises as to whether current genome annotations have systematically missing, small genes. Results We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. Conclusions Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
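The ORF enumeration step described in this abstract (all ORFs of at least 33 aa) can be illustrated with a minimal sketch. It is not the authors' high-performance pipeline: the ORF definition used here (an ATG start followed in frame by the first standard stop codon), the function name `find_orfs`, and the coordinate convention are assumptions made for illustration only.

```python
# Minimal sketch: enumerate ORFs of at least `min_aa` codons on both strands.
# Assumption: an ORF runs from the first in-frame ATG to the next in-frame
# stop codon of the standard genetic code; the cited study may define ORFs differently.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=33):
    """Return (strand, start, end, aa_length) tuples.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (for "-" they index the reverse-complemented sequence).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    aa_len = (i - start) // 3  # codons before the stop codon
                    if aa_len >= min_aa:
                        orfs.append((strand, start, i + 3, aa_len))
                    start = None
    return orfs
```

Calling `find_orfs(sequence)` on each replicon would give the candidate set that a comparison step could then filter by conservation; that comparison is outside the scope of this sketch.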
Affiliation(s)
- Andrew S Warren
- Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA, USA.
|
30
|
'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering parts list--and how to find it. Biochem J 2009; 425:1-11. [PMID: 20001958 DOI: 10.1042/bj20091328] [Citation(s) in RCA: 135] [Impact Index Per Article: 9.0] [Indexed: 11/17/2022]
Abstract
Like other forms of engineering, metabolic engineering requires knowledge of the components (the 'parts list') of the target system. Lack of such knowledge impairs both rational engineering design and diagnosis of the reasons for failures; it also poses problems for the related field of metabolic reconstruction, which uses a cell's parts list to recreate its metabolic activities in silico. Despite spectacular progress in genome sequencing, the parts lists for most organisms that we seek to manipulate remain highly incomplete, due to the dual problem of 'unknown' proteins and 'orphan' enzymes. The former are all the proteins deduced from genome sequence that have no known function, and the latter are all the enzymes described in the literature (and often catalogued in the EC database) for which no corresponding gene has been reported. Unknown proteins constitute up to about half of the proteins in prokaryotic genomes, and much more than this in higher plants and animals. Orphan enzymes make up more than a third of the EC database. Attacking the 'missing parts list' problem is accordingly one of the great challenges for post-genomic biology, and a tremendous opportunity to discover new facets of life's machinery. Success will require a co-ordinated community-wide attack, sustained over years. In this attack, comparative genomics is probably the single most effective strategy, for it can reliably predict functions for unknown proteins and genes for orphan enzymes. Furthermore, it is cost-efficient and increasingly straightforward to deploy owing to a proliferation of databases and associated tools.
|
31
|
Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 2009; 5:e1000605. [PMID: 20011109 PMCID: PMC2781113 DOI: 10.1371/journal.pcbi.1000605] [Citation(s) in RCA: 465] [Impact Index Per Article: 31.0] [Received: 05/12/2009] [Accepted: 11/09/2009] [Indexed: 12/13/2022] Open
Abstract
Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%-63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with "overprediction" of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
Affiliation(s)
- Alexandra M. Schnoes
- Graduate Group in Biophysics, University of California San Francisco, San Francisco, California, United States of America
- Shoshana D. Brown
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Igor Dodevski
- Department of Biochemistry, University of Zürich, Zürich, Switzerland
- Patricia C. Babbitt
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, California, United States of America
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, United States of America
- California Institute for Quantitative Biosciences, University of California San Francisco, San Francisco, California, United States of America
|
32
|
Louie B, Higdon R, Kolker E. A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions. PLoS One 2009; 4:e7546. [PMID: 19844580 PMCID: PMC2760442 DOI: 10.1371/journal.pone.0007546] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 09/01/2009] [Accepted: 09/13/2009] [Indexed: 12/02/2022] Open
Abstract
Background Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, current approaches for predicting function must be improved upon. One problem in particular is overly-specific function predictions which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity. Methodology Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity. Significance Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value >1e−62, non-redundant NCBI protein database (NRDB)) and only small likelihood of specific function match for proteins with low sequence similarity (bit score <54.6, e-value <1e−05, NRDB). For sequence similarity ranges in between our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function, and predicted functions for thousands of these proteins ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) compared to predictions from our model that is based on proteins with experimentally determined function.
Affiliation(s)
- Brenton Louie
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Roger Higdon
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Eugene Kolker
- Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington, United States of America
- Predictive Analytics, Seattle Children's Hospital, University of Washington School of Medicine, Seattle, Washington, United States of America
- Biomedical and Health Informatics Division, Department of Medical Education and Biomedical Informatics, University of Washington School of Medicine, Seattle, Washington, United States of America
|
33
|
Liu S, Lee H, Kang PS, Huang X, Yim JH, Lee HK, Kim IC. Complementary DNA library construction and expressed sequence tag analysis of an Arctic moss, Aulacomnium turgidum. Polar Biol 2009. [DOI: 10.1007/s00300-009-0737-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
|
34
|
Manichaikul A, Ghamsari L, Hom EFY, Lin C, Murray RR, Chang RL, Balaji S, Hao T, Shen Y, Chavali AK, Thiele I, Yang X, Fan C, Mello E, Hill DE, Vidal M, Salehi-Ashtiani K, Papin JA. Metabolic network analysis integrated with transcript verification for sequenced genomes. Nat Methods 2009; 6:589-92. [PMID: 19597503 DOI: 10.1038/nmeth.1348] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Received: 04/29/2009] [Accepted: 06/17/2009] [Indexed: 01/02/2023]
Abstract
With sequencing of thousands of organisms completed or in progress, there is a growing need to integrate gene prediction with metabolic network analysis. Using Chlamydomonas reinhardtii as a model, we describe a systems-level methodology bridging metabolic network reconstruction with experimental verification of enzyme encoding open reading frames. Our quantitative and predictive metabolic model and its associated cloned open reading frames provide useful resources for metabolic engineering.
Affiliation(s)
- Ani Manichaikul
- Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, USA
|
35
|
Surmeli D, Ratmann O, Mewes HW, Tetko IV. FunCat functional inference with belief propagation and feature integration. Comput Biol Chem 2008; 32:375-7. [DOI: 10.1016/j.compbiolchem.2008.06.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/20/2007] [Revised: 06/03/2008] [Accepted: 06/22/2008] [Indexed: 11/26/2022]
|
36
|
Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talón M, Dopazo J, Conesa A. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 2008; 36:3420-35. [PMID: 18445632 PMCID: PMC2425479 DOI: 10.1093/nar/gkn176] [Citation(s) in RCA: 2896] [Impact Index Per Article: 181.0] [Indexed: 12/14/2022] Open
Abstract
Functional genomics technologies have been widely adopted in the biological research of both model and non-model species. An efficient functional annotation of DNA or protein sequences is a major requirement for the successful application of these approaches as functional information on gene products is often the key to the interpretation of experimental results. Therefore, there is an increasing need for bioinformatics resources which are able to cope with large amount of sequence data, produce valuable annotation results and are easily accessible to laboratories where functional genomics projects are being undertaken. We present the Blast2GO suite as an integrated and biologist-oriented solution for the high-throughput and automatic functional annotation of DNA or protein sequences based on the Gene Ontology vocabulary. The most outstanding Blast2GO features are: (i) the combination of various annotation strategies and tools controlling type and intensity of annotation, (ii) the numerous graphical features such as the interactive GO-graph visualization for gene-set function profiling or descriptive charts, (iii) the general sequence management features and (iv) high-throughput capabilities. We used the Blast2GO framework to carry out a detailed analysis of annotation behaviour through homology transfer and its impact in functional genomics research. Our aim is to offer biologists useful information to take into account when addressing the task of functionally characterizing their sequence data.
Affiliation(s)
- Stefan Götz
- Bioinformatics Department, Centro de Investigación Principe Felipe, Valencia, Spain
|
37
|
Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008; 24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat.
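The annotation-transfer step described in this abstract (score every target/prototype pair, keep the highest-scoring pair, and copy the prototype's functional categories to the target) can be sketched as follows. This is an illustration under stated assumptions, not the FUNAT implementation: the pair scores stand in for the output of a trained model, and the function name, the data layout, and the example category codes are hypothetical.

```python
# Minimal sketch of best-pair annotation transfer.
# Assumptions: `pair_scores` maps (target_id, prototype_id) to a score that a
# trained model would have predicted; `prototype_categories` maps each
# prototype to its set of functional category labels. All names are hypothetical.


def transfer_annotation(target_id, pair_scores, prototype_categories):
    """Return (best_prototype_id, score, categories) for one target protein."""
    candidates = {
        proto: score
        for (tgt, proto), score in pair_scores.items()
        if tgt == target_id and proto in prototype_categories
    }
    if not candidates:
        return None, None, set()
    # Pick the prototype whose pair with the target has the highest score.
    best = max(candidates, key=candidates.get)
    return best, candidates[best], set(prototype_categories[best])


# Hypothetical example data (category codes are placeholders).
pair_scores = {
    ("t1", "p1"): 0.42,
    ("t1", "p2"): 0.91,
    ("t2", "p1"): 0.10,
}
prototype_categories = {"p1": {"01.05"}, "p2": {"02.10", "20.01"}}

best_id, best_score, categories = transfer_annotation(
    "t1", pair_scores, prototype_categories
)
```

Returning the score alongside the categories mirrors the abstract's point that predicted scores can be used to separate reliable from non-reliable annotations, for example by applying a threshold chosen on validation data.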
Affiliation(s)
- Igor V Tetko
- Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH), Institute of Bioinformatics and Systems Biology, Neuherberg, Germany.
|
38
|
Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer KFX, Münsterkötter M, Ruepp A, Spannagl M, Stümpflen V, Rattei T. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res 2007; 36:D196-201. [PMID: 18158298 PMCID: PMC2238900 DOI: 10.1093/nar/gkm980] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Indexed: 01/11/2023] Open
Abstract
The Munich Information Center for Protein Sequences (MIPS-GSF, Neuherberg, Germany) combines automatic processing of large amounts of sequences with manual annotation of selected model genomes. Due to the massive growth of the available data, the depth of annotation varies widely between independent databases. Also, the criteria for the transfer of information from known to orthologous sequences are diverse. Coping with the task of global in-depth genome annotation has become unfeasible. Therefore, our efforts are dedicated to three levels of annotation: (i) the curation of selected genomes, in particular from fungal and plant taxa (e.g. CYGD, MNCDB, MatDB), (ii) the comprehensive, consistent, automatic annotation employing exhaustive methods for the computation of sequence similarities and sequence-related attributes as well as the classification of individual sequences (SIMAP, PEDANT and FunCat) and (iii) the compilation of manually curated databases for protein interactions based on scrutinized information from the literature to serve as an accepted set of reliable annotated interaction data (MPACT, MPPI, CORUM). All databases and tools described as well as the detailed descriptions of our projects can be accessed through the MIPS web server (http://mips.gsf.de).
Affiliation(s)
- H W Mewes
- Institute for Bioinformatics (MIPS), German Research Center for Environmental Health, Ingolstaedter Landstrasse 1, D-85764 Neuherberg, Germany
|