Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Frishman D. Protein annotation at genomic scale: the current status. Chem Rev 2007;107:3448-66. [PMID: 17658902 DOI: 10.1021/cr068303k] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.2] [Indexed: 12/31/2022]


Cited by Other Article(s)

Du G, Wu J, Zhang C, Cao X, Li L, He J, Zhang Y, Shang Y. The whole genomic analysis of the Orf virus strains ORFV-SC and ORFV-SC1 from the Sichuan province and their weak pathological response in rabbits. Funct Integr Genomics 2023;23:163. [PMID: 37188892 DOI: 10.1007/s10142-023-01079-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2023] [Revised: 04/27/2023] [Accepted: 04/28/2023] [Indexed: 05/17/2023]

Poudel S, Cope AL, O'Dell KB, Guss AM, Seo H, Trinh CT, Hettich RL. Identification and characterization of proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 strains as potential genetic engineering targets. BIOTECHNOLOGY FOR BIOFUELS 2021;14:116. [PMID: 33971924 PMCID: PMC8112048 DOI: 10.1186/s13068-021-01964-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/29/2020] [Accepted: 04/26/2021] [Indexed: 05/13/2023]

Frederick J, Hennessy F, Horn U, de la Torre Cortés P, van den Broek M, Strych U, Willson R, Hefer CA, Daran JMG, Sewell T, Otten LG, Brady D. The complete genome sequence of the nitrile biocatalyst Rhodocccus rhodochrous ATCC BAA-870. BMC Genomics 2020;21:3. [PMID: 31898479 PMCID: PMC6941271 DOI: 10.1186/s12864-019-6405-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/16/2019] [Accepted: 12/16/2019] [Indexed: 12/21/2022]

Abstract

BACKGROUND

Rhodococci are industrially important soil-dwelling Gram-positive bacteria that are well known for both nitrile hydrolysis and oxidative metabolism of aromatics. Rhodococcus rhodochrous ATCC BAA-870 is capable of metabolising a wide range of aliphatic and aromatic nitriles and amides. The genome of the organism was sequenced and analysed in order to better understand this whole cell biocatalyst.

RESULTS

The genome of R. rhodochrous ATCC BAA-870 is the first Rhodococcus genome fully sequenced using Nanopore sequencing. The circular genome contains 5.9 megabase pairs (Mbp) and includes a 0.53 Mbp linear plasmid, that together encode 7548 predicted protein sequences according to BASys annotation, and 5535 predicted protein sequences according to RAST annotation. The genome contains numerous oxidoreductases, 15 identified antibiotic and secondary metabolite gene clusters, several terpene and nonribosomal peptide synthetase clusters, as well as 6 putative clusters of unknown type. The 0.53 Mbp plasmid encodes 677 predicted genes and contains the nitrile converting gene cluster, including a nitrilase, a low molecular weight nitrile hydratase, and an enantioselective amidase. Although there are fewer biotechnologically relevant enzymes compared to those found in rhodococci with larger genomes, such as the well-known Rhodococcus jostii RHA1, the abundance of transporters in combination with the myriad of enzymes found in strain BAA-870 might make it more suitable for use in industrially relevant processes than other rhodococci.

CONCLUSIONS

The sequence and comprehensive description of the R. rhodochrous ATCC BAA-870 genome will facilitate the additional exploitation of rhodococci for biotechnological applications, as well as enable further characterisation of this model organism. The genome encodes a wide range of enzymes, many with unknown substrate specificities supporting potential applications in biotechnology, including nitrilases, nitrile hydratase, monooxygenases, cytochrome P450s, reductases, proteases, lipases, and transaminases.


Affiliation(s)

Joni Frederick Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa Electron Microscope Unit, University of Cape Town, Rondebosch, 7701 South Africa Present Address: LadHyx, UMR CNRS 7646, École Polytechnique, 91128 Palaiseau, France
Fritha Hennessy Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa
Uli Horn Meraka, CSIR, Meiring Naude Road, Brummeria, 0091 South Africa
Pilar de la Torre Cortés Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
Marcel van den Broek Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
Ulrich Strych Biology and Biochemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA Present Address: Department of Pediatrics, Section of Tropical Medicine, Baylor College of Medicine, 1102 Bates Avenue, Houston, TX 77030 USA
Richard Willson Biology and Biochemistry, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA Chemical and Biomolecular Engineering, University of Houston, 4800 Calhoun Road, Houston, TX 77204 USA
Charles A. Hefer Bioinformatics and Computational Biology Unit, Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria, 0002 South Africa Present Address: AgResearch Limited, Lincoln Research Centre, Private Bag 4749, Christchurch, 8140 New Zealand
Jean-Marc G. Daran Industrial Microbiology, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
Trevor Sewell Electron Microscope Unit, University of Cape Town, Rondebosch, 7701 South Africa
Linda G. Otten Biocatalysis, Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629 HZ Delft, The Netherlands
Dean Brady Protein Technologies, CSIR Biosciences, Meiring Naude Road, Brummeria, Pretoria, South Africa Molecular Sciences Institute, School of Chemistry, University of the Witwatersrand, PO, Wits, 2050 South Africa


Komárek J, Ivanov Kavková E, Houser J, Horáčková A, Ždánská J, Demo G, Wimmerová M. Structure and properties of AB21, a novel Agaricus bisporus protein with structural relation to bacterial pore-forming toxins. Proteins 2018;86:897-911. [DOI: 10.1002/prot.25522] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/15/2018] [Revised: 04/23/2018] [Accepted: 04/26/2018] [Indexed: 12/13/2022]

Bouadjenek MR, Verspoor K, Zobel J. Literature consistency of bioinformatics sequence databases is effective for assessing record quality. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2017;2017:3074790. [PMID: 28365737 PMCID: PMC5467556 DOI: 10.1093/database/bax021] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Received: 11/01/2016] [Accepted: 02/20/2017] [Indexed: 11/18/2022]

Abstract

Bioinformatics sequence databases such as GenBank or UniProt contain hundreds of millions of records of genomic data. These records are derived from direct submissions from individual laboratories, as well as from bulk submissions from large-scale sequencing centres; their diversity and scale mean that they suffer from a range of data quality issues including errors, discrepancies, redundancies, ambiguities, incompleteness and inconsistencies with the published literature. In this work, we seek to investigate and analyze the data quality of sequence databases from the perspective of a curator, who must detect anomalous and suspicious records. Specifically, we emphasize the detection of inconsistent records with respect to the literature. Focusing on GenBank, we propose a set of 24 quality indicators, which are based on treating a record as a query into the published literature, and then use query quality predictors. We then carry out an analysis that shows that the proposed quality indicators and the quality of the records have a mutual relationship, in which one depends on the other. We propose to represent record-literature consistency as a vector of these quality indicators. By reducing the dimensionality of this representation for visualization purposes using principal component analysis, we show that records which have been reported as inconsistent with the literature fall roughly in the same area, and therefore share similar characteristics. By manually analyzing records not previously known to be erroneous that fall in the same area as records known to be inconsistent, we show that one record out of four is inconsistent with respect to the literature. This high density of inconsistent records opens the way towards the development of automatic methods for the detection of faulty records. We conclude that literature inconsistency is a meaningful strategy for identifying suspicious records.

Database URL: https://github.com/rbouadjenek/DQBioinformatics


New chemistry from natural product biosynthesis. Biochem Soc Trans 2016;44:738-44. [DOI: 10.1042/bst20160063] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 02/15/2016] [Indexed: 11/17/2022]

Kumar G, Johnson JL, Frantom PA. Improving Functional Annotation in the DRE-TIM Metallolyase Superfamily through Identification of Active Site Fingerprints. Biochemistry 2016;55:1863-72. [PMID: 26935545 DOI: 10.1021/acs.biochem.5b01193] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]

Abstract

Within the DRE-TIM metallolyase superfamily, members of the Claisen-like condensation (CC-like) subgroup catalyze C-C bond-forming reactions between various α-ketoacids and acetyl-coenzyme A. These reactions are important in the metabolic pathways of many bacterial pathogens and serve as engineering scaffolds for the production of long-chain alcohol biofuels. To improve functional annotation and identify sequences that might use novel substrates in the CC-like subgroup, a combination of structural modeling and multiple-sequence alignments identified active site residues on the third, fourth, and fifth β-strands of the TIM-barrel catalytic domain that are differentially conserved within the substrate-diverse enzyme families. Using α-isopropylmalate synthase and citramalate synthase from Methanococcus jannaschii (MjIPMS and MjCMS), site-directed mutagenesis was used to test the role of each identified position in substrate selectivity. Kinetic data suggest that residues at the β3-5 and β4-7 positions play a significant role in the selection of α-ketoisovalerate over pyruvate in MjIPMS. However, complementary substitutions in MjCMS fail to alter substrate specificity, suggesting residues in these positions do not contribute to substrate selectivity in this enzyme. Analysis of the kinetic data with respect to a protein similarity network for the CC-like subgroup suggests that evolutionarily distinct forms of IPMS utilize residues at the β3-5 and β4-7 positions to affect substrate selectivity while the different versions of CMS use unique architectures. Importantly, mapping the identities of residues at the β3-5 and β4-7 positions onto the protein similarity network allows for rapid annotation of probable IPMS enzymes as well as several outlier sequences that may represent novel functions in the subgroup.


Neuhaus K, Landstorfer R, Fellner L, Simon S, Schafferhans A, Goldberg T, Marx H, Ozoline ON, Rost B, Kuster B, Keim DA, Scherer S. Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC). BMC Genomics 2016;17:133. [PMID: 26911138 PMCID: PMC4765031 DOI: 10.1186/s12864-016-2456-1] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/27/2015] [Accepted: 02/09/2016] [Indexed: 12/30/2022]

Abstract

Background

Genomes of E. coli, including that of the human pathogen Escherichia coli O157:H7 (EHEC) EDL933, still harbor undetected protein-coding genes which, apparently, have escaped annotation due to their small size and non-essential function. To find such genes, global gene expression of EHEC EDL933 was examined, using strand-specific RNAseq (transcriptome), ribosomal footprinting (translatome) and mass spectrometry (proteome).

Results

Using the above methods, 72 short, non-annotated protein-coding genes were detected. All of these showed signals in the ribosomal footprinting assay indicating mRNA translation. Seven were verified by mass spectrometry. Fifty-seven genes are annotated in other enterobacteriaceae, mainly as hypothetical genes; the remaining 15 genes constitute novel discoveries. In addition, protein structure and function were predicted computationally and compared between EHEC-encoded proteins and 100-times randomly shuffled proteins. Based on this comparison, 61 of the 72 novel proteins exhibit predicted structural and functional features similar to those of annotated proteins. Many of the novel genes show differential transcription when grown under eleven diverse growth conditions suggesting environmental regulation. Three genes were found to confer a phenotype in previous studies, e.g., decreased cattle colonization.
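The comparison against 100 randomly shuffled proteins described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the feature used here (longest run of hydrophobic residues), the hydrophobic residue set, the example sequence, and the function names are all assumptions chosen for illustration. Any order-dependent predictor could be swapped in; a composition-only feature would be useless here, because shuffling preserves composition.

```python
import random

# Assumed hydrophobic residue set; purely illustrative.
HYDROPHOBIC = set("AVILMFWC")


def longest_hydrophobic_run(seq):
    """Order-dependent toy feature: length of the longest hydrophobic stretch."""
    best = run = 0
    for aa in seq:
        run = run + 1 if aa in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def shuffled_baseline(seq, feature, n_shuffles=100, seed=0):
    """Score a sequence and n_shuffles shuffled copies of it.

    Returns the observed score, the list of null scores, and an empirical
    p-value (fraction of shuffles scoring at least as high, with the usual
    +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = feature(seq)
    residues = list(seq)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, randomized order
        null_scores.append(feature("".join(residues)))
    at_least = sum(1 for score in null_scores if score >= observed)
    p_value = (at_least + 1) / (n_shuffles + 1)
    return observed, null_scores, p_value


# Hypothetical short protein sequence, not taken from the study.
example = "MKKLLVVAAILLGSDEKRTN"
observed, null_scores, p_value = shuffled_baseline(example, longest_hydrophobic_run)
print(observed, max(null_scores), round(p_value, 3))
```

A low empirical p-value would suggest that the feature depends on residue order rather than on composition alone, which is the kind of evidence the shuffled comparison is meant to provide.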

Conclusions

These findings demonstrate that ribosomal footprinting can be used to detect novel protein coding genes, contributing to the growing body of evidence that hypothetical genes are not annotation artifacts and opening an additional way to study their functionality. All 72 genes are taxonomically restricted and, therefore, appear to have evolved relatively recently de novo.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-016-2456-1) contains supplementary material, which is available to authorized users.


Affiliation(s)

Klaus Neuhaus Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
Richard Landstorfer Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
Lea Fellner Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.
Svenja Simon Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
Andrea Schafferhans Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
Tatyana Goldberg Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
Harald Marx Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany.
Olga N Ozoline Institute of Cell Biophysics, Russian Academy of Sciences, Moscow Region, 142290, Pushchino, Russia.
Burkhard Rost Department of Informatics - Bioinformatics & TUM-IAS, Technische Universität München, Boltzmannstraße 3, 85748, Garching, Germany.
Bernhard Kuster Chair of Proteomics and Bioanalytics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Emil-Erlenmeyer-Forum 5, 85354, Freising, Germany. Bavarian Center for Biomolecular Mass Spectrometry (BayBioMS), Technische Universität München, Gregor-Mendel-Str. 4, 85354, Freising, Germany.
Daniel A Keim Lehrstuhl für Datenanalyse und Visualisierung, Fachbereich Informatik und Informationswissenschaft, Universität Konstanz, Box 78, 78457, Konstanz, Germany.
Siegfried Scherer Lehrstuhl für Mikrobielle Ökologie, Zentralinstitut für Ernährungs- und Lebensmittelforschung, Wissenschaftszentrum Weihenstephan, Technische Universität München, Weihenstephaner Berg 3, 85354, Freising, Germany.


Andrews FH, Horton JD, Shin D, Yoon HJ, Logsdon MG, Malik AM, Rogers MP, Kneen MM, Suh SW, McLeish MJ. The kinetic characterization and X-ray structure of a putative benzoylformate decarboxylase from M. smegmatis highlights the difficulties in the functional annotation of ThDP-dependent enzymes. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2015;1854:1001-9. [DOI: 10.1016/j.bbapap.2015.04.027] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 12/30/2014] [Revised: 04/05/2015] [Accepted: 04/23/2015] [Indexed: 10/23/2022]

Kuznetsova E, Nocek B, Brown G, Makarova KS, Flick R, Wolf YI, Khusnutdinova A, Evdokimova E, Jin K, Tan K, Hanson AD, Hasnain G, Zallot R, de Crécy-Lagard V, Babu M, Savchenko A, Joachimiak A, Edwards AM, Koonin EV, Yakunin AF. Functional Diversity of Haloacid Dehalogenase Superfamily Phosphatases from Saccharomyces cerevisiae: BIOCHEMICAL, STRUCTURAL, AND EVOLUTIONARY INSIGHTS. J Biol Chem 2015;290:18678-98. [PMID: 26071590 DOI: 10.1074/jbc.m115.657916] [Citation(s) in RCA: 66] [Impact Index Per Article: 7.3] [Received: 04/10/2015] [Indexed: 12/15/2022]

Affiliation(s)

Ekaterina Kuznetsova Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada
Boguslaw Nocek Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Greg Brown Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
Kira S Makarova National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
Robert Flick Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
Yuri I Wolf National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
Anna Khusnutdinova Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
Elena Evdokimova Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
Ke Jin Department of Biochemistry, Research and Innovation Centre, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
Kemin Tan Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Andrew D Hanson Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
Ghulam Hasnain Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
Rémi Zallot Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
Valérie de Crécy-Lagard Horticultural Sciences Department, Department of Microbiology and Cell Science, University of Florida, Gainesville, Florida 32611
Mohan Babu Department of Biochemistry, Research and Innovation Centre, University of Regina, Regina, Saskatchewan S4S 0A2, Canada
Alexei Savchenko Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada
Andrzej Joachimiak Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Aled M Edwards Structural Genomics Consortium, University of Toronto, Toronto, Ontario M5G 1L7, Canada; Midwest Center for Structural Genomics and Structural Biology Center, Biosciences Division, Argonne National Laboratory, Argonne, Illinois 60439
Eugene V Koonin National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894
Alexander F Yakunin Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, Ontario M5S 3E5, Canada


Stanberry L, Rekepalli B, Liu Y, Giblock P, Higdon R, Montague E, Broomall W, Kolker N, Kolker E. Optimizing high performance computing workflow for protein functional annotation. CONCURRENCY AND COMPUTATION: PRACTICE & EXPERIENCE 2014;26:2112-2121. [PMID: 25313296 PMCID: PMC4194055 DOI: 10.1002/cpe.3264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/04/2023]

van der Lee R, Buljan M, Lang B, Weatheritt RJ, Daughdrill GW, Dunker AK, Fuxreiter M, Gough J, Gsponer J, Jones D, Kim PM, Kriwacki R, Oldfield CJ, Pappu RV, Tompa P, Uversky VN, Wright P, Babu MM. Classification of intrinsically disordered regions and proteins. Chem Rev 2014;114:6589-631. [PMID: 24773235 PMCID: PMC4095912 DOI: 10.1021/cr400525m] [Citation(s) in RCA: 1401] [Impact Index Per Article: 140.1] [Received: 09/23/2013] [Indexed: 12/11/2022]

Affiliation(s)

Robin van der Lee MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, 6500 HB Nijmegen, The Netherlands
Marija Buljan MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
Benjamin Lang MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
Robert J. Weatheritt MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom
Gary W. Daughdrill Department of Cell Biology, Microbiology, and Molecular Biology, University of South Florida, 3720 Spectrum Boulevard, Suite 321, Tampa, Florida 33612, United States
A. Keith Dunker Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
Monika Fuxreiter MTA-DE Momentum Laboratory of Protein Dynamics, Department of Biochemistry and Molecular Biology, University of Debrecen, H-4032 Debrecen, Nagyerdei krt 98, Hungary
Julian Gough Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, United Kingdom
Joerg Gsponer Department of Biochemistry and Molecular Biology, Centre for High-Throughput Biology, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
David T. Jones Bioinformatics Group, Department of Computer Science, University College London, London, WC1E 6BT, United Kingdom
Philip M. Kim Terrence Donnelly Centre for Cellular and Biomolecular Research, Department of Molecular Genetics, and Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3E1, Canada
Richard W. Kriwacki Department of Structural Biology, St. Jude Children’s Research Hospital, Memphis, Tennessee 38105, United States
Christopher J. Oldfield Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana 46202, United States
Rohit V. Pappu Department of Biomedical Engineering and Center for Biological Systems Engineering, Washington University in St. Louis, St. Louis, Missouri 63130, United States
Peter Tompa VIB Department of Structural Biology, Vrije Universiteit Brussel, Brussels, Belgium Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, Budapest, Hungary
Vladimir N. Uversky Department of Molecular Medicine and USF Health Byrd Alzheimer’s Research Institute, Morsani College of Medicine, University of South Florida, Tampa, Florida 33612, United States Institute for Biological Instrumentation, Russian Academy of Sciences, Pushchino, Moscow Region, Russia
Peter E. Wright Department of Integrative Structural and Computational Biology and Skaggs Institute of Chemical Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, United States
M. Madan Babu MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, United Kingdom


Carrera J, Estrela R, Luo J, Rai N, Tsoukalas A, Tagkopoulos I. An integrative, multi-scale, genome-wide model reveals the phenotypic landscape of Escherichia coli. Mol Syst Biol 2014;10:735. [PMID: 24987114 PMCID: PMC4299492 DOI: 10.15252/msb.20145108] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.9] [Indexed: 11/10/2022]

de Crécy-Lagard V. Variations in metabolic pathways create challenges for automated metabolic reconstructions: Examples from the tetrahydrofolate synthesis pathway. Comput Struct Biotechnol J 2014;10:41-50. [PMID: 25210598 PMCID: PMC4151868 DOI: 10.1016/j.csbj.2014.05.008] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 11/03/2022]

Rapid identification of sequences for orphan enzymes to power accurate protein annotation. PLoS One 2013;8:e84508. [PMID: 24386392 PMCID: PMC3875567 DOI: 10.1371/journal.pone.0084508] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 08/20/2013] [Accepted: 11/21/2013] [Indexed: 11/19/2022]

Liberal R, Pinney JW. Simple topological properties predict functional misannotations in a metabolic network. Bioinformatics 2013;29:i154-61. [PMID: 23812979 PMCID: PMC3694667 DOI: 10.1093/bioinformatics/btt236] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/21/2022]

Abstract

Motivation: Misannotation in sequence databases is an important obstacle for automated tools for gene function annotation, which rely extensively on comparison with sequences with known function. To improve current annotations and prevent future propagation of errors, sequence-independent tools are, therefore, needed to assist in the identification of misannotated gene products. In the case of enzymatic functions, each functional assignment implies the existence of a reaction within the organism’s metabolic network; a first approximation to a genome-scale metabolic model can be obtained directly from an automated genome annotation. Any obvious problems in the network, such as dead end or disconnected reactions, can, therefore, be strong indications of misannotation.
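The idea that dead-end reactions can point to misannotation can be sketched in a few lines. The example below is only an illustration of that topological check, not the random forest classifier the abstract goes on to describe. The toy network, the metabolite and reaction names, the boundary set, and the function names are hypothetical, and every reaction is treated as irreversible for simplicity.

```python
def find_dead_end_metabolites(reactions, boundary=frozenset()):
    """Return metabolites that are only produced or only consumed.

    reactions maps a reaction name to (substrates, products).
    boundary lists metabolites assumed to be exchanged with the
    environment, so they are exempt from the check.
    """
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed.update(substrates)
        produced.update(products)
    dead_ends = {}
    for metabolite in produced | consumed:
        if metabolite in boundary:
            continue
        if metabolite not in produced:
            dead_ends[metabolite] = "never produced"
        elif metabolite not in consumed:
            dead_ends[metabolite] = "never consumed"
    return dead_ends


def flag_suspect_reactions(reactions, boundary=frozenset()):
    """List reactions touching a dead-end metabolite (candidates for review)."""
    dead_ends = find_dead_end_metabolites(reactions, boundary)
    return sorted(
        name
        for name, (substrates, products) in reactions.items()
        if any(m in dead_ends for m in (*substrates, *products))
    )


# Hypothetical draft network derived from an automated annotation.
toy_network = {
    "R1": (["glucose"], ["g6p"]),
    "R2": (["g6p"], ["f6p"]),
    "R3": (["f6p"], ["pyruvate"]),
    "R4": (["x_orphan"], ["y_orphan"]),  # disconnected from the rest
}
toy_boundary = frozenset({"glucose", "pyruvate"})

print(find_dead_end_metabolites(toy_network, toy_boundary))
print(flag_suspect_reactions(toy_network, toy_boundary))
```

In this toy case only R4 is flagged, because neither of its metabolites connects to the rest of the network. A flag of this kind is a prompt for manual review of the underlying enzyme assignment rather than proof of an error.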

Results: We demonstrate that a machine-learning approach using only network topological features can successfully predict the validity of enzyme annotations. The predictions are tested at three different levels. A random forest using topological features of the metabolic network and trained on curated sets of correct and incorrect enzyme assignments was found to have an accuracy of up to 86% in 5-fold cross-validation experiments. Further cross-validation against unseen enzyme superfamilies indicates that this classifier can successfully extrapolate beyond the classes of enzyme present in the training data. The random forest model was applied to several automated genome annotations, achieving an accuracy of in most cases when validated against recent genome-scale metabolic models. We also observe that when applied to draft metabolic networks for multiple species, a clear negative correlation is observed between predicted annotation quality and phylogenetic distance to the major model organism for biochemistry (Escherichia coli for prokaryotes and Homo sapiens for eukaryotes).

Contact: j.pinney@imperial.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.


Comparative genomics approaches to understanding and manipulating plant metabolism. Curr Opin Biotechnol 2013;24:278-84. [DOI: 10.1016/j.copbio.2012.07.005] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 07/18/2012] [Revised: 07/29/2012] [Accepted: 07/30/2012] [Indexed: 12/11/2022]

Blais EM, Chavali AK, Papin JA. Linking genome-scale metabolic modeling and genome annotation. Methods Mol Biol 2013;985:61-83. [PMID: 23417799 DOI: 10.1007/978-1-62703-299-5_4] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 12/13/2022]

Structural analysis of hypothetical proteins from Helicobacter pylori: an approach to estimate functions of unknown or hypothetical proteins. Int J Mol Sci 2012;13:7109-7137. [PMID: 22837682 PMCID: PMC3397514 DOI: 10.3390/ijms13067109] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 03/09/2012] [Revised: 05/29/2012] [Accepted: 06/01/2012] [Indexed: 12/12/2022]

Jaeger S, Aloy P. From protein interaction networks to novel therapeutic strategies. IUBMB Life 2012;64:529-37. [DOI: 10.1002/iub.1040] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 02/13/2012] [Accepted: 03/14/2012] [Indexed: 01/18/2023]

Seaver SMD, Henry CS, Hanson AD. Frontiers in metabolic reconstruction and modeling of plant genomes. JOURNAL OF EXPERIMENTAL BOTANY 2012;63:2247-58. [PMID: 22238452 DOI: 10.1093/jxb/err371] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 05/20/2023]

Acebo P, Martin-Galiano AJ, Navarro S, Zaballos Á, Amblar M. Identification of 88 regulatory small RNAs in the TIGR4 strain of the human pathogen Streptococcus pneumoniae. RNA (NEW YORK, N.Y.) 2012;18:530-546. [PMID: 22274957 PMCID: PMC3285940 DOI: 10.1261/rna.027359.111] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Received: 03/18/2011] [Accepted: 12/02/2011] [Indexed: 05/31/2023]

Gerlt JA, Babbitt PC, Jacobson MP, Almo SC. Divergent evolution in enolase superfamily: strategies for assigning functions. J Biol Chem 2011;287:29-34. [PMID: 22069326 DOI: 10.1074/jbc.r111.240945] [Citation(s) in RCA: 111] [Impact Index Per Article: 8.5] [Indexed: 11/06/2022]

Brown SD, Babbitt PC. Inference of functional properties from large-scale analysis of enzyme superfamilies. J Biol Chem 2011;287:35-42. [PMID: 22069325 DOI: 10.1074/jbc.r111.283408] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.2] [Indexed: 01/08/2023]

Pribat A, Blaby IK, Lara-Núñez A, Jeanguenin L, Fouquet R, Frelin O, Gregory JF, Philmus B, Begley TP, de Crécy-Lagard V, Hanson AD. A 5-formyltetrahydrofolate cycloligase paralog from all domains of life: comparative genomic and experimental evidence for a cryptic role in thiamin metabolism. Funct Integr Genomics 2011;11:467-78. [PMID: 21538139 PMCID: PMC6078417 DOI: 10.1007/s10142-011-0224-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 01/14/2011] [Revised: 03/19/2011] [Accepted: 04/03/2011] [Indexed: 12/18/2022]

Shortridge MD, Triplet T, Revesz P, Griep MA, Powers R. Bacterial protein structures reveal phylum dependent divergence. Comput Biol Chem 2011;35:24-33. [PMID: 21315656 DOI: 10.1016/j.compbiolchem.2010.12.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/27/2010] [Revised: 12/28/2010] [Accepted: 12/29/2010] [Indexed: 01/26/2023]

Renuse S, Chaerkady R, Pandey A. Proteogenomics. Proteomics 2011;11:620-30. [DOI: 10.1002/pmic.201000615] [Citation(s) in RCA: 106] [Impact Index Per Article: 8.2] [Received: 09/29/2010] [Revised: 11/14/2010] [Accepted: 11/16/2010] [Indexed: 12/13/2022]

Jaeger S, Sers CT, Leser U. Combining modularity, conservation, and interactions of proteins significantly increases precision and coverage of protein function prediction. BMC Genomics 2010;11:717. [PMID: 21171995 PMCID: PMC3017542 DOI: 10.1186/1471-2164-11-717] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 04/06/2010] [Accepted: 12/20/2010] [Indexed: 11/10/2022]

Warren AS, Archuleta J, Feng WC, Setubal JC. Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics 2010;11:131. [PMID: 20230630 PMCID: PMC3098052 DOI: 10.1186/1471-2105-11-131] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.3] [Received: 09/01/2009] [Accepted: 03/15/2010] [Indexed: 12/04/2022]

'Unknown' proteins and 'orphan' enzymes: the missing half of the engineering parts list--and how to find it. Biochem J 2009;425:1-11. [PMID: 20001958 DOI: 10.1042/bj20091328] [Citation(s) in RCA: 135] [Impact Index Per Article: 9.0] [Indexed: 11/17/2022]

Schnoes AM, Brown SD, Dodevski I, Babbitt PC. Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS Comput Biol 2009;5:e1000605. [PMID: 20011109 PMCID: PMC2781113 DOI: 10.1371/journal.pcbi.1000605] [Citation(s) in RCA: 465] [Impact Index Per Article: 31.0] [Received: 05/12/2009] [Accepted: 11/09/2009] [Indexed: 12/13/2022]

Louie B, Higdon R, Kolker E. A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions. PLoS One 2009;4:e7546. [PMID: 19844580 PMCID: PMC2760442 DOI: 10.1371/journal.pone.0007546] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 09/01/2009] [Accepted: 09/13/2009] [Indexed: 12/02/2022]

Abstract

Background

Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function also need improvement. One problem in particular is overly specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.

Methodology

Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity.

Significance

Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e⁻⁶², non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e⁻⁰⁵, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly specific) than predictions from our model, which is based on proteins with experimentally determined function.

Liu S, Lee H, Kang PS, Huang X, Yim JH, Lee HK, Kim IC. Complementary DNA library construction and expressed sequence tag analysis of an Arctic moss, Aulacomnium turgidum. Polar Biol 2009. [DOI: 10.1007/s00300-009-0737-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]

Manichaikul A, Ghamsari L, Hom EFY, Lin C, Murray RR, Chang RL, Balaji S, Hao T, Shen Y, Chavali AK, Thiele I, Yang X, Fan C, Mello E, Hill DE, Vidal M, Salehi-Ashtiani K, Papin JA. Metabolic network analysis integrated with transcript verification for sequenced genomes. Nat Methods 2009;6:589-92. [PMID: 19597503 DOI: 10.1038/nmeth.1348] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Received: 04/29/2009] [Accepted: 06/17/2009] [Indexed: 01/02/2023]

Surmeli D, Ratmann O, Mewes HW, Tetko IV. FunCat functional inference with belief propagation and feature integration. Comput Biol Chem 2008;32:375-7. [DOI: 10.1016/j.compbiolchem.2008.06.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 12/20/2007] [Revised: 06/03/2008] [Accepted: 06/22/2008] [Indexed: 11/26/2022]

Götz S, García-Gómez JM, Terol J, Williams TD, Nagaraj SH, Nueda MJ, Robles M, Talón M, Dopazo J, Conesa A. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Res 2008;36:3420-35. [PMID: 18445632 PMCID: PMC2425479 DOI: 10.1093/nar/gkn176] [Citation(s) in RCA: 2896] [Impact Index Per Article: 181.0] [Indexed: 12/14/2022]

Tetko IV, Rodchenkov IV, Walter MC, Rattei T, Mewes HW. Beyond the 'best' match: machine learning annotation of protein sequences by integration of different sources of information. Bioinformatics 2008;24:621-8. [PMID: 18174184 DOI: 10.1093/bioinformatics/btm633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/14/2022]

Mewes HW, Dietmann S, Frishman D, Gregory R, Mannhaupt G, Mayer KFX, Münsterkötter M, Ruepp A, Spannagl M, Stümpflen V, Rattei T. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res 2007;36:D196-201. [PMID: 18158298 PMCID: PMC2238900 DOI: 10.1093/nar/gkm980] [Citation(s) in RCA: 119] [Impact Index Per Article: 7.0] [Indexed: 01/11/2023]