1
Michael-Pitschaze T, Cohen N, Ofer D, Hoshen Y, Linial M. Detecting anomalous proteins using deep representations. NAR Genom Bioinform 2024; 6:lqae021. [PMID: 38486884 PMCID: PMC10939404 DOI: 10.1093/nargab/lqae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2023] [Revised: 11/17/2023] [Accepted: 02/23/2024] [Indexed: 03/17/2024] Open
Abstract
Many advances in biomedicine can be attributed to identifying unusual proteins and genes. Many of these proteins' unique properties were discovered by manual inspection, which is becoming infeasible at the scale of modern protein datasets. Here, we propose to tackle this challenge using anomaly detection methods that automatically identify unexpected properties. We adopt a state-of-the-art anomaly detection paradigm from computer vision, to highlight unusual proteins. We generate meaningful representations without labeled inputs, using pretrained deep neural network models. We apply these protein language models (pLM) to detect anomalies in function, phylogenetic families, and segmentation tasks. We compute protein anomaly scores to highlight human prion-like proteins, distinguish viral proteins from their host proteome, and mark non-classical ion/metal binding proteins and enzymes. Other tasks concern segmentation of protein sequences into folded and unstructured regions. We provide candidates for rare functionality (e.g. prion proteins). Additionally, we show the anomaly score is useful in 3D folding-related segmentation. Our novel method shows improved performance over strong baselines and has objectively high performance across a variety of tasks. We conclude that the combination of pLM and anomaly detection techniques is a valid method for discovering a range of global and local protein characteristics.
Affiliation(s)
- Tomer Michael-Pitschaze
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Niv Cohen
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Dan Ofer
  - Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- Yedid Hoshen
  - The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem, Israel
- Michal Linial
  - Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
2
Han SR, Park M, Kosaraju S, Lee J, Lee H, Lee JH, Oh TJ, Kang M. Evidential deep learning for trustworthy prediction of enzyme commission number. Brief Bioinform 2023; 25:bbad401. [PMID: 37991247 PMCID: PMC10664415 DOI: 10.1093/bib/bbad401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 09/25/2023] [Accepted: 10/19/2023] [Indexed: 11/23/2023] Open
Abstract
The rapid growth of uncharacterized enzymes and their functional diversity urge accurate and trustworthy computational functional annotation tools. However, current state-of-the-art models lack trustworthiness on the prediction of the multilabel classification problem with thousands of classes. Here, we demonstrate that a novel evidential deep learning model (named ECPICK) makes trustworthy predictions of enzyme commission (EC) numbers with data-driven domain-relevant evidence, which results in significantly enhanced predictive power and the capability to discover potential new motif sites. ECPICK learns complex sequential patterns of amino acids and their hierarchical structures from 20 million enzyme data. ECPICK identifies significant amino acids that contribute to the prediction without multiple sequence alignment. Our intensive assessment showed not only outstanding enhancement of predictive performance on the largest databases of Uniprot, Protein Data Bank (PDB) and Kyoto Encyclopedia of Genes and Genomes (KEGG), but also a capability to discover new motif sites in microorganisms. ECPICK is a reliable EC number prediction tool to identify protein functions of an increasing number of uncharacterized enzymes.
Affiliation(s)
- So-Ra Han
  - Department of Life Science and Biochemical Engineering, Sun Moon University, Asan, Republic of Korea
  - Bio Big Data-based Chungnam Smart Clean Research Leader Training Program, Sun Moon University, Asan, Republic of Korea
- Mingyu Park
  - Bio Big Data-based Chungnam Smart Clean Research Leader Training Program, Sun Moon University, Asan, Republic of Korea
  - Division of Computer Science and Engineering, Sun Moon University, Asan, Republic of Korea
- Sai Kosaraju
  - Department of Computer Science, University of Nevada, Las Vegas, NV, USA
- JeungMin Lee
  - Bio Big Data-based Chungnam Smart Clean Research Leader Training Program, Sun Moon University, Asan, Republic of Korea
  - Division of Computer Science and Engineering, Sun Moon University, Asan, Republic of Korea
- Hyun Lee
  - Bio Big Data-based Chungnam Smart Clean Research Leader Training Program, Sun Moon University, Asan, Republic of Korea
  - Division of Computer Science and Engineering, Sun Moon University, Asan, Republic of Korea
  - Genome-based BioIT Convergence Institute, Asan, Republic of Korea
- Jun Hyuck Lee
  - Research Unit of Cryogenic Novel Material, Korea Polar Research Institute, Incheon, Republic of Korea
- Tae-Jin Oh
  - Department of Life Science and Biochemical Engineering, Sun Moon University, Asan, Republic of Korea
  - Bio Big Data-based Chungnam Smart Clean Research Leader Training Program, Sun Moon University, Asan, Republic of Korea
  - Genome-based BioIT Convergence Institute, Asan, Republic of Korea
  - Department of Pharmaceutical Engineering and Biotechnology, Sun Moon University, Asan, Republic of Korea
- Mingon Kang
  - Department of Computer Science, University of Nevada, Las Vegas, NV, USA
3
Vezina B, Watts SC, Hawkey J, Cooper HB, Judd LM, Jenney AWJ, Monk JM, Holt KE, Wyres KL. Bactabolize is a tool for high-throughput generation of bacterial strain-specific metabolic models. eLife 2023; 12:RP87406. [PMID: 37815531 PMCID: PMC10564454 DOI: 10.7554/elife.87406] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 10/11/2023] Open
Abstract
Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale metabolic models are useful tools for studying the metabolic potential of individuals, and with the rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for comparative analysis. However, there exist few tools to construct strain-specific metabolic models at scale. Here, we describe Bactabolize, a reference-based tool which rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae, and a quality control framework for using draft genome assemblies as input for Bactabolize. The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 performed comparably to or better than the currently available automated approaches CarveMe and gapseq across 507 substrate and 2317 knockout mutant growth predictions. Novel draft genomes passing our systematically defined quality control criteria resulted in models with a high degree of completeness (≥99% genes and reactions captured compared to models derived from matched complete genomes) and high accuracy (mean 0.97, n=10). We anticipate the tools and framework described herein will facilitate large-scale metabolic modelling analyses that broaden our understanding of diversity within bacterial species and inform novel control strategies for priority pathogens.
Affiliation(s)
- Ben Vezina
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Stephen C Watts
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Jane Hawkey
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Helena B Cooper
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Louise M Judd
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Jonathan M Monk
  - Department of Bioengineering, University of California, San Diego, San Diego, United States
- Kathryn E Holt
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
  - Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, United Kingdom
- Kelly L Wyres
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
4
Robben M, Nasr MS, Das A, Veerla JP, Huber M, Jaworski J, Weidanz J, Luber J. Comparison of the Strengths and Weaknesses of Machine Learning Algorithms and Feature Selection on KEGG Database Microbial Gene Pathway Annotation and Its Effects on Reconstructed Network Topology. J Comput Biol 2023; 30:766-782. [PMID: 37437088 DOI: 10.1089/cmb.2022.0370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/14/2023] Open
Abstract
The development of tools for the annotation of genes from newly sequenced species has not evolved much from homologous alignment to prior annotated species. While the quality of gene annotations continues to decline as we sequence and assemble more evolutionarily distant gut microbiome species, machine learning presents a high-quality alternative to traditional techniques. In this study, we investigate the relative performance of common classical and nonclassical machine learning algorithms in the problem of gene annotation using human microbiome-associated species genes from the KEGG database. The majority of the ensemble, clustering, and deep learning algorithms that we investigated showed higher prediction accuracy than CD-Hit in predicting partial KEGG function. Motif-based, machine-learning methods of annotation in new species were faster and had higher precision-recall than methods of homologous alignment or orthologous gene clustering. Gradient boosted ensemble methods and neural networks also predicted higher connectivity in reconstructed KEGG pathways, finding twice as many new pathway interactions as BLAST alignment. The use of motif-based, machine-learning algorithms in annotation software will allow researchers to develop powerful tools to interact with bacterial microbiomes in ways previously unachievable through homologous sequence alignment alone.
Affiliation(s)
- Michael Robben
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
- Mohammad Sadegh Nasr
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
- Avishek Das
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
- Jai Prakash Veerla
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
- Manfred Huber
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
- Justyn Jaworski
  - Department of Bioengineering, University of Texas at Arlington, Arlington, Texas, USA
- Jon Weidanz
  - Department of Kinesiology, University of Texas at Arlington, Arlington, Texas, USA
- Jacob Luber
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
5
Rivera Pérez CA, Janz D, Schneider D, Daniel R, Polle A. Transcriptional Landscape of Ectomycorrhizal Fungi and Their Host Provides Insight into N Uptake from Forest Soil. mSystems 2022; 7:e0095721. [PMID: 35089084 PMCID: PMC8725588 DOI: 10.1128/msystems.00957-21] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/24/2021] [Accepted: 11/29/2021] [Indexed: 01/05/2023] Open
Abstract
Mineral nitrogen (N) is a major nutrient showing strong fluctuations in the environment due to anthropogenic activities. The acquisition and translocation of N to forest trees are achieved mainly by highly diverse ectomycorrhizal fungi (EMF) living in symbioses with their host roots. Here, we examined colonized root tips to characterize the entire root-associated fungal community by DNA metabarcoding-Illumina sequencing of the fungal internal transcribed spacer 2 (ITS2) molecular marker and used RNA sequencing to target metabolically active fungi and the plant transcriptome after N application. The study was conducted with beech (Fagus sylvatica L.), a dominant tree species in central Europe, grown in native forest soil. We demonstrate strong enrichment of 15N from nitrate or ammonium in the ectomycorrhizal roots by stable-isotope labeling. The relative abundance of the EMF members in the fungal community was correlated with their transcriptional abundances. The fungal metatranscriptome covered Kyoto Encyclopedia of Genes and Genomes (KEGG) and Eukaryotic Orthologous Groups (KOG) categories similar to those of model fungi and did not reveal significant changes related to N metabolization but revealed species-specific transcription patterns, supporting trait stability. In contrast to the resistance of the fungal metatranscriptome, the transcriptome of the host exhibited dedicated nitrate- or ammonium-responsive changes with the upregulation of transporters and enzymes required for nitrate reduction and a drastic enhancement of glutamine synthetase transcript levels, indicating the channeling of ammonium into the pathway for plant protein biosynthesis. Our results support that naturally assembled fungal communities living in association with the tree roots buffer nutritional signals in their own metabolism but do not shield plants from high environmental N levels. 
IMPORTANCE Although EMF are well known for their role in supporting tree N nutrition, the molecular mechanisms underlying N flux from the soil solution into the host through the ectomycorrhizal pathway remain widely unknown. Furthermore, ammonium and nitrate availability in the soil solution is subject to frequent oscillations that create a dynamic environment for the tree roots and associated microbes during N acquisition. Therefore, it is important to understand how root-associated mycobiomes and the tree roots handle these fluctuations. We studied the responses of the symbiotic partners by screening their transcriptomes after a sudden environmental flux of nitrate or ammonium. We show that the fungi and the host respond asynchronously, with the fungi displaying resistance to increased nitrate or ammonium and the host dynamically metabolizing the supplied N sources. This study provides insights into the molecular mechanisms of the symbiotic partners operating under N enrichment in a multidimensional symbiotic system.
Affiliation(s)
- Carmen Alicia Rivera Pérez
  - Forest Botany and Tree Physiology, Büsgen Institute, Georg-August University of Göttingen, Göttingen, Germany
- Dennis Janz
  - Forest Botany and Tree Physiology, Büsgen Institute, Georg-August University of Göttingen, Göttingen, Germany
- Dominik Schneider
  - Department of Genomic and Applied Microbiology, Institute of Microbiology and Genetics, Georg-August University of Göttingen, Göttingen, Germany
  - Göttingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August University of Göttingen, Göttingen, Germany
- Rolf Daniel
  - Department of Genomic and Applied Microbiology, Institute of Microbiology and Genetics, Georg-August University of Göttingen, Göttingen, Germany
  - Göttingen Genomics Laboratory, Institute of Microbiology and Genetics, Georg-August University of Göttingen, Göttingen, Germany
- Andrea Polle
  - Forest Botany and Tree Physiology, Büsgen Institute, Georg-August University of Göttingen, Göttingen, Germany
6
Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme class. PLoS Comput Biol 2021; 17:e1009446. [PMID: 34555022 PMCID: PMC8491902 DOI: 10.1371/journal.pcbi.1009446] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 03/17/2021] [Revised: 10/05/2021] [Accepted: 09/13/2021] [Indexed: 12/12/2022] Open
Abstract
Only a small fraction of genes deposited to databases have been experimentally characterised. The majority of proteins have their function assigned automatically, which can result in erroneous annotations. The reliability of current annotations in public databases is largely unknown; experimental attempts to validate the accuracy within individual enzyme classes are lacking. In this study we performed an overview of functional annotations to the BRENDA enzyme database. We first applied a high-throughput experimental platform to verify functional annotations to an enzyme class of S-2-hydroxyacid oxidases (EC 1.1.3.15). We chose 122 representative sequences of the class and screened them for their predicted function. Based on the experimental results, predicted domain architecture and similarity to previously characterised S-2-hydroxyacid oxidases, we inferred that at least 78% of sequences in the enzyme class are misannotated. We experimentally confirmed four alternative activities among the misannotated sequences and showed that misannotation in the enzyme class increased over time. Finally, we performed a computational analysis of annotations to all enzyme classes in the BRENDA database, and showed that nearly 18% of all sequences are annotated to an enzyme class while sharing no similarity or domain architecture to experimentally characterised representatives. We showed that even well-studied enzyme classes of industrial relevance are affected by the problem of functional misannotation.
Correct annotation of genomes is crucial for our understanding and utilization of functional gene diversity, yet the reliability of current protein annotations in public databases is largely unknown. In our work we validated annotations to an S-2-hydroxyacid oxidase enzyme class (EC 1.1.3.15) by assessing activity of 122 representative sequences in a high-throughput screening experiment. From this dataset we inferred that at least 78% of the sequences in the enzyme class are misannotated, and confirmed four alternative activities among the misannotated sequences. We showed that the misannotation is widespread throughout enzyme classes, affecting even well-studied classes of industrial relevance. Overall, our study highlights the value of experimental and computational validation of predicted functions within individual enzyme classes.
7
Caspi R, Billington R, Keseler IM, Kothari A, Krummenacker M, Midford PE, Ong WK, Paley S, Subhraveti P, Karp PD. The MetaCyc database of metabolic pathways and enzymes - a 2019 update. Nucleic Acids Res 2020; 48:D445-D453. [PMID: 31586394 PMCID: PMC6943030 DOI: 10.1093/nar/gkz862] [Citation(s) in RCA: 537] [Impact Index Per Article: 134.3] [Received: 09/10/2019] [Revised: 09/19/2019] [Accepted: 10/01/2019] [Indexed: 11/18/2022] Open
Abstract
MetaCyc (MetaCyc.org) is a comprehensive reference database of metabolic pathways and enzymes from all domains of life. It contains 2749 pathways derived from more than 60 000 publications, making it the largest curated collection of metabolic pathways. The data in MetaCyc are evidence-based and richly curated, resulting in an encyclopedic reference tool for metabolism. MetaCyc is also used as a knowledge base for generating thousands of organism-specific Pathway/Genome Databases (PGDBs), which are available in BioCyc.org and other genomic portals. This article provides an update on the developments in MetaCyc during September 2017 to August 2019, up to version 23.1. Some of the topics that received intensive curation during this period include cobamides biosynthesis, sterol metabolism, fatty acid biosynthesis, lipid metabolism, carotenoid metabolism, protein glycosylation, antibiotics and cytotoxins biosynthesis, siderophore biosynthesis, bioluminescence, vitamin K metabolism, brominated compound metabolism, plant secondary metabolism and human metabolism. Other additions include modifications to the GlycanBuilder software that enable displaying glycans using symbolic representation, improved graphics and fonts for web displays, improvements in the PathoLogic component of Pathway Tools, and the optional addition of regulatory information to pathway diagrams.
Affiliation(s)
- Ron Caspi
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Ingrid M Keseler
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Anamika Kothari
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Peter E Midford
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Wai Kit Ong
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Suzanne Paley
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
- Peter D Karp
  - SRI International, 333 Ravenswood Ave, Menlo Park, CA 94025, USA
8
Santiago CRDN, Assis RDAB, Moreira LM, Digiampietri LA. Gene Tags Assessment by Comparative Genomics (GTACG): A User-Friendly Framework for Bacterial Comparative Genomics. Front Genet 2019; 10:725. [PMID: 31507629 PMCID: PMC6718126 DOI: 10.3389/fgene.2019.00725] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/29/2019] [Accepted: 07/10/2019] [Indexed: 12/04/2022] Open
Abstract
Genomics research has produced an exponentially growing amount of data. However, the genetic knowledge pertaining to certain phenotypic characteristics is lacking. Also, a considerable part of these genomes have coding sequences (CDSs) with unknown functions, posing additional challenges to researchers. Phylogenetically close microorganisms share much of their CDSs, and certain phenotypes unique to a set of microorganisms may be the result of the genes found exclusively in those microorganisms. This study presents the GTACG framework, an easy-to-use tool that, for subgroups of bacterial genomes whose microorganisms share phenotypic characteristics, finds data that differentiates them from other associated genomes in a simple and fast way. The GTACG analysis is based on the formation of homologous CDS clusters from local alignments. The front-end is easy to use, and the installation packages have been developed to enable users lacking knowledge of programming languages or bioinformatics to analyze high-throughput data using the tool. The validation of the GTACG framework has been carried out based on a case report involving a set of 161 genomes from the Xanthomonadaceae family, in which 19 families of orthologous proteins were found in 90% of the plant-associated genomes, allowing the identification of the proteins potentially associated with adaptation and virulence in plant tissue. The results show the potential use of GTACG in the search for new targets for molecular studies, and GTACG can be used as a research tool by biologists who lack advanced knowledge in the use of computational tools for bacterial comparative genomics.
Affiliation(s)
- Renata de Almeida Barbosa Assis
  - Biotechnology Graduate Program, Núcleo de Pesquisas em Ciências Biológicas, Federal University of Ouro Preto, Ouro Preto, Brazil
- Leandro Marcio Moreira
  - Biotechnology Graduate Program, Núcleo de Pesquisas em Ciências Biológicas, Federal University of Ouro Preto, Ouro Preto, Brazil
  - Department of Biological Sciences, Federal University of Ouro Preto, Ouro Preto, Brazil
- Luciano Antonio Digiampietri
  - Bioinformatics Graduate Program, University of Sao Paulo, Sao Paulo, Brazil
  - School of Arts, Science, and Humanities, University of Sao Paulo, Sao Paulo, Brazil
9
Caspi R, Billington R, Fulcher CA, Keseler IM, Kothari A, Krummenacker M, Latendresse M, Midford PE, Ong Q, Ong WK, Paley S, Subhraveti P, Karp PD. The MetaCyc database of metabolic pathways and enzymes. Nucleic Acids Res 2018; 46:D633-D639. [PMID: 29059334 PMCID: PMC5753197 DOI: 10.1093/nar/gkx935] [Citation(s) in RCA: 526] [Impact Index Per Article: 105.2] [Received: 09/11/2017] [Accepted: 10/02/2017] [Indexed: 01/27/2023] Open
Abstract
MetaCyc (https://MetaCyc.org) is a comprehensive reference database of metabolic pathways and enzymes from all domains of life. It contains more than 2570 pathways derived from >54 000 publications, making it the largest curated collection of metabolic pathways. The data in MetaCyc is strictly evidence-based and richly curated, resulting in an encyclopedic reference tool for metabolism. MetaCyc is also used as a knowledge base for generating thousands of organism-specific Pathway/Genome Databases (PGDBs), which are available in the BioCyc (https://BioCyc.org) and other PGDB collections. This article provides an update on the developments in MetaCyc during the past two years, including the expansion of data and addition of new features.
Affiliation(s)
- Ron Caspi
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Carol A Fulcher
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Anamika Kothari
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Peter E Midford
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Quang Ong
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Wai Kit Ong
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Suzanne Paley
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
- Peter D Karp
  - SRI International, 333 Ravenswood, Menlo Park, CA 94025, USA
10
Cai Y, Yang H, Li W, Liu G, Lee PW, Tang Y. Multiclassification Prediction of Enzymatic Reactions for Oxidoreductases and Hydrolases Using Reaction Fingerprints and Machine Learning Methods. J Chem Inf Model 2018; 58:1169-1181. [PMID: 29733642 DOI: 10.1021/acs.jcim.7b00656] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/30/2022]
Abstract
Drug metabolism is a complex procedure in the human body, including a series of enzymatically catalyzed reactions. However, it is costly and time consuming to investigate drug metabolism experimentally; computational methods are hence developed to predict drug metabolism and have shown great advantages. As the first step, classification of metabolic reactions and enzymes is highly desirable for drug metabolism prediction. In this study, we developed multiclassification models for prediction of reaction types catalyzed by oxidoreductases and hydrolases, in which three reaction fingerprints were used to describe the reactions and seven machine learning algorithms were employed for model building. Data retrieved from KEGG containing 1055 hydrolysis and 2510 redox reactions were used to build the models, respectively. The external validation data consisted of 213 hydrolysis and 512 redox reactions extracted from the Rhea database. The best models were built by neural network or logistic regression with a 2048-bit transformation reaction fingerprint. The predictive accuracies of the main class, subclass, and superclass classification models on external validation sets were all above 90%. This study will be very helpful for enzymatic reaction annotation and further study on metabolism prediction.
Affiliation(s)
- Yingchun Cai
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Hongbin Yang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Weihua Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Guixia Liu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Philip W Lee
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Yun Tang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
11
Magnúsdóttir S, Thiele I. Modeling metabolism of the human gut microbiome. Curr Opin Biotechnol 2017; 51:90-96. [PMID: 29258014 DOI: 10.1016/j.copbio.2017.12.005] [Citation(s) in RCA: 77] [Impact Index Per Article: 11.0] [Received: 09/28/2017] [Revised: 12/05/2017] [Accepted: 12/06/2017] [Indexed: 12/15/2022]
Abstract
The human gut microbiome plays an important part in human health. The complexity of the microbiome makes it difficult to determine the detailed metabolic functions and the cross-talk that occurs between the individual species. In silico systems biology studies of the microbiome can help to identify metabolite exchanges among gut microbes. Constraint-based reconstruction and analysis methods use biochemically accurate genome-scale metabolic networks of microorganisms to simulate metabolism between species in a given microbiome and help generate novel hypotheses on microbial interactions. Here, we review metabolic modeling studies that have investigated metabolic functions of the gut microbiome.
Affiliation(s)
- Stefanía Magnúsdóttir
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
- Ines Thiele
  - Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
12
Ellens KW, Christian N, Singh C, Satagopam VP, May P, Linster CL. Confronting the catalytic dark matter encoded by sequenced genomes. Nucleic Acids Res 2017; 45:11495-11514. [PMID: 29059321 PMCID: PMC5714238 DOI: 10.1093/nar/gkx937] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 01/18/2017] [Accepted: 10/03/2017] [Indexed: 01/02/2023] Open
Abstract
The post-genomic era has provided researchers with a deluge of protein sequences. However, a significant fraction of the proteins encoded by sequenced genomes remains without an identified function. Here, we aim at determining how many enzymes of uncertain or unknown function are still present in the Saccharomyces cerevisiae and human proteomes. Using information available in the Swiss-Prot, BRENDA and KEGG databases in combination with a Hidden Markov Model-based method, we estimate that >600 yeast and 2000 human proteins (>30% of their proteins of unknown function) are enzymes whose precise function(s) remain(s) to be determined. This illustrates the impressive scale of the ‘unknown enzyme problem’. We extensively review classical biochemical as well as more recent systematic experimental and computational approaches that can be used to support enzyme function discovery research. Finally, we discuss the possible roles of the elusive catalysts in light of recent developments in the fields of enzymology and metabolism as well as the significance of the unknown enzyme problem in the context of metabolic modeling, metabolic engineering and rare disease research.
Affiliation(s)
- Kenneth W Ellens
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
- Nils Christian
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
- Charandeep Singh
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
- Venkata P Satagopam
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
- Patrick May
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg
- Carole L Linster
- Luxembourg Centre for Systems Biomedicine, University of Luxembourg, L-4362 Esch-sur-Alzette, Luxembourg

13
Tsui IF, Chari R, Buys TP, Lam WL. Public Databases and Software for the Pathway Analysis of Cancer Genomes. Cancer Inform 2017. [DOI: 10.1177/117693510700300027] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 12/13/2022] Open
Abstract
The study of pathway disruption is key to understanding cancer biology. Advances in high throughput technologies have led to the rapid accumulation of genomic data. The explosion in available data has generated opportunities for investigation of concerted changes that disrupt biological functions, which in turn has created a need for computational tools for pathway analysis. In this review, we discuss approaches to the analysis of genomic data and describe the publicly available resources for studying biological pathways.
Affiliation(s)
- Ivy F.L. Tsui
- Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, and Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Raj Chari
- Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, and Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Timon P.H. Buys
- Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, and Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada
- Wan L. Lam
- Cancer Genetics and Developmental Biology, British Columbia Cancer Research Centre, and Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, Canada

14
Tanner JJ. Empirical power laws for the radii of gyration of protein oligomers. Acta Crystallogr D Struct Biol 2016; 72:1119-1129. [PMID: 27710933 PMCID: PMC5053138 DOI: 10.1107/s2059798316013218] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 04/16/2016] [Accepted: 08/16/2016] [Indexed: 11/10/2022] Open
Abstract
The radius of gyration is a fundamental structural parameter that is particularly useful for describing polymers. It has been known since Flory's seminal work in the mid-20th century that polymers show a power-law dependence, where the radius of gyration is proportional to the number of residues raised to a power. The power-law exponent has been measured experimentally for denatured proteins and derived empirically for folded monomeric proteins using crystal structures. Here, the biological assemblies in the Protein Data Bank are surveyed to derive the power-law parameters for protein oligomers having degrees of oligomerization of 2-6 and 8. The power-law exponents for oligomers span a narrow range of 0.38-0.41, which is close to the value of 0.40 obtained for monomers. This result shows that protein oligomers exhibit essentially the same power-law behavior as monomers. A simple power-law formula is provided for estimating the oligomeric state from an experimental measurement of the radius of gyration. Several proteins in the Protein Data Bank are found to deviate substantially from power-law behavior by having an atypically large radius of gyration. Some of the outliers have highly elongated structures, such as coiled coils. For coiled coils, the radius of gyration does not follow a power law and instead scales linearly with the number of residues in the oligomer. Other outliers are proteins whose oligomeric state or quaternary structure is incorrectly annotated in the Protein Data Bank. The power laws could be used to identify such errors and help prevent them in future depositions.
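The power-law relationship described in this abstract, a radius of gyration proportional to the number of residues raised to an exponent, can be illustrated with a log-log least-squares fit. The following is only a minimal sketch and not the paper's procedure: the function names, the synthetic data, and the prefactor 2.0 are assumptions made for illustration; only the exponent value of 0.40 is taken from the abstract.

```python
import math


def fit_power_law(n_residues, radii):
    """Least-squares fit of log(Rg) = log(a) + nu * log(N); returns (a, nu)."""
    xs = [math.log(n) for n in n_residues]
    ys = [math.log(r) for r in radii]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    nu = sxy / sxx
    a = math.exp(mean_y - nu * mean_x)
    return a, nu


def relative_deviation(n, rg, a, nu):
    """Fractional excess of an observed Rg over the power-law prediction a * N**nu."""
    predicted = a * n ** nu
    return (rg - predicted) / predicted


# Synthetic data generated from Rg = 2.0 * N**0.40.
# The prefactor 2.0 is an illustrative assumption, not a value from the paper.
sizes = [100, 200, 400, 800, 1600]
radii = [2.0 * n ** 0.40 for n in sizes]
a_fit, nu_fit = fit_power_law(sizes, radii)
```

A structure whose `relative_deviation` is large and positive would be a candidate outlier of the kind the abstract mentions, such as an elongated coiled coil or an entry with a mis-annotated oligomeric state; any threshold for "large" would have to be chosen by the user.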
Affiliation(s)
- John J. Tanner
- Departments of Biochemistry and Chemistry, University of Missouri-Columbia, Columbia, MO 65211, USA

15
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022] Open
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Katherine J Harrison
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA

16
Boari de Lima E, Meira W, de Melo-Minardi RC. Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering. PLoS Comput Biol 2016; 12:e1005001. [PMID: 27348631 PMCID: PMC4922564 DOI: 10.1371/journal.pcbi.1005001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 11/30/2015] [Accepted: 05/22/2016] [Indexed: 01/14/2023] Open
Abstract
As increasingly more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods to determine protein functions quickly and reliably. We believe dividing a protein family into subtypes which share specific functions uncommon to the whole family reduces the function annotation problem's complexity. Hence, this work's purpose is to detect isofunctional subfamilies inside a family of unknown function, while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as functional similarity evidence. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Results showed our fully automated technique obtained better clusters than ASMC for two families and equivalent results for the other two, including one whose clusters were manually defined. Clusters produced by our framework showed great correspondence with the known subfamilies, besides being more contrasting than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, such residues were among those our technique considered most important to differentiate a given group. When run with the crotonase and enolase SFLD superfamilies, the results showed great agreement with this gold standard. Best results consistently involved multiple data types, thus confirming our hypothesis that similarities according to different knowledge domains may be used as functional similarity evidence.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; domain knowledge usage for detecting subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
Affiliation(s)
- Elisa Boari de Lima
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Wagner Meira
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil

17
Ferrarini MG, Siqueira FM, Mucha SG, Palama TL, Jobard É, Elena-Herrmann B, R Vasconcelos AT, Tardy F, Schrank IS, Zaha A, Sagot MF. Insights on the virulence of swine respiratory tract mycoplasmas through genome-scale metabolic modeling. BMC Genomics 2016; 17:353. [PMID: 27178561 PMCID: PMC4866288 DOI: 10.1186/s12864-016-2644-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/09/2016] [Accepted: 04/22/2016] [Indexed: 12/24/2022] Open
Abstract
Background: The respiratory tract of swine is colonized by several bacteria among which are three Mycoplasma species: Mycoplasma flocculare, Mycoplasma hyopneumoniae and Mycoplasma hyorhinis. While colonization by M. flocculare is virtually asymptomatic, M. hyopneumoniae is the causative agent of enzootic pneumonia and M. hyorhinis is present in cases of pneumonia, polyserositis and arthritis. The genomic resemblance among these three Mycoplasma species combined with their different levels of pathogenicity is an indication that they have unknown mechanisms of virulence and differential expression, as for most mycoplasmas.
Methods: In this work, we performed whole-genome metabolic network reconstructions for these three mycoplasmas. Cultivation tests and metabolomic experiments through nuclear magnetic resonance spectroscopy (NMR) were also performed to acquire experimental data and further refine the models reconstructed in silico.
Results: Even though the refined models have similar metabolic capabilities, interesting differences include a wider range of carbohydrate uptake in M. hyorhinis, which in turn may also explain why this species is a widespread contaminant in cell cultures. In addition, the myo-inositol catabolism is exclusive to M. hyopneumoniae and may be an important trait for virulence. However, the most important difference seems to be related to glycerol conversion to dihydroxyacetone-phosphate, which produces toxic hydrogen peroxide. This activity, missing only in M. flocculare, may be directly involved in cytotoxicity, as already described for two lung pathogenic mycoplasmas, namely Mycoplasma pneumoniae in human and Mycoplasma mycoides subsp. mycoides in ruminants. Metabolomic data suggest that even though these mycoplasmas are extremely similar in terms of genome and metabolism, distinct products and reaction rates may be the result of differential expression throughout the species.
Conclusions: We were able to infer from the reconstructed networks that the lack of pathogenicity of M. flocculare compared with the highly pathogenic M. hyopneumoniae may be related to its incapacity to produce cytotoxic hydrogen peroxide. Moreover, the ability of M. hyorhinis to grow in diverse sites and even in different hosts may be a reflection of its enhanced and wider carbohydrate uptake. Altogether, the metabolic differences highlighted in silico and in vitro provide important insights into the different levels of pathogenicity observed in each of the studied species.
Affiliation(s)
- Mariana G Ferrarini
- ERABLE, Inria, 43, Bd du 11 Novembre 1918, Villeurbanne, France
- CBiot, UFRGS, Av Bento Gonçalves, Porto Alegre, 9500, Brazil
- Laboratoire de Biométrie et Biologie Évolutive, Université de Lyon, 43, Bd du 11 Novembre 1918, Villeurbanne, France
- Scheila G Mucha
- CBiot, UFRGS, Av Bento Gonçalves, Porto Alegre, 9500, Brazil
- Tony L Palama
- Université de Lyon, Institut des Sciences Analytiques (CNRS, ENS Lyon, Université Lyon 1), 5, Rue de la Doua, Villeurbanne, France
- Current address: LISBP - INSA Toulouse, Toulouse, France
- Élodie Jobard
- Université de Lyon, Institut des Sciences Analytiques (CNRS, ENS Lyon, Université Lyon 1), 5, Rue de la Doua, Villeurbanne, France
- Bénédicte Elena-Herrmann
- Université de Lyon, Institut des Sciences Analytiques (CNRS, ENS Lyon, Université Lyon 1), 5, Rue de la Doua, Villeurbanne, France
- Université de Lyon, Centre Léon Bérard, Département d'oncologie médicale, 28, rue Laënnec, Lyon, France
- Ana T R Vasconcelos
- Laboratório Nacional de Computação Científica, Av. Getúlio Vargas, 333, Petrópolis, Brazil
- Florence Tardy
- Anses, Laboratoire de Lyon, UMR Mycoplasmoses des Ruminants, 31, Av Tony Garnier, Lyon, France
- Université de Lyon, VetAgro Sup, UMR Mycoplasmoses des Ruminants, 1 Avenue Bourgelat, Marcy L'Étoile, France
- Irene S Schrank
- CBiot, UFRGS, Av Bento Gonçalves, Porto Alegre, 9500, Brazil
- Arnaldo Zaha
- CBiot, UFRGS, Av Bento Gonçalves, Porto Alegre, 9500, Brazil
- Marie-France Sagot
- ERABLE, Inria, 43, Bd du 11 Novembre 1918, Villeurbanne, France
- Laboratoire de Biométrie et Biologie Évolutive, Université de Lyon, 43, Bd du 11 Novembre 1918, Villeurbanne, France

18
Kuwahara H, Alazmi M, Cui X, Gao X. MRE: a web tool to suggest foreign enzymes for the biosynthesis pathway design with competing endogenous reactions in mind. Nucleic Acids Res 2016; 44:W217-25. [PMID: 27131375 PMCID: PMC4987905 DOI: 10.1093/nar/gkw342] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 02/05/2016] [Accepted: 04/18/2016] [Indexed: 01/01/2023] Open
Abstract
To rationally design a productive heterologous biosynthesis system, it is essential to consider the suitability of foreign reactions for the specific endogenous metabolic infrastructure of a host. We developed a novel web server, called MRE, which, for a given pair of starting and desired compounds in a given chassis organism, ranks biosynthesis routes from the perspective of the integration of new reactions into the endogenous metabolic system. For each promising heterologous biosynthesis pathway, MRE suggests actual enzymes for foreign metabolic reactions and generates information on competing endogenous reactions for the consumption of metabolites. These unique, chassis-centered features distinguish MRE from existing pathway design tools and allow synthetic biologists to evaluate the design of their biosynthesis systems from a different angle. By using biosynthesis of a range of high-value natural products as a case study, we show that MRE is an effective tool to guide the design and optimization of heterologous biosynthesis pathways. The URL of MRE is http://www.cbrc.kaust.edu.sa/mre/.
Affiliation(s)
- Hiroyuki Kuwahara
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955, Saudi Arabia
- Meshari Alazmi
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955, Saudi Arabia
- Xuefeng Cui
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955, Saudi Arabia
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Thuwal, 23955, Saudi Arabia

19
Dönertaş HM, Martínez Cuesta S, Rahman SA, Thornton JM. Characterising Complex Enzyme Reaction Data. PLoS One 2016; 11:e0147952. [PMID: 26840640 PMCID: PMC4740462 DOI: 10.1371/journal.pone.0147952] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 10/01/2015] [Accepted: 01/11/2016] [Indexed: 01/05/2023] Open
Abstract
The relationship between enzyme-catalysed reactions and the Enzyme Commission (EC) number, the widely accepted classification scheme used to characterise enzyme activity, is complex and, with the rapid increase in our knowledge of the reactions catalysed by enzymes, needs revisiting. We present a manual and computational analysis to investigate this complexity and find that almost one-third of all known EC numbers are linked to more than one reaction in the secondary reaction databases (e.g., KEGG). Although this complexity is often resolved by defining generic, alternative and partial reactions, we have also found individual EC numbers with more than one reaction catalysing different types of bond changes. This analysis adds a new dimension to our understanding of enzyme function and might be useful for the accurate annotation of the function of enzymes and to study the changes in enzyme function during evolution.
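The central count in this abstract, the share of EC numbers linked to more than one reaction, can be sketched as a simple grouping over (EC number, reaction) pairs. This is a hedged illustration only, not the authors' pipeline: the function names and the reaction identifiers `R_A` to `R_D` are hypothetical placeholders, and the toy pairs are not taken from KEGG or any other database.

```python
from collections import defaultdict


def ec_to_reactions(pairs):
    """Group (ec_number, reaction_id) pairs into a dict: EC number -> set of reaction ids."""
    mapping = defaultdict(set)
    for ec_number, reaction_id in pairs:
        mapping[ec_number].add(reaction_id)
    return mapping


def multi_reaction_fraction(mapping):
    """Fraction of EC numbers associated with more than one distinct reaction."""
    if not mapping:
        return 0.0
    multi = sum(1 for reactions in mapping.values() if len(reactions) > 1)
    return multi / len(mapping)


# Toy input: reaction ids are made-up placeholders; the duplicate pair is counted once.
toy_pairs = [
    ("1.1.1.1", "R_A"),
    ("1.1.1.1", "R_B"),
    ("2.7.1.1", "R_C"),
    ("3.4.21.4", "R_D"),
    ("1.1.1.1", "R_A"),
]
mapping = ec_to_reactions(toy_pairs)
fraction = multi_reaction_fraction(mapping)
```

Using sets rather than lists means that the same reaction listed twice for one EC number does not inflate the count, which matters when pairs are merged from several sources.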
Affiliation(s)
- Handan Melike Dönertaş
- European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Department of Biological Sciences, Middle East Technical University, Ankara, Turkey
- Sergio Martínez Cuesta
- European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Syed Asad Rahman
- European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom
- Janet M. Thornton
- European Molecular Biology Laboratory, European Bioinformatics Institute EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge, United Kingdom

20
Promponas VJ, Iliopoulos I, Ouzounis CA. Annotation inconsistencies beyond sequence similarity-based function prediction - phylogeny and genome structure. Stand Genomic Sci 2015; 10:108. [PMID: 26594309 PMCID: PMC4653902 DOI: 10.1186/s40793-015-0101-2] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 04/05/2015] [Accepted: 11/11/2015] [Indexed: 12/15/2022] Open
Abstract
The function annotation process in computational biology has increasingly shifted from the traditional characterization of individual biochemical roles of protein molecules to the system-wide detection of entire metabolic pathways and genomic structures. The so-called genome-aware methods broaden misannotation inconsistencies in genome sequences beyond protein function assignments, encompassing phylogenetic anomalies and artifactual genomic regions. We outline three categories of error propagation in databases by providing striking examples – at various levels of appreciation by the community from traditional to emerging, thus raising awareness for future solutions.
Affiliation(s)
- Vasilis J Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, PO Box 20537, CY-1678 Nicosia, Cyprus
- Ioannis Iliopoulos
- Division of Medical Sciences, University of Crete Medical School, GR-71110 Heraklion, Greece
- Christos A Ouzounis
- Biological Computation & Process Laboratory (BCPL), Chemical Process & Energy Resources Institute (CPERI), Centre for Research & Technology Hellas (CERTH), PO Box 361, GR-57001 Thessalonica, Greece

21
Larocque M, Chénard T, Najmanovich R. A curated C. difficile strain 630 metabolic network: prediction of essential targets and inhibitors. BMC Systems Biology 2014; 8:117. [PMID: 25315994 PMCID: PMC4207893 DOI: 10.1186/s12918-014-0117-z] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.8] [Received: 07/30/2014] [Accepted: 10/08/2014] [Indexed: 12/12/2022]
Abstract
BACKGROUND: Clostridium difficile is the leading cause of hospital-borne infections occurring when the natural intestinal flora is depleted following antibiotic treatment. Current treatments for Clostridium difficile infections present high relapse rates and new hyper-virulent and multi-resistant strains are emerging, making the study of this nosocomial pathogen necessary to find novel therapeutic targets. RESULTS: We present iMLTC806cdf, an extensively curated reconstructed metabolic network for the C. difficile pathogenic strain 630. iMLTC806cdf contains 806 genes, 703 metabolites and 769 metabolic, 117 exchange and 145 transport reactions. iMLTC806cdf is the most complete and accurate metabolic reconstruction of a gram-positive anaerobic bacterium to date. We validate the model with simulated growth assays in different media and carbon sources and use it to predict essential genes. We obtain 89.2% accuracy in the prediction of gene essentiality when compared to experimental data for B. subtilis homologs (the closest organism for which such data exists). We predict the existence of 76 essential genes and 39 essential gene pairs, a number of which are unique to C. difficile and have non-existing or predicted non-essential human homologs. For 29 of these potential therapeutic targets, we find 125 inhibitors of homologous proteins including approved drugs with the potential for drug repositioning, that when validated experimentally could serve as starting points in the development of new antibiotics. CONCLUSIONS: We created a highly curated metabolic network model of C. difficile strain 630 and used it to predict essential genes as potential new therapeutic targets in the fight against Clostridium difficile infections.
Affiliation(s)
- Mathieu Larocque
- Department of Biochemistry, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, QC, J1H 5N4, Canada
- Thierry Chénard
- Department of Biochemistry, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, QC, J1H 5N4, Canada
- Rafael Najmanovich
- Department of Biochemistry, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, QC, J1H 5N4, Canada

22
Gene network biological validity based on gene-gene interaction relevance. ScientificWorldJournal 2014; 2014:540679. [PMID: 25295303 PMCID: PMC4175387 DOI: 10.1155/2014/540679] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 04/25/2014] [Accepted: 07/11/2014] [Indexed: 01/17/2023] Open
Abstract
In recent years, gene networks have become one of the most useful tools for modeling biological processes. Many gene network inference algorithms have been developed as techniques for extracting knowledge from gene expression data. Ensuring the reliability of the inferred gene relationships is a crucial task in any study in order to prove that the algorithms used are precise. Usually, this validation process can be carried out using prior biological knowledge. The metabolic pathways stored in KEGG are one of the most widely used knowledge sources for analyzing relationships between genes. This paper introduces a new methodology, GeneNetVal, to assess the biological validity of gene networks based on the relevance of the gene-gene interactions stored in KEGG metabolic pathways. Hence, a complete KEGG pathway conversion into a gene association network and a new matching distance based on gene-gene interaction relevance are proposed. The performance of GeneNetVal was established with three different experiments. Firstly, our proposal is tested in a comparative ROC analysis. Secondly, a randomness study is presented to show the behavior of GeneNetVal when the noise is increased in the input network. Finally, the ability of GeneNetVal to detect biological functionality of the network is shown.

23
Sorokina M, Stam M, Médigue C, Lespinet O, Vallenet D. Profiling the orphan enzymes. Biol Direct 2014; 9:10. [PMID: 24906382 PMCID: PMC4084501 DOI: 10.1186/1745-6150-9-10] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 03/27/2014] [Accepted: 05/29/2014] [Indexed: 11/10/2022] Open
Abstract
The emergence of Next Generation Sequencing generates an incredible amount of sequence and great potential for new enzyme discovery. Despite this huge amount of data and the profusion of bioinformatic methods for function prediction, a large part of known enzyme activities is still lacking an associated protein sequence. These particular activities are called "orphan enzymes". The present review proposes an update of previous surveys on orphan enzymes by mining the current content of public databases. While the percentage of orphan enzyme activities has decreased from 38% to 22% in ten years, there are still more than 1,000 orphans among the 5,000 entries of the Enzyme Commission (EC) classification. Taking into account all the reactions present in metabolic databases, this proportion dramatically increases to reach nearly 50% of orphans and many of them are not associated to a known pathway. We extended our survey to "local orphan enzymes" that are activities which have no representative sequence in a given clade, but have at least one in organisms belonging to other clades. We observe an important bias in Archaea and find that in general more than 30% of the EC activities have incomplete sequence information in at least one superkingdom. To estimate if candidate proteins for local orphans could be retrieved by homology search, we applied a simple strategy based on the PRIAM software and noticed that candidates may be proposed for an important fraction of local orphan enzymes. Finally, by studying relation between protein domains and catalyzed activities, it appears that newly discovered enzymes are mostly associated with already known enzyme domains. Thus, the exploration of the promiscuity and the multifunctional aspect of known enzyme families may solve part of the orphan enzyme issue. We conclude this review with a presentation of recent initiatives in finding proteins for orphan enzymes and in extending the enzyme world by the discovery of new activities.
Affiliation(s)
- Maria Sorokina
- Direction des Sciences du Vivant, Commissariat à l'Energie Atomique (CEA), Institut de Génomique, Genoscope, Laboratoire d'Analyses Bioinformatiques pour la Génomique et le Métabolisme, 2 rue Gaston Crémieux, 91057 Evry, France.

24
Silveira SDA, de Melo-Minardi RC, da Silveira CH, Santoro MM, Meira Jr W. ENZYMAP: exploiting protein annotation for modeling and predicting EC number changes in UniProt/Swiss-Prot. PLoS One 2014; 9:e89162. [PMID: 24586563 PMCID: PMC3929618 DOI: 10.1371/journal.pone.0089162] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/12/2013] [Accepted: 01/19/2014] [Indexed: 11/18/2022] Open
Abstract
The volume and diversity of biological data are increasing at very high rates. Vast amounts of protein sequences and structures, protein and genetic interactions and phenotype studies have been produced. The majority of data generated by high-throughput devices is automatically annotated because manually annotating them is not possible. Thus, efficient and precise automatic annotation methods are required to ensure the quality and reliability of both the biological data and associated annotations. We proposed ENZYMatic Annotation Predictor (ENZYMAP), a technique to characterize and predict EC number changes based on annotations from UniProt/Swiss-Prot using a supervised learning approach. We evaluated ENZYMAP experimentally, using test data sets from both UniProt/Swiss-Prot and UniProt/TrEMBL, and showed that predicting EC changes using selected types of annotation is possible. Finally, we compared ENZYMAP and DETECT with respect to their predictions and checked both against the UniProt/Swiss-Prot annotations. ENZYMAP was shown to be more accurate than DETECT, coming closer to the actual changes in UniProt/Swiss-Prot. Our proposal is intended to be an automatic complementary method (that can be used together with other techniques like the ones based on protein sequence and structure) that helps to improve the quality and reliability of enzyme annotations over time, suggesting possible corrections, anticipating annotation changes and propagating the implicit knowledge for the whole dataset.
Affiliation(s)
- Sabrina de Azevedo Silveira
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Marcelo Matos Santoro
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
- Wagner Meira Jr
- Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
|
25
|
Stobbe MD, Swertz MA, Thiele I, Rengaw T, van Kampen AHC, Moerland PD. Consensus and conflict cards for metabolic pathway databases. BMC Syst Biol 2013; 7:50. [PMID: 23803311 PMCID: PMC3703255 DOI: 10.1186/1752-0509-7-50] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 10/23/2012] [Accepted: 06/20/2013] [Indexed: 01/04/2023]
Abstract
Background The metabolic network of H. sapiens and many other organisms is described in multiple pathway databases. The level of agreement between these descriptions, however, has proven to be low. We can use these different descriptions to our advantage by identifying conflicting information and combining their knowledge into a single, more accurate, and more complete description. This task is, however, far from trivial. Results We introduce the concept of Consensus and Conflict Cards (C2Cards) to provide concise overviews of what the databases do or do not agree on. Each card is centered at a single gene, EC number or reaction. These three complementary perspectives make it possible to distinguish disagreements on the underlying biology of a metabolic process from differences that can be explained by different decisions on how and in what detail to represent knowledge. As a proof-of-concept, we implemented C2CardsHuman, as a web application http://www.molgenis.org/c2cards, covering five human pathway databases. Conclusions C2Cards can contribute to ongoing reconciliation efforts by simplifying the identification of consensus and conflicts between pathway databases and lowering the threshold for experts to contribute. Several case studies illustrate the potential of the C2Cards in identifying disagreements on the underlying biology of a metabolic process. The overviews may also point out controversial biological knowledge that should be subject of further research. Finally, the examples provided emphasize the importance of manual curation and the need for a broad community involvement.
Affiliation(s)
- Miranda D Stobbe
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, Amsterdam 1100 DE, the Netherlands
|
26
|
Feng X, Zhuang WQ, Colletti P, Tang YJ. Metabolic pathway determination and flux analysis in nonmodel microorganisms through 13C-isotope labeling. Methods Mol Biol 2012; 881:309-30. [PMID: 22639218 DOI: 10.1007/978-1-61779-827-6_11] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Indexed: 01/05/2023]
Abstract
(13)C-isotope labeling is a commonly used technique for determining and quantifying pathways in microorganisms under various growth conditions. The experimental protocol consists of feeding the cell with a composition-defined substrate and measuring isotopic labeling patterns in the synthesized metabolites (often the amino acids). Not only can the labeling information be cross-referenced with genomic information to identify the novel pathways, but it can also be used to decipher absolute carbon fluxes through the metabolic network of interest. This technique can be widely used for functional characterization of nonmodel microbial species, and thus we provide a (13)C-pathway and flux analysis protocol. The five key procedures are: (1) growing cells using labeled substrates, (2) measuring extracellular metabolite and biomass component, (3) analyzing isotopic labeling patterns in amino acids and central metabolites using gas chromatography-mass spectrometry, (4) tracing (13)C carbon transitions in metabolites and discovering new pathways, and (5) estimating flux distributions based on isotopomer constraints. This protocol provides complementary information to the recently published protocol for (13)C-based metabolic flux analysis of the model species Escherichia coli (Nat Protoc 4:878-892, 2009).
Affiliation(s)
- Xueyang Feng
- Department of Energy, Environmental and Chemical Engineering, Washington University, St. Louis, MO, USA
|
27
|
Stobbe MD, Houten SM, Jansen GA, van Kampen AHC, Moerland PD. Critical assessment of human metabolic pathway databases: a stepping stone for future integration. BMC Syst Biol 2011; 5:165. [PMID: 21999653 PMCID: PMC3271347 DOI: 10.1186/1752-0509-5-165] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.1] [Received: 07/29/2011] [Accepted: 10/14/2011] [Indexed: 01/17/2023]
Abstract
Background Multiple pathway databases are available that describe the human metabolic network and have proven their usefulness in many applications, ranging from the analysis and interpretation of high-throughput data to their use as a reference repository. However, so far the various human metabolic networks described by these databases have not been systematically compared and contrasted, nor has the extent to which they differ been quantified. For a researcher using these databases for particular analyses of human metabolism, it is crucial to know the extent of the differences in content and their underlying causes. Moreover, the outcomes of such a comparison are important for ongoing integration efforts. Results We compared the genes, EC numbers and reactions of five frequently used human metabolic pathway databases. The overlap is surprisingly low, especially on reaction level, where the databases agree on 3% of the 6968 reactions they have combined. Even for the well-established tricarboxylic acid cycle the databases agree on only 5 out of the 30 reactions in total. We identified the main causes for the lack of overlap. Importantly, the databases are partly complementary. Other explanations include the number of steps a conversion is described in and the number of possible alternative substrates listed. Missing metabolite identifiers and ambiguous names for metabolites also affect the comparison. Conclusions Our results show that each of the five networks compared provides us with a valuable piece of the puzzle of the complete reconstruction of the human metabolic network. To enable integration of the networks, next to a need for standardizing the metabolite names and identifiers, the conceptual differences between the databases should be resolved. Considerable manual intervention is required to reach the ultimate goal of a unified and biologically accurate model for studying the systems biology of human metabolism. Our comparison provides a stepping stone for such an endeavor.
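The agreement figure quoted in this abstract (the share of the combined reactions that all databases contain) can be illustrated with plain set operations. The sketch below is only an assumption about how such an overlap might be computed, not the authors' pipeline; the function name `consensus_fraction`, the database names and the reaction identifiers are all hypothetical.

```python
def consensus_fraction(databases):
    """Return (n_consensus, n_combined, fraction) for a mapping of
    database name -> collection of reaction identifiers.

    n_consensus: identifiers present in every database (intersection).
    n_combined:  identifiers present in at least one database (union).
    """
    sets = [set(items) for items in databases.values()]
    if not sets:
        return 0, 0, 0.0
    combined = set().union(*sets)
    consensus = set.intersection(*sets)
    fraction = len(consensus) / len(combined) if combined else 0.0
    return len(consensus), len(combined), fraction


# Toy data (hypothetical): three databases, five distinct reactions in total.
toy = {
    "db_a": {"R1", "R2", "R3"},
    "db_b": {"R1", "R3", "R4"},
    "db_c": {"R1", "R5"},
}

n_consensus, n_combined, fraction = consensus_fraction(toy)
print(f"{n_consensus} of {n_combined} reactions shared by all databases ({fraction:.0%})")
```

In practice the hard part, as the abstract notes, is deciding when two entries denote the same reaction; the sketch assumes identifiers have already been harmonized.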
Affiliation(s)
- Miranda D Stobbe
- Bioinformatics Laboratory, Academic Medical Center, University of Amsterdam, PO Box 22700, 1100 DE, Amsterdam, the Netherlands
|
28
|
Roberts RJ, Chang YC, Hu Z, Rachlin JN, Anton BP, Pokrzywa RM, Choi HP, Faller LL, Guleria J, Housman G, Klitgord N, Mazumdar V, McGettrick MG, Osmani L, Swaminathan R, Tao KR, Letovsky S, Vitkup D, Segrè D, Salzberg SL, Delisi C, Steffen M, Kasif S. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes. Nucleic Acids Res 2010; 39:D11-4. [PMID: 21097892 PMCID: PMC3013729 DOI: 10.1093/nar/gkq1168] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Indexed: 11/25/2022] Open
Abstract
COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
|
29
|
Hung SS, Wasmuth J, Sanford C, Parkinson J. DETECT--a density estimation tool for enzyme classification and its application to Plasmodium falciparum. Bioinformatics 2010; 26:1690-8. [PMID: 20513663 DOI: 10.1093/bioinformatics/btq266] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION A major challenge in genomics is the accurate annotation of component genes. Enzymes are typically predicted using homology-based search methods, where the membership of a protein to an enzyme family is based on single-sequence comparisons. As such, these methods are often error-prone and lack useful measures of reliability for the prediction. RESULTS Here, we present DETECT, a probabilistic method for enzyme prediction that accounts for the sequence diversity across enzyme families. By comparing the global alignment scores of an unknown protein to those of all known enzymes, an integrated likelihood score can be readily calculated, ranking the reaction classes relevant for that protein. Comparisons to BLAST reveal significant improvements in enzyme annotation accuracy. Applied to Plasmodium falciparum, we identify potential annotation errors and predict novel enzymes of therapeutic interest. AVAILABILITY A standalone application is available from the website: http://www.compsysbio.org/projects/DETECT/
Affiliation(s)
- Stacy S Hung
- Program in Molecular Structure and Function, Hospital for Sick Children, 15-704 MaRS TMDT East, 101 College Street, Toronto, ON M5G 1L7, Canada
|
30
|
|
31
|
Pinzon A, Rodriguez-R LM, Gonzalez A, Bernal A, Restrepo S. Targeted metabolic reconstruction: a novel approach for the characterization of plant-pathogen interactions. Brief Bioinform 2010; 12:151-62. [DOI: 10.1093/bib/bbq009] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022] Open
|
32
|
Abstract
The chemical industry is currently undergoing a dramatic change driven by demand for developing more sustainable processes for the production of fuels, chemicals, and materials. In biotechnological processes different microorganisms can be exploited, and the large diversity of metabolic reactions represents a rich repository for the design of chemical conversion processes that lead to efficient production of desirable products. However, often microorganisms that produce a desirable product, either naturally or because they have been engineered through insertion of heterologous pathways, have low yields and productivities, and in order to establish an economically viable process it is necessary to improve the performance of the microorganism. Here metabolic engineering is the enabling technology. Through metabolic engineering the metabolic landscape of the microorganism is engineered such that there is an efficient conversion of the raw material, typically glucose, to the product of interest. This process may involve insertion of new enzyme activities and deletion of existing enzyme activities, but often also deregulation of existing regulatory structures operating in the cell. In order to rapidly identify the optimal metabolic engineering strategy the industry is to an increasing extent looking into the use of tools from systems biology. This involves both x-ome technologies such as transcriptome, proteome, metabolome, and fluxome analysis, and advanced mathematical modeling tools such as genome-scale metabolic modeling. Here we look into the history of these different techniques and review how they find application in industrial biotechnology, which will lead to what we here define as industrial systems biology.
Affiliation(s)
- José Manuel Otero
- Department of Chemical and Biological Engineering, Chalmers University of Technology, Göteborg, Sweden
|
33
|
Grossetête S, Labedan B, Lespinet O. FUNGIpath: a tool to assess fungal metabolic pathways predicted by orthology. BMC Genomics 2010; 11:81. [PMID: 20122162 PMCID: PMC2829015 DOI: 10.1186/1471-2164-11-81] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.6] [Received: 06/25/2009] [Accepted: 02/01/2010] [Indexed: 11/29/2022] Open
Abstract
Background More and more completely sequenced fungal genomes are becoming available and many more sequencing projects are in progress. This deluge of data should improve our knowledge of the various primary and secondary metabolisms of Fungi, including their synthesis of useful compounds such as antibiotics or toxic molecules such as mycotoxins. Functional annotation of many fungal genomes is imperfect, especially of genes encoding enzymes, so we need dedicated tools to analyze their metabolic pathways in depth. Description FUNGIpath is a new tool built using a two-stage approach. Groups of orthologous proteins predicted using complementary methods of detection were collected in a relational database. Each group was further mapped on to steps in the metabolic pathways published in the public databases KEGG and MetaCyc. As a result, FUNGIpath allows the primary and secondary metabolisms of the different fungal species represented in the database to be compared easily, making it possible to assess the level of specificity of various pathways at different taxonomic distances. It is freely accessible at http://www.fungipath.u-psud.fr. Conclusions As more and more fungal genomes are expected to be sequenced during the coming years, FUNGIpath should help progressively to reconstruct the ancestral primary and secondary metabolisms of the main branches of the fungal tree of life and to elucidate the evolution of these ancestral fungal metabolisms to various specific derived metabolisms.
Affiliation(s)
- Sandrine Grossetête
- Institut de Génétique et de Microbiologie, Université Paris-Sud 11, CNRS UMR 8621, Bâtiment 400, 91405 Orsay Cedex, France
|
34
|
Affiliation(s)
- Kimmen Sjölander
- Department of Bioengineering, University of California, Berkeley, Berkeley, California, United States of America.
|
35
|
Hsiao TL, Revelles O, Chen L, Sauer U, Vitkup D. Automatic policing of biochemical annotations using genomic correlations. Nat Chem Biol 2009; 6:34-40. [PMID: 19935659 PMCID: PMC2935526 DOI: 10.1038/nchembio.266] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Received: 06/09/2009] [Accepted: 09/10/2009] [Indexed: 11/09/2022]
Abstract
With the increasing role of computational tools in the analysis of sequenced genomes, there is an urgent need to maintain high accuracy of functional annotations. Misannotations can be easily generated and propagated through databases by functional transfer based on sequence homology. We developed and optimized an automatic policing method to detect biochemical misannotations using context genomic correlations. The method works by finding genes with unusually weak genomic correlations in their assigned network positions. We demonstrate the accuracy of the method using a cross-validated approach. In addition, we show that the method identifies a significant number of potential misannotations in Bacillus subtilis, including metabolic assignments already shown to be incorrect experimentally. The experimental analysis of the mispredicted genes forming the leucine degradation pathway in B. subtilis demonstrates that computational policing tools can generate important biological hypotheses.
Affiliation(s)
- Tzu-Lin Hsiao
- Center for Computational Biology and Bioinformatics and Department of Biomedical Informatics, Columbia University, Irving Cancer Research Center, New York, New York, USA
|
36
|
Goffard N, Frickey T, Weiller G. PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res 2009; 37:W335-9. [PMID: 19474337 PMCID: PMC2703986 DOI: 10.1093/nar/gkp432] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022] Open
Abstract
The post-genomic era presents us with the challenge of linking the vast amount of raw data obtained with transcriptomic and proteomic techniques to relevant biological pathways. We present an update of PathExpress, a web-based tool to interpret gene-expression data and explore the metabolic network without being restricted to predefined pathways. We define the Enzyme Neighbourhood (EN) as a sub-network of linked enzymes with a limited path length to identify the most relevant sub-networks affected in gene-expression experiments. PathExpress is freely available at: http://bioinfoserver.rsbs.anu.edu.au/utils/PathExpress/.
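The abstract defines an Enzyme Neighbourhood as a sub-network of linked enzymes with a limited path length. One way to picture this is a breadth-first search that stops at a fixed depth. The sketch below is an illustration under that assumption only, not the PathExpress implementation; the function name `enzyme_neighbourhood`, the adjacency-list representation and the links in the toy graph are hypothetical.

```python
from collections import deque


def enzyme_neighbourhood(graph, seed, max_path_length):
    """Return {enzyme: distance} for all enzymes reachable from `seed`
    within `max_path_length` links, using breadth-first search.

    `graph` maps each enzyme identifier to the enzymes it is linked to.
    """
    distances = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distances[node] == max_path_length:
            continue  # do not expand beyond the allowed path length
        for neighbour in graph.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances


# Toy linear network (the links are made up for illustration).
toy_graph = {
    "1.1.1.1": ["2.7.1.1"],
    "2.7.1.1": ["1.1.1.1", "4.1.2.13"],
    "4.1.2.13": ["2.7.1.1", "5.3.1.1"],
    "5.3.1.1": ["4.1.2.13"],
}

print(enzyme_neighbourhood(toy_graph, "1.1.1.1", max_path_length=2))
```

With a path-length limit of 2, the enzyme three links away from the seed is excluded from the neighbourhood.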
Affiliation(s)
- Nicolas Goffard
- ARC Centre of Excellence for Integrative Legume Research, Genomic Interactions Group, School of Biology, Australian National University, Canberra ACT 2601, Australia
|
37
|
A genome-scale metabolic reconstruction of Mycoplasma genitalium, iPS189. PLoS Comput Biol 2009; 5:e1000285. [PMID: 19214212 PMCID: PMC2633051 DOI: 10.1371/journal.pcbi.1000285] [Citation(s) in RCA: 111] [Impact Index Per Article: 7.4] [Received: 05/28/2008] [Accepted: 01/02/2009] [Indexed: 11/23/2022] Open
Abstract
With a genome size of ∼580 kb and approximately 480 protein coding regions, Mycoplasma genitalium is one of the smallest known self-replicating organisms and, additionally, has extremely fastidious nutrient requirements. The reduced genomic content of M. genitalium has led researchers to suggest that the molecular assembly contained in this organism may be a close approximation to the minimal set of genes required for bacterial growth. Here, we introduce a systematic approach for the construction and curation of a genome-scale in silico metabolic model for M. genitalium. Key challenges included estimation of biomass composition, handling of enzymes with broad specificities, and the lack of a defined medium. Computational tools were subsequently employed to identify and resolve connectivity gaps in the model as well as growth prediction inconsistencies with gene essentiality experimental data. The curated model, M. genitalium iPS189 (262 reactions, 274 metabolites), is 87% accurate in recapitulating in vivo gene essentiality results for M. genitalium. Approaches and tools described herein provide a roadmap for the automated construction of in silico metabolic models of other organisms.
There is growing interest in elucidating the minimal number of genes needed for life. This challenge is important not just for fundamental but also practical considerations arising from the need to design microorganisms exquisitely tuned for particular applications. The genome of the pathogen Mycoplasma genitalium is believed to be a close approximation to the minimal set of genes required for bacterial growth. In this paper, we constructed a genome-scale metabolic model of M. genitalium that mathematically describes a unified characterization of its biochemical capabilities. The model accounts for 189 of the 482 genes listed in the latest genome annotation. We used computational tools during the process to bridge network gaps in the model and restore consistency with experimental data that determined which gene deletions led to cell death (i.e., are essential). We achieved 87% correct model predictions for essential genes and 89% for non-essential genes. We subsequently used the metabolic model to determine components that must be part of the growth medium. The approaches and tools described here provide a roadmap for the automated metabolic reconstruction of other organisms. This task is becoming increasingly critical as genome sequencing for new organisms is proceeding at an ever-accelerating pace.
|
38
|
Wylie T, Martin J, Abubucker S, Yin Y, Messina D, Wang Z, McCarter JP, Mitreva M. NemaPath: online exploration of KEGG-based metabolic pathways for nematodes. BMC Genomics 2008; 9:525. [PMID: 18983679 PMCID: PMC2588608 DOI: 10.1186/1471-2164-9-525] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Received: 06/30/2008] [Accepted: 11/04/2008] [Indexed: 11/13/2022] Open
Abstract
Background Nematode.net is a web-accessible resource for investigating gene sequences from parasitic and free-living nematode genomes. Beyond the well-characterized model nematode C. elegans, over 500,000 expressed sequence tags (ESTs) and nearly 600,000 genome survey sequences (GSSs) have been generated from 36 nematode species as part of the Parasitic Nematode Genomics Program undertaken by the Genome Center at Washington University School of Medicine. However, these sequencing data are not present in most publicly available protein databases, which only include sequences in Swiss-Prot. Swiss-Prot, in turn, relies on GenBank/EMBL/DDBJ for predicted proteins from complete genomes or full-length proteins. Description Here we present the NemaPath pathway server, a web-based pathway-level visualization tool for navigating putative metabolic pathways for over 30 nematode species, including 27 parasites. The NemaPath approach consists of two parts: 1) a backend tool to align and evaluate nematode genomic sequences (curated EST contigs) against the annotated Kyoto Encyclopedia of Genes and Genomes (KEGG) protein database; 2) a web viewing application that displays annotated KEGG pathway maps based on desired confidence levels of primary sequence similarity as defined by a user. NemaPath also provides cross-referenced access to nematode genome information provided by other tools available on Nematode.net, including: detailed NemaGene EST cluster information; putative translations; GBrowse EST cluster views; links from nematode data to external databases for corresponding synonymous C. elegans counterparts, subject matches in KEGG's gene database, and also KEGG Ontology (KO) identification. Conclusion The NemaPath server hosts metabolic pathway mappings for 30 nematode species and is available on the World Wide Web at . The nematode source sequences used for the metabolic pathway mappings are available via FTP, as provided by the Genome Center at Washington University School of Medicine.
Affiliation(s)
- Todd Wylie
- The Genome Center at Washington University School of Medicine, St. Louis, MO 63108, USA.
|
39
|
Latino DARS, Zhang QY, Aires-de-Sousa J. Genome-scale classification of metabolic reactions and assignment of EC numbers with self-organizing maps. Bioinformatics 2008; 24:2236-44. [PMID: 18676416 DOI: 10.1093/bioinformatics/btn405] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer-aided validation of classification systems, to genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Comparison of metabolic reactions has been mostly based on Enzyme Commission (EC) numbers, which are extremely useful and widespread, but not always straightforward to apply, and often problematic when an enzyme catalyzes several reactions, when the same reaction is catalyzed by different enzymes, when official full EC numbers are unavailable or when reactions are not catalyzed by enzymes. Different methods should be available to compare metabolic reactions. Simultaneously, methods are required for the automatic assignment of EC numbers to reactions still not officially classified. RESULTS We have proposed the MOLMAP reaction descriptors to numerically encode the structural transformations resulting from a chemical reaction. Here, such descriptors are applied to the mapping of a genome-scale database of almost 4000 metabolic reactions by Kohonen self-organizing maps (SOMs), and its screening for inconsistencies in EC numbers. This approach allowed for the SOMs to assign EC numbers at the class, subclass and sub-subclass levels for reactions of independent test sets with accuracies up to 92, 80 and 70%, respectively. Different levels of similarity between training and test sets were explored. The approach also led to the identification of a number of similar reactions bearing differences at the EC class level. AVAILABILITY The programs to generate MOLMAP descriptors from atomic properties included in SDF files are available upon request for evaluation.
Affiliation(s)
- Diogo A R S Latino
- CQFB, REQUIMTE, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
|
40
|
Baart GJE, Zomer B, de Haan A, van der Pol LA, Beuvery EC, Tramper J, Martens DE. Modeling Neisseria meningitidis metabolism: from genome to metabolic fluxes. Genome Biol 2008; 8:R136. [PMID: 17617894 PMCID: PMC2323225 DOI: 10.1186/gb-2007-8-7-r136] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Received: 12/14/2006] [Revised: 03/16/2007] [Accepted: 07/06/2007] [Indexed: 01/22/2023] Open
Abstract
A genome-scale flux model for primary metabolism of Neisseria meningitidis was constructed; a minimal medium for growth of N. meningitidis was designed using the model and tested successfully in batch and chemostat cultures. Background Neisseria meningitidis is a human pathogen that can infect diverse sites within the human host. The major diseases caused by N. meningitidis are responsible for death and disability, especially in young infants. In general, most of the recent work on N. meningitidis focuses on potential antigens and their functions, immunogenicity, and pathogenicity mechanisms. Very little work has been carried out on Neisseria primary metabolism over the past 25 years. Results Using the genomic database of N. meningitidis serogroup B together with biochemical and physiological information in the literature we constructed a genome-scale flux model for the primary metabolism of N. meningitidis. The validity of a simplified metabolic network derived from the genome-scale metabolic network was checked using flux-balance analysis in chemostat cultures. Several useful predictions were obtained from in silico experiments, including substrate preference. A minimal medium for growth of N. meningitidis was designed and tested successfully in batch and chemostat cultures. Conclusion The verified metabolic model describes the primary metabolism of N. meningitidis in a chemostat in steady state. The genome-scale model is valuable because it offers a framework to study N. meningitidis metabolism as a whole, or certain aspects of it, and it can also be used for the purpose of vaccine process development (for example, the design of growth media). The flux distribution of the main metabolic pathways (that is, the pentose phosphate pathway and the Entner-Doudoroff pathway) indicates that the major part of pyruvate (69%) is synthesized through the ED-cleavage, a finding that is in good agreement with literature.
Affiliation(s)
- Gino JE Baart
- Unit Research & Development, Netherlands Vaccine Institute (NVI), PO Box 457, 3720 AL Bilthoven, The Netherlands
- Food and Bioprocess Engineering Group, Wageningen University, PO Box 8129, 6700 EV Wageningen, The Netherlands
- Bert Zomer
- Unit Research & Development, Netherlands Vaccine Institute (NVI), PO Box 457, 3720 AL Bilthoven, The Netherlands
- Alex de Haan
- Unit Research & Development, Netherlands Vaccine Institute (NVI), PO Box 457, 3720 AL Bilthoven, The Netherlands
- Leo A van der Pol
- Unit Research & Development, Netherlands Vaccine Institute (NVI), PO Box 457, 3720 AL Bilthoven, The Netherlands
- E Coen Beuvery
- PAT Consultancy, Kerkstraat 66, 4132 BG Vianen, The Netherlands
- Johannes Tramper
- Food and Bioprocess Engineering Group, Wageningen University, PO Box 8129, 6700 EV Wageningen, The Netherlands
- Dirk E Martens
- Food and Bioprocess Engineering Group, Wageningen University, PO Box 8129, 6700 EV Wageningen, The Netherlands
|
41
|
Tsui IF, Chari R, Buys TP, Lam WL. Public databases and software for the pathway analysis of cancer genomes. Cancer Inform 2007; 3:379-97. [PMID: 19455256 PMCID: PMC2410087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022] Open
Abstract
The study of pathway disruption is key to understanding cancer biology. Advances in high throughput technologies have led to the rapid accumulation of genomic data. The explosion in available data has generated opportunities for investigation of concerted changes that disrupt biological functions; this in turn has created a need for computational tools for pathway analysis. In this review, we discuss approaches to the analysis of genomic data and describe the publicly available resources for studying biological pathways.
Affiliation(s)
- Ivy F.L. Tsui
- Correspondence: Ivy Tsui, BC Cancer Research Centre, 675 West 10th Avenue, Vancouver, BC, V5Z 1L3, Canada. Tel: +1 604-675-8111; Fax: +1 604-675-8232.
|
42
|
Andorf C, Dobbs D, Honavar V. Exploring inconsistencies in genome-wide protein function annotations: a machine learning approach. BMC Bioinformatics 2007; 8:284. [PMID: 17683567 PMCID: PMC1994202 DOI: 10.1186/1471-2105-8-284] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 12/14/2006] [Accepted: 08/03/2007] [Indexed: 11/29/2022] Open
Abstract
Background Incorrectly annotated sequence data are becoming more commonplace as databases increasingly rely on automated techniques for annotation. Hence, there is an urgent need for computational methods for checking consistency of such annotations against independent sources of evidence and detecting potential annotation errors. We show how a machine learning approach designed to automatically predict a protein's Gene Ontology (GO) functional class can be employed to identify potential gene annotation errors. Results In a set of 211 previously annotated mouse protein kinases, we found that 201 of the GO annotations returned by AmiGO appear to be inconsistent with the UniProt functions assigned to their human counterparts. In contrast, 97% of the predicted annotations generated using a machine learning approach were consistent with the UniProt annotations of the human counterparts, as well as with available annotations for these mouse protein kinases in the Mouse Kinome database. Conclusion We conjecture that most of our predicted annotations are, therefore, correct and suggest that the machine learning approach developed here could be routinely used to detect potential errors in GO annotations generated by high-throughput gene annotation projects. Editors' Note: Authors from the original publication (Okazaki et al.: Nature 2002, 420:563–73) have provided their response to Andorf et al., directly following the correspondence.
43
Jones CE, Brown AL, Baumann U. Estimating the annotation error rate of curated GO database sequence annotations. BMC Bioinformatics 2007; 8:170. [PMID: 17519041 PMCID: PMC1892569 DOI: 10.1186/1471-2105-8-170] [Citation(s) in RCA: 101] [Impact Index Per Article: 5.9] [Received: 12/06/2006] [Accepted: 05/22/2007] [Indexed: 11/10/2022] Open
Abstract
Background: Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences. Results: We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%. Conclusion: While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.
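The error-injection idea in this abstract can be illustrated with a small sketch. The abstract states only that errors were added at known rates and that regression was used to model their effect on precision; the linear form, the backward extrapolation step, the function names and the `error_free_precision` parameter below are all assumptions made for illustration, not the authors' published procedure, and the numbers are synthetic.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_base_error(added_rates, precisions, error_free_precision):
    """Extrapolate the fitted line back to a hypothetical error-free dataset.

    Assumption (not from the abstract): pre-existing errors lower precision
    at the same slope as injected errors, so the base error rate is the
    distance along the x-axis from zero added errors to the point where the
    line would reach `error_free_precision`.
    """
    slope, intercept = fit_line(added_rates, precisions)
    return (intercept - error_free_precision) / slope


# Synthetic measurements: precision observed after injecting errors at known rates.
added = [0.0, 0.1, 0.2, 0.3]
observed = [0.80, 0.75, 0.70, 0.65]
base_error = estimate_base_error(added, observed, error_free_precision=0.90)
print(f"estimated base error rate: {base_error:.2f}")
```

Note that the estimate depends directly on the assumed error-free precision, which cannot be measured and has to be justified separately.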
Affiliation(s)
- Craig E Jones
  - School of Computer Science, University of Adelaide, South Australia, 5001
  - Australian Centre for Plant Functional Genomics, Waite Campus, University of Adelaide, South Australia, 5064
- Alfred L Brown
  - School of Computer Science, University of Adelaide, South Australia, 5001
- Ute Baumann
  - Australian Centre for Plant Functional Genomics, Waite Campus, University of Adelaide, South Australia, 5064
44
Koczyk G, Wyrwicz LS, Rychlewski L. LigProf: a simple tool for in silico prediction of ligand-binding sites. J Mol Model 2007; 13:445-55. [PMID: 17200839 DOI: 10.1007/s00894-006-0165-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 06/27/2006] [Accepted: 10/25/2006] [Indexed: 10/23/2022]
Abstract
With the increasing amount of data provided by both high-throughput sequencing and structural genomics studies, there is a growing need for tools to augment functional predictions for protein sequences. Broad descriptions of function can be provided by establishing the presence of protein domains associated with a particular function. To extend the domain-based annotation, LigProf provides predictions of potential ligands that bind to a protein, as well as critical residues that stabilize ligands. A P-value statistic for estimating the significance of motif occurrence is provided for all sites. Although the usefulness of the method will rise with increasing numbers of crystallographically solved molecules deposited in the PDB database, we show that it can already be applied successfully to the highly represented ligand-bound protein kinase domains of viral and human origin. The LigProf webserver is freely available at http://www.cropnet.pl/ligprof. At present, LigProf descriptors annotate and extend major protein families from the PfamA database.
Affiliation(s)
- Grzegorz Koczyk
  - Institute of Plant Genetics, Strzeszyńska 34, 60-479, Poznań, Poland.
45
Lespinet O, Labedan B. ORENZA: a web resource for studying ORphan ENZyme activities. BMC Bioinformatics 2006; 7:436. [PMID: 17026747 PMCID: PMC1609188 DOI: 10.1186/1471-2105-7-436] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.5] [Received: 07/25/2006] [Accepted: 10/06/2006] [Indexed: 11/18/2022] Open
Abstract
Background: Despite the current availability of several hundreds of thousands of amino acid sequences, more than 36% of the enzyme activities (EC numbers) defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) are not associated with any amino acid sequence in major public databases. This wide gap separating knowledge of biochemical function and sequence information is found for nearly all classes of enzymes. Thus, there is an urgent need to explore these sequence-less EC numbers, in order to progressively close this gap. Description: We designed ORENZA, a PostgreSQL database of ORphan ENZyme Activities, to collate information about the EC numbers defined by the NC-IUBMB with specific emphasis on orphan enzyme activities. Complete lists of all EC numbers and of orphan EC numbers are available and will be periodically updated. ORENZA allows one to browse the complete list of EC numbers or the subset associated with orphan enzymes or to query a specific EC number, an enzyme name or a species name for those interested in particular organisms. It is possible to search ORENZA for the different biochemical properties of the defined enzymes, the metabolic pathways in which they participate, the taxonomic data of the organisms whose genomes encode them, and many other features. The association of an enzyme activity with an amino acid sequence is clearly underlined, making it easy to identify at once the orphan enzyme activities. Interactive publishing of suggestions by the community would provide expert evidence for re-annotation of orphan EC numbers in public databases. Conclusion: ORENZA is a Web resource designed to progressively bridge the unwanted gap between function (enzyme activities) and sequence (dataset present in public databases). ORENZA should increase interactions between communities of biochemists and of genomicists. This is expected to reduce the number of orphan enzyme activities by allocating gene sequences to the relevant enzymes.
Affiliation(s)
- Olivier Lespinet
  - Institut de Génétique et Microbiologie, CNRS UMR 8621, Université Paris-Sud, Bâtiment 400, 91405 Orsay Cedex, France
- Bernard Labedan
  - Institut de Génétique et Microbiologie, CNRS UMR 8621, Université Paris-Sud, Bâtiment 400, 91405 Orsay Cedex, France
46
Galperin MY, Kolker E. New metrics for comparative genomics. Curr Opin Biotechnol 2006; 17:440-7. [PMID: 16978854 PMCID: PMC1764326 DOI: 10.1016/j.copbio.2006.08.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 07/14/2006] [Revised: 08/10/2006] [Accepted: 08/25/2006] [Indexed: 10/24/2022]
Abstract
The availability of genome sequences from a variety of organisms presents an opportunity to apply this sequence information to solving the key problems of molecular biology. One of the principal roadblocks on this path is the lack of appropriate descriptors and metrics that could succinctly represent the new knowledge stemming from the genomic data. Several new metrics have recently been used in comparative genome analysis, yet challenges remain in finding an appropriate language for the emerging discipline of systems biology.
Affiliation(s)
- Michael Y Galperin
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
  - Corresponding authors: Galperin, Michael Y; Kolker, Eugene
- Eugene Kolker
  - The BIATECH Institute, 19310 North Creek Pkwy, Suite 115, Bothell, WA 98011, USA
47
Philippi S, Köhler J. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 2006; 7:482-8. [PMID: 16682980 DOI: 10.1038/nrg1872] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Indexed: 11/10/2022]
Abstract
A prerequisite to systems biology is the integration of heterogeneous experimental data, which are stored in numerous life-science databases. However, a wide range of obstacles that relate to access, handling and integration impede the efficient use of the contents of these databases. Addressing these issues will not only be essential for progress in systems biology, it will also be crucial for sustaining the more traditional uses of life-science databases.
Affiliation(s)
- Stephan Philippi
  - Department of Computer Science, University of Koblenz, PO Box 201602, 56016 Koblenz, Germany.
48
Notebaart RA, van Enckevort FHJ, Francke C, Siezen RJ, Teusink B. Accelerating the reconstruction of genome-scale metabolic networks. BMC Bioinformatics 2006; 7:296. [PMID: 16772023 PMCID: PMC1550432 DOI: 10.1186/1471-2105-7-296] [Citation(s) in RCA: 122] [Impact Index Per Article: 6.8] [Received: 02/15/2006] [Accepted: 06/13/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The genomic information of a species allows for the genome-scale reconstruction of its metabolic capacity. Such a metabolic reconstruction gives support to metabolic engineering, but also to integrative bioinformatics and visualization. Sequence-based automatic reconstructions require extensive manual curation, which can be very time-consuming. Therefore, we present a method to accelerate the time-consuming process of network reconstruction for a query species. The method exploits the availability of well-curated metabolic networks and uses high-resolution predictions of gene equivalency between species, allowing the transfer of gene-reaction associations from curated networks. RESULTS: We have evaluated the method using Lactococcus lactis IL1403, for which a genome-scale metabolic network was published recently. We recovered most of the gene-reaction associations (i.e. 74–85%) which are incorporated in the published network. Moreover, we predicted over 200 additional genes to be associated to reactions, including genes with unknown function, genes for transporters and genes with specific metabolic reactions, which are good candidates for an extension to the previously published network. In a comparison of our developed method with the well-established approach Pathologic, we predicted 186 additional genes to be associated to reactions. We also predicted a relatively high number of complete conserved protein complexes, which are derived from curated metabolic networks, illustrating the potential predictive power of our method for protein complexes. CONCLUSION: We show that our methodology can be applied to accelerate the reconstruction of genome-scale metabolic networks by taking optimal advantage of existing, manually curated networks. As orthology detection is the first step in the method, only the translated open reading frames (ORFs) of a newly sequenced genome are necessary to reconstruct a metabolic network. When more manually curated metabolic networks will become available in the near future, the usefulness of our method in network prediction is likely to increase.
Affiliation(s)
- Richard A Notebaart
  - Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen, P.O.Box 9010, 6500GL Nijmegen, The Netherlands
- Frank HJ van Enckevort
  - Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen, P.O.Box 9010, 6500GL Nijmegen, The Netherlands
  - NIZO food research BV, P.O.Box 20, 6710BA, Ede, The Netherlands
  - Present address: Friesland Foods Corporate Research, Deventer, The Netherlands
- Christof Francke
  - Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen, P.O.Box 9010, 6500GL Nijmegen, The Netherlands
  - Wageningen Center for Food Sciences, P.O.Box 557, 6700AN Wageningen, The Netherlands
- Roland J Siezen
  - Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen, P.O.Box 9010, 6500GL Nijmegen, The Netherlands
  - NIZO food research BV, P.O.Box 20, 6710BA, Ede, The Netherlands
  - Wageningen Center for Food Sciences, P.O.Box 557, 6700AN Wageningen, The Netherlands
- Bas Teusink
  - Center for Molecular and Biomolecular Informatics, Radboud University Nijmegen, P.O.Box 9010, 6500GL Nijmegen, The Netherlands
  - NIZO food research BV, P.O.Box 20, 6710BA, Ede, The Netherlands
  - Wageningen Center for Food Sciences, P.O.Box 557, 6700AN Wageningen, The Netherlands
49
von Grotthuss M, Plewczynski D, Ginalski K, Rychlewski L, Shakhnovich EI. PDB-UF: database of predicted enzymatic functions for unannotated protein structures from structural genomics. BMC Bioinformatics 2006; 7:53. [PMID: 16460560 PMCID: PMC1409798 DOI: 10.1186/1471-2105-7-53] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.3] [Received: 08/06/2005] [Accepted: 02/06/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The number of protein structures from structural genomics centers dramatically increases in the Protein Data Bank (PDB). Many of these structures are functionally unannotated because they have no sequence similarity to proteins of known function. However, it is possible to successfully infer function using only structural similarity. RESULTS: Here we present the PDB-UF database, a web-accessible collection of predictions of enzymatic properties using structure-function relationship. The assignments were conducted for three-dimensional protein structures of unknown function that come from structural genomics initiatives. We show that 4 hypothetical proteins (with PDB accession codes: 1VH0, 1NS5, 1O6D, and 1TO0), for which standard BLAST tools such as PSI-BLAST or RPS-BLAST failed to assign any function, are probably methyltransferase enzymes. CONCLUSION: We suggest that the structure-based prediction of an EC number should be conducted having the different similarity score cutoff for different protein folds. Moreover, performing the annotation using two different algorithms can reduce the rate of false positive assignments. We believe, that the presented web-based repository will help to decrease the number of protein structures that have functions marked as "unknown" in the PDB file. AVAILABILITY: http://paradox.harvard.edu/PDB-UF and http://bioinfo.pl/PDB-UF.
Affiliation(s)
- Marcin von Grotthuss
  - Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, USA
- Leszek Rychlewski
  - BioInfoBank Institute, ul. Limanowskiego 24A, 60-744 Poznan, Poland
  - Bioinformatics Unit, Department of Physics, Adam Mickiewicz University, ul. Umultowska 85, 61 614 Poznan, Poland
- Eugene I Shakhnovich
  - Department of Chemistry and Chemical Biology, Harvard University, 12 Oxford Street, Cambridge, Massachusetts 02138, USA
50
Vallenet D, Labarre L, Rouy Z, Barbe V, Bocs S, Cruveiller S, Lajus A, Pascal G, Scarpelli C, Médigue C. MaGe: a microbial genome annotation system supported by synteny results. Nucleic Acids Res 2006; 34:53-65. [PMID: 16407324 PMCID: PMC1326237 DOI: 10.1093/nar/gkj406] [Citation(s) in RCA: 323] [Impact Index Per Article: 17.9] [Indexed: 11/14/2022] Open
Abstract
Magnifying Genomes (MaGe) is a microbial genome annotation system based on a relational database containing information on bacterial genomes, as well as a web interface to achieve genome annotation projects. Our system allows one to initiate the annotation of a genome at the early stage of the finishing phase. MaGe's main features are (i) integration of annotation data from bacterial genomes enhanced by a gene coding re-annotation process using accurate gene models, (ii) integration of results obtained with a wide range of bioinformatics methods, among which exploration of gene context by searching for conserved synteny and reconstruction of metabolic pathways, (iii) an advanced web interface allowing multiple users to refine the automatic assignment of gene product functions. MaGe is also linked to numerous well-known biological databases and systems. Our system has been thoroughly tested during the annotation of complete bacterial genomes (Acinetobacter baylyi ADP1, Pseudoalteromonas haloplanktis, Frankia alni) and is currently used in the context of several new microbial genome annotation projects. In addition, MaGe allows for annotation curation and exploration of already published genomes from various genera (e.g. Yersinia, Bacillus and Neisseria). MaGe can be accessed at .
Affiliation(s)
- David Vallenet
  - Atelier de Génomique Comparative, CNRS-UMR8030, 2 rue Gaston Crémieux, 91057 Evry, Cedex, France.