1
Price MN, Arkin AP. Interactive tools for functional annotation of bacterial genomes. Database (Oxford) 2024; 2024:baae089. [PMID: 39241109 PMCID: PMC11378808 DOI: 10.1093/database/baae089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 07/29/2024] [Accepted: 08/09/2024] [Indexed: 09/08/2024]
Abstract
Automated annotations of protein functions are error-prone because of our lack of knowledge of protein functions. For example, it is often impossible to predict the correct substrate for an enzyme or a transporter. Furthermore, much of the knowledge that we do have about the functions of proteins is missing from the underlying databases. We discuss how to use interactive tools to quickly find different kinds of information relevant to a protein's function. Many of these tools are available via PaperBLAST (http://papers.genomics.lbl.gov). Combining these tools often allows us to infer a protein's function. Ideally, accurate annotations would allow us to predict a bacterium's capabilities from its genome sequence, but in practice, this remains challenging. We describe interactive tools that infer potential capabilities from a genome sequence or that search a genome to find proteins that might perform a specific function of interest. Database URL: http://papers.genomics.lbl.gov.
Affiliation(s)
- Morgan N Price
  - Environmental Genomics & Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720, United States
- Adam P Arkin
  - Environmental Genomics & Systems Biology, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd, Berkeley, CA 94720, United States
2
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
3
Hou S, Kang Z, Liu Y, Lü C, Wang X, Wang Q, Ma C, Xu P, Gao C. An enzymic l-2-hydroxyglutarate biosensor based on l-2-hydroxyglutarate dehydrogenase from Azoarcus olearius. Biosens Bioelectron 2024; 243:115740. [PMID: 37862756 DOI: 10.1016/j.bios.2023.115740] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 09/21/2023] [Accepted: 10/03/2023] [Indexed: 10/22/2023]
Abstract
l-2-Hydroxyglutarate (l-2-HG) is a critical signaling and immune metabolite, but its excessive accumulation can lead to l-2-hydroxyglutaric aciduria, renal cancer, and other diseases. Development of efficient and high-throughput methods for selective l-2-HG detection is urgently required. In this study, l-2-HG dehydrogenase in Azoarcus olearius BH72 (AoL2HGDH) was screened from ten homologs and identified as an enzyme with high specificity and activity toward l-2-HG dehydrogenation. Then, an enzymatic assay-based l-2-HG-sensing fluorescent reporter, EaLHGFR, which consists of AoL2HGDH and resazurin, was developed for the detection of l-2-HG. The response magnitude and limit of detection of EaLHGFR were systematically optimized using a single-factor screening strategy. The optimal biosensor, EaLHGFR-2, exhibited a response magnitude of 2189.25 ± 26.89% and a limit of detection of 0.042 μM. It can accurately detect the concentration of l-2-HG in bacterial and cellular samples as well as human body fluids. Considering its desirable properties, EaLHGFR-2 may be a promising alternative for quantitation of l-2-HG in biological samples.
Affiliation(s)
- Shuang Hou
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Zhaoqi Kang
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Yidong Liu
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Chuanjuan Lü
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Xia Wang
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Qian Wang
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Cuiqing Ma
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
- Ping Xu
  - State Key Laboratory of Microbial Metabolism, Shanghai Jiao Tong University, People's Republic of China
- Chao Gao
  - State Key Laboratory of Microbial Technology, Shandong University, People's Republic of China
4
de Oliveira SG, Kotowski N, Sampaio-Filho HR, Aguiar FHB, Dávila AMR, Jardim R. Metalloproteinases in Restorative Dentistry: An In Silico Study toward an Ideal Animal Model. Biomedicines 2023; 11:3042. [PMID: 38002041 PMCID: PMC10669239 DOI: 10.3390/biomedicines11113042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2023] [Revised: 09/02/2023] [Accepted: 09/13/2023] [Indexed: 11/26/2023] Open
Abstract
In dentistry, various animal models are used to evaluate adhesive systems, dental caries and periodontal diseases. Metalloproteinases (MMPs) are enzymes that degrade collagen in the dentin matrix and are categorized in over 20 different classes. Collagenases and gelatinases are intrinsic constituents of the human dentin organic matrix fibrillar network and are the most abundant MMPs in this tissue. Understanding the action of these enzymes on dentin is important in the development of approaches that could reduce dentin degradation and provide restorative procedures with extended longevity. This in silico study is based on dentistry's most used animal models and searches for the most suitable ones, that is, those evolutionarily closest to Homo sapiens. We were able to retrieve 176,077 mammalian MMP sequences from the UniProt database. These sequences were manually curated through a three-step process. After curation, the remaining 3178 sequences were aligned in a multifasta file and phylogenetically reconstructed using the maximum likelihood method. Our study inferred that the animal models most evolutionarily related to Homo sapiens were Oryctolagus cuniculus (MMP-1 and MMP-8), Canis lupus (MMP-13), Rattus norvegicus (MMP-2) and Oryctolagus cuniculus (MMP-9). Further research will be needed for the biological validation of our findings.
Affiliation(s)
- Simone Gomes de Oliveira
  - Piracicaba School of Dentistry, Campinas State University, Piracicaba 13414-903, SP, Brazil
  - School of Dentistry, State University of Rio de Janeiro, Rio de Janeiro 20551-030, RJ, Brazil
- Nelson Kotowski
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil
- Alberto Martín Rivera Dávila
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil
- Rodrigo Jardim
  - Computational and Systems Biology Laboratory, Oswaldo Cruz Institute, Oswaldo Cruz Foundation, Rio de Janeiro 21040-900, RJ, Brazil
5
Vezina B, Watts SC, Hawkey J, Cooper HB, Judd LM, Jenney AWJ, Monk JM, Holt KE, Wyres KL. Bactabolize is a tool for high-throughput generation of bacterial strain-specific metabolic models. eLife 2023; 12:RP87406. [PMID: 37815531 PMCID: PMC10564454 DOI: 10.7554/elife.87406] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 10/11/2023] Open
Abstract
Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale metabolic models are useful tools for studying the metabolic potential of individuals, and with the rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for comparative analysis. However, there exist few tools to construct strain-specific metabolic models at scale. Here, we describe Bactabolize, a reference-based tool which rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae, and a quality control framework for using draft genome assemblies as input for Bactabolize. The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 performed comparably to or better than the currently available automated approaches CarveMe and gapseq across 507 substrate and 2317 knockout mutant growth predictions. Novel draft genomes passing our systematically defined quality control criteria resulted in models with a high degree of completeness (≥99% genes and reactions captured compared to models derived from matched complete genomes) and high accuracy (mean 0.97, n=10). We anticipate the tools and framework described herein will facilitate large-scale metabolic modelling analyses that broaden our understanding of diversity within bacterial species and inform novel control strategies for priority pathogens.
Affiliation(s)
- Ben Vezina
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Stephen C Watts
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Jane Hawkey
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Helena B Cooper
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Louise M Judd
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
- Jonathan M Monk
  - Department of Bioengineering, University of California, San Diego, San Diego, United States
- Kathryn E Holt
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
  - Department of Infection Biology, London School of Hygiene & Tropical Medicine, London, United Kingdom
- Kelly L Wyres
  - Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Australia
6
Davidson RB, Coletti M, Gao M, Piatkowski B, Sreedasyam A, Quadir F, Weston DJ, Schmutz J, Cheng J, Skolnick J, Parks JM, Sedova A. Predicted structural proteome of Sphagnum divinum and proteome-scale annotation. Bioinformatics 2023; 39:btad511. [PMID: 37589594 PMCID: PMC10463551 DOI: 10.1093/bioinformatics/btad511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 08/01/2023] [Accepted: 08/16/2023] [Indexed: 08/18/2023] Open
Abstract
MOTIVATION: Sphagnum-dominated peatlands store a substantial amount of terrestrial carbon. The genus is undersampled and under-studied. No experimental crystal structure from any Sphagnum species exists in the Protein Data Bank and fewer than 200 Sphagnum-related genes have structural models available in the AlphaFold Protein Structure Database. Tools and resources are needed to help bridge these gaps, and to enable the analysis of other structural proteomes now made possible by accurate structure prediction. RESULTS: We present the predicted structural proteome (25 134 primary transcripts) of Sphagnum divinum computed using AlphaFold, structural alignment results of all high-confidence models against an annotated nonredundant crystallographic database of over 90,000 structures, a structure-based classification of putative Enzyme Commission (EC) numbers across this proteome, and the computational method to perform this proteome-scale structure-based annotation. AVAILABILITY AND IMPLEMENTATION: All data and code are available in public repositories, detailed at https://github.com/BSDExabio/SAFA. The structural models of the S. divinum proteome have been deposited in the ModelArchive repository at https://modelarchive.org/doi/10.5452/ma-ornl-sphdiv.
Affiliation(s)
- Russell B Davidson
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
- Mark Coletti
  - Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
- Mu Gao
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Bryan Piatkowski
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
- Avinash Sreedasyam
  - Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, United States
- Farhan Quadir
  - Electrical Engineering and Computer Science, University of Missouri, Columbia, MS 65211, United States
- David J Weston
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
- Jeremy Schmutz
  - Genome Sequencing Center, HudsonAlpha Institute for Biotechnology, Huntsville, AL 35806, United States
  - Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Jianlin Cheng
  - Electrical Engineering and Computer Science, University of Missouri, Columbia, MS 65211, United States
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Jerry M Parks
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
- Ada Sedova
  - Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, United States
7
Oberg N, Zallot R, Gerlt JA. EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023; 435:168018. [PMID: 37356897 PMCID: PMC10291204 DOI: 10.1016/j.jmb.2023.168018] [Citation(s) in RCA: 79] [Impact Index Per Article: 79.0] [Received: 11/23/2022] [Revised: 02/04/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
The Enzyme Function Initiative (EFI) provides a web resource with "genomic enzymology" web tools to leverage the protein (UniProt) and genome (European Nucleotide Archive; ENA; https://www.ebi.ac.uk/ena/) databases to assist the assignment of in vitro enzymatic activities and in vivo metabolic functions to uncharacterized enzymes (https://efi.igb.illinois.edu/). The tools enable (1) exploration of sequence-function space in enzyme families using sequence similarity networks (SSNs; EFI-EST), (2) easy access to genome context for bacterial, archaeal, and fungal proteins in the SSN clusters so that isofunctional families can be identified and their functions inferred from genome context (EFI-GNT); and (3) determination of the abundance of SSN clusters in NIH Human Metagenome Project metagenomes using chemically guided functional profiling (EFI-CGFP). We describe enhancements that enable SSNs to be generated from taxonomy categories, allowing higher resolution analyses of sequence-function space; we provide examples of the generation of taxonomy category-specific SSNs.
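To make the idea of a sequence similarity network concrete, the following is a minimal Python sketch, not the EFI-EST implementation. It assumes that pairwise similarity scores are already available (the `toy_scores` values and the `ssn_clusters` helper are hypothetical), treats each sequence as a node, joins two nodes when their score meets a chosen threshold, and reports the connected components as candidate clusters.

```python
from collections import defaultdict


def ssn_clusters(scores, threshold):
    """Cluster sequences in a similarity network.

    `scores` maps (seq_a, seq_b) pairs to a pairwise similarity score.
    Two sequences are connected when their score is >= `threshold`;
    clusters are the connected components of the resulting graph.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        ra, rb = find(a), find(b)  # registers both nodes, even without an edge
        if score >= threshold and ra != rb:
            parent[ra] = rb

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    # Return clusters as sorted lists, ordered by their first member.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical scores, for illustration only.
toy_scores = {
    ("A", "B"): 80,
    ("B", "C"): 75,
    ("C", "D"): 20,
    ("D", "E"): 90,
}

print(ssn_clusters(toy_scores, threshold=50))
print(ssn_clusters(toy_scores, threshold=85))
```

Raising the threshold splits the network into smaller clusters, which is the knob a user would turn when trying to separate putatively isofunctional families.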
Affiliation(s)
- Nils Oberg
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States
- Rémi Zallot
  - Department of Chemistry, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
- John A Gerlt
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States; Department of Biochemistry, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States; Department of Chemistry, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States
8
Kroll A, Ranjan S, Engqvist MKM, Lercher MJ. A general model to predict small molecule substrates of enzymes based on machine and deep learning. Nat Commun 2023; 14:2787. [PMID: 37188731 DOI: 10.1038/s41467-023-38347-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Received: 06/24/2022] [Accepted: 04/21/2023] [Indexed: 05/17/2023] Open
Abstract
For most proteins annotated as enzymes, it is unknown which primary and/or secondary reactions they catalyze. Experimental characterizations of potential substrates are time-consuming and costly. Machine learning predictions could provide an efficient alternative, but are hampered by a lack of information regarding enzyme non-substrates, as available training data comprises mainly positive examples. Here, we present ESP, a general machine-learning model for the prediction of enzyme-substrate pairs with an accuracy of over 91% on independent and diverse test data. ESP can be applied successfully across widely different enzymes and a broad range of metabolites included in the training data, outperforming models designed for individual, well-studied enzyme families. ESP represents enzymes through a modified transformer model, and is trained on data augmented with randomly sampled small molecules assigned as non-substrates. By facilitating easy in silico testing of potential substrates, the ESP web server may support both basic and applied science.
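The data augmentation step mentioned in this abstract can be illustrated with a short Python sketch. It is not the ESP code: the function name `augment_with_negatives`, the toy enzyme and molecule names, and the uniform sampling are all assumptions made for illustration. For each known enzyme-substrate pair it draws molecules that are not known substrates of that enzyme and labels them as non-substrates.

```python
import random


def augment_with_negatives(positives, molecule_pool, per_positive=1, seed=0):
    """Add randomly sampled presumed non-substrates to positive pairs.

    `positives` is a list of (enzyme, molecule) pairs known to be substrates.
    Returns (enzyme, molecule, label) triples with label 1 for positives and
    0 for sampled negatives. A sampled negative is only *assumed* to be a
    non-substrate; the same negative may be drawn more than once for an
    enzyme that has several positives.
    """
    rng = random.Random(seed)

    # Collect the known substrates of each enzyme so they are never sampled.
    known = {}
    for enzyme, molecule in positives:
        known.setdefault(enzyme, set()).add(molecule)

    labeled = [(enzyme, molecule, 1) for enzyme, molecule in positives]
    for enzyme, _ in positives:
        candidates = sorted(set(molecule_pool) - known[enzyme])
        k = min(per_positive, len(candidates))
        for molecule in rng.sample(candidates, k):
            labeled.append((enzyme, molecule, 0))
    return labeled


# Hypothetical toy data, for illustration only.
toy_positives = [("E1", "glucose"), ("E1", "fructose"), ("E2", "lactate")]
toy_pool = ["glucose", "fructose", "lactate", "pyruvate", "citrate"]

for row in augment_with_negatives(toy_positives, toy_pool, per_positive=2):
    print(row)
```

Fixing the seed keeps the sampled negatives reproducible, which helps when comparing models trained on the same augmented set.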
Affiliation(s)
- Alexander Kroll
  - Institute for Computer Science and Department of Biology, Heinrich Heine University, D-40225, Düsseldorf, Germany
- Sahasra Ranjan
  - Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Powai, Mumbai, 400076, India
- Martin K M Engqvist
  - Department of Biology and Bioengineering, Chalmers University of Technology, SE-412 96, Gothenburg, Sweden
  - EnginZyme AB, Tomtebodevägen 6, 17165, Stockholm, Sweden
- Martin J Lercher
  - Institute for Computer Science and Department of Biology, Heinrich Heine University, D-40225, Düsseldorf, Germany
9
Vasina M, Kovar D, Damborsky J, Ding Y, Yang T, deMello A, Mazurenko S, Stavrakis S, Prokop Z. In-depth analysis of biocatalysts by microfluidics: An emerging source of data for machine learning. Biotechnol Adv 2023; 66:108171. [PMID: 37150331 DOI: 10.1016/j.biotechadv.2023.108171] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 01/24/2023] [Revised: 05/04/2023] [Accepted: 05/04/2023] [Indexed: 05/09/2023]
Abstract
Nowadays, the vastly increasing demand for novel biotechnological products is supported by the continuous development of biocatalytic applications which provide sustainable green alternatives to chemical processes. The success of a biocatalytic application is critically dependent on how quickly we can identify and characterize enzyme variants fitting the conditions of industrial processes. While miniaturization and parallelization have dramatically increased the throughput of next-generation sequencing systems, the subsequent characterization of the obtained candidates is still a limiting process in identifying the desired biocatalysts. Only a few commercial microfluidic systems for enzyme analysis are currently available, and the transformation of numerous published prototypes into commercial platforms is still to be streamlined. This review presents the state-of-the-art, recent trends, and perspectives in applying microfluidic tools in the functional and structural analysis of biocatalysts. We discuss the advantages and disadvantages of available technologies, their reproducibility and robustness, and readiness for routine laboratory use. We also highlight the unexplored potential of microfluidics to leverage the power of machine learning for biocatalyst development.
Affiliation(s)
- Michal Vasina
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
- David Kovar
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
- Yun Ding
  - Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland
- Tianjin Yang
  - Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland; Department of Biochemistry, University of Zurich, 8057 Zurich, Switzerland
- Andrew deMello
  - Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
- Stavros Stavrakis
  - Institute for Chemical and Bioengineering, ETH Zürich, 8093 Zürich, Switzerland
- Zbynek Prokop
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, 602 00 Brno, Czech Republic; International Clinical Research Centre, St. Anne's University Hospital, 656 91 Brno, Czech Republic
10
Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Front Bioinform 2023; 3:1178926. [PMID: 37151482 PMCID: PMC10158824 DOI: 10.3389/fbinf.2023.1178926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 04/05/2023] [Indexed: 05/09/2023] Open
Abstract
Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins. Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
11
Yokoi Y, Kawabuchi Y, Zulmajdi AA, Tanaka R, Shibata T, Muraoka T, Mori T. Cell-Penetrating Peptide-Peptide Nucleic Acid Conjugates as a Tool for Protein Functional Elucidation in the Native Bacterium. Molecules 2022; 27:8944. [PMID: 36558072 PMCID: PMC9788395 DOI: 10.3390/molecules27248944] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/03/2022] [Revised: 12/12/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022]
Abstract
Approximately 30% or more of the proteins annotated from sequenced bacterial genomes are labeled as hypothetical or uncharacterized. However, elucidation of the function of these proteins is hindered by the lack of simple and rapid screening methods, particularly with novel or hard-to-transform bacteria. In this report, we employed cell-penetrating peptide (CPP)-peptide nucleic acid (PNA) conjugates to elucidate the function of such uncharacterized proteins in vivo within the native bacterium. Paenibacillus, a hard-to-transform bacterial genus, was used as a model. Two hypothetical genes showing amino acid sequence similarity to ι-carrageenases, termed cgiA and cgiB, were identified from the draft genome of Paenibacillus sp. strain YYML68, and CPP-PNA probes targeting the mRNA of the acyl carrier protein gene, acpP, and the two ι-carrageenase candidate genes were synthesized. Upon direct incubation with CPP-PNA targeting the mRNA of the acpP gene, we successfully observed growth inhibition of strain YYML68 in a concentration-dependent manner. Similarly, the functions of both candidate ι-carrageenases were inhibited using our CPP-PNA probes, allowing for the confirmation and characterization of these hypothetical proteins. In summary, we believe that CPP-PNA conjugates can serve as a simple and efficient alternative approach to characterize proteins in the native bacterium.
Affiliation(s)
- Yasuhito Yokoi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Yugo Kawabuchi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Abdullah Adham Zulmajdi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Reiji Tanaka
  - Department of Life Sciences, Graduate School of Bioresources, Mie University, 1577 Kurima-machiya-cho, Tsu-shi 514-8507, Mie, Japan
- Toshiyuki Shibata
  - Department of Life Sciences, Graduate School of Bioresources, Mie University, 1577 Kurima-machiya-cho, Tsu-shi 514-8507, Mie, Japan
- Takahiro Muraoka
  - Department of Applied Chemistry, Graduate School of Engineering, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Tetsushi Mori
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
12
Tsvik L, Steiner B, Herzog P, Haltrich D, Sützl L. Flavin Mononucleotide-Dependent l-Lactate Dehydrogenases: Expanding the Toolbox of Enzymes for l-Lactate Biosensors. ACS Omega 2022; 7:41480-41492. [PMID: 36406534 PMCID: PMC9670274 DOI: 10.1021/acsomega.2c05257] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/16/2022] [Accepted: 10/19/2022] [Indexed: 06/16/2023]
Abstract
The development of L-lactate biosensors has been hampered in recent years by limited availability and knowledge of a wider, more diverse range of L-lactate-oxidizing enzymes that can be used as bioelements in these sensors. For decades, L-lactate oxidase of Aerococcus viridans (AvLOx) has been used almost exclusively in the field of L-lactate biosensor development and has achieved something like a monopoly status as a biocatalyst for these applications. Studies on other L-lactate-oxidizing enzymes are sparse and are often missing biochemical data. In this work, we made use of the vast amount of sequence information that is currently available in protein databases to investigate the naturally occurring diversity of L-lactate-utilizing enzymes of the flavin mononucleotide (FMN)-dependent α-hydroxy acid oxidoreductase (HAOx) family. We identified the HAOx sequence space specific for L-lactate oxidation and additionally discovered a not-yet-described class of soluble and FMN-dependent L-lactate dehydrogenases, which are promising for the construction of second-generation biosensors or other biotechnological applications. Our work paves the way for new studies on α-hydroxy acid biosensors and proves that there is more to the HAOx family than AvLOx.
Collapse
Affiliation(s)
- Lidiia Tsvik
- Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
- Beate Steiner
- DirectSens Biosensors GmbH, Am Rosenbühel 38, 3400 Klosterneuburg, Austria
- Peter Herzog
- DirectSens Biosensors GmbH, Am Rosenbühel 38, 3400 Klosterneuburg, Austria
- Dietmar Haltrich
- Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
- Leander Sützl
- Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
|
13
|
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
- Corresponding author. Benjamin Goudey, School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010,
- Nicholas Geard
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Karin Verspoor
- School of Computing Technologies, RMIT University Melbourne, Victoria, 3000
- Justin Zobel
- School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
|
14
|
Escudeiro P, Henry CS, Dias RP. Functional characterization of prokaryotic dark matter: the road so far and what lies ahead. Current Research in Microbial Sciences 2022; 3:100159. [PMID: 36561390 PMCID: PMC9764257 DOI: 10.1016/j.crmicr.2022.100159] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/11/2022] [Revised: 07/18/2022] [Accepted: 08/05/2022] [Indexed: 12/25/2022] Open
Abstract
Eight-hundred thousand to one trillion prokaryotic species may inhabit our planet. Yet, fewer than two-hundred thousand prokaryotic species have been described. This uncharted fraction of microbial diversity, and its undisclosed coding potential, is known as the "microbial dark matter" (MDM). Next-generation sequencing has made it possible to collect a massive amount of genome sequence data, leading to unprecedented advances in the field of genomics. Still, harnessing new functional information from the genomes of uncultured prokaryotes is often limited by standard classification methods. These methods often rely on sequence similarity searches against reference genomes from cultured species. This hinders the discovery of unique genetic elements that are missing from the cultivated realm. It also contributes to the accumulation of prokaryotic gene products of unknown function in public sequence data repositories, highlighting the need for new approaches to sequencing data analysis and classification. Increasing evidence indicates that these proteins of unknown function might be a treasure trove of biotechnological potential. Here, we outline the challenges, opportunities, and the potential hidden within the functional dark matter (FDM) of prokaryotes. We also discuss the pitfalls surrounding molecular and computational approaches currently used to probe these uncharted waters, and discuss future opportunities for research and applications.
Affiliation(s)
- Pedro Escudeiro
- BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Christopher S. Henry
- Argonne National Laboratory, Lemont, Illinois, USA
- University of Chicago, Chicago, Illinois, USA
- Ricardo P.M. Dias
- BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- iXLab - Innovation for National Biological Resilience, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
|
15
|
|
16
|
Ilgisonis EV, Pogodin PV, Kiseleva OI, Tarbeeva SN, Ponomarenko EA. Evolution of Protein Functional Annotation: Text Mining Study. J Pers Med 2022; 12:jpm12030479. [PMID: 35330478 PMCID: PMC8952229 DOI: 10.3390/jpm12030479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Revised: 03/07/2022] [Accepted: 03/08/2022] [Indexed: 11/23/2022] Open
Abstract
Within the Human Proteome Project (HPP) initiative framework for creating functional annotations of uPE1 proteins, the neXt-CP50 Challenge was launched in 2018. By analogy with the missing-protein challenge, each team deciphers the functional features of the proteins in a chromosome-centric mode. However, the neXt-CP50 Challenge is more complicated than the missing-protein challenge: the approaches and methods for solving the problem are clear, but neither the concept of protein function nor specific experimental and/or bioinformatics protocols have been standardized to address it. We proposed using a retrospective analysis of the key HPP repository, the neXtProt database, to identify the most frequently used experimental and bioinformatic methods for analyzing protein functions, and the dynamics of accumulation of functional annotations. It has been shown that the number of proteins with known functions is growing faster than the experimental confirmation of the existence of questionable proteins within the framework of the missing-protein challenge. At the same time, functional annotation is based on the guilt-by-association postulate, according to which, based on large-scale experiments on API-MS and Y2H, proteins with unknown functions are most likely mapped through "handshakes" to biochemical processes.
|
17
|
Investigation and Alteration of Organic Acid Synthesis Pathways in the Mammalian Gut Symbiont Bacteroides thetaiotaomicron. Microbiol Spectr 2022; 10:e0231221. [PMID: 35196806 PMCID: PMC8865466 DOI: 10.1128/spectrum.02312-21] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 11/24/2022] Open
Abstract
Members of the gut-dwelling Bacteroides genus have remarkable abilities in degrading a diverse set of fiber polysaccharide structures, most of which are found in the mammalian diet. As part of their metabolism, they convert these fibers to organic acids that can in turn provide energy to their host. While many studies have identified and characterized the genes and corresponding proteins involved in polysaccharide degradation, relatively little is known about Bacteroides genes involved in downstream metabolic pathways. Bacteroides thetaiotaomicron is one of the most studied species from the genus and is representative of this group in producing multiple organic acids as part of its metabolism. We focused here on several organic acid synthesis pathways in B. thetaiotaomicron, including those involved in formate, lactate, propionate, and acetate production. We identified potential genes involved in each pathway and characterized these through gene deletions coupled to growth assays and organic acid quantification. In addition, we developed and employed a Golden Gate-compatible plasmid system to simplify alteration of native gene expression levels. Our work both validates and contradicts previous bioinformatic gene annotations, and we develop a model on which to base future efforts. A clearer understanding of Bacteroides metabolic pathways can inform and facilitate efforts to employ these bacteria for improved human health or other utilization strategies.
IMPORTANCE Both humans and animals host a large community of bacteria and other microorganisms in their gastrointestinal tracts. This community breaks down dietary fiber and produces organic acids that are used as an energy source by the body and can also help the host resist infection by various pathogens. While the Bacteroides genus is one of the most common in the gut microbiota, it is only distantly related to bacteria with well-characterized metabolic pathways and it is therefore unclear whether research insights on organic acid production in those species can also be directly applied to the Bacteroides. By investigating multiple genetic pathways for organic acid production in Bacteroides thetaiotaomicron, we provide a basis for deeper understanding of these pathways. The work further enables greater understanding of Bacteroides–host relationships, as well as inter-species relationships in the microbiota, which are of importance for both human and animal gut health.
|
18
|
Rembeza E, Boverio A, Fraaije MW, Engqvist MKM. Discovery of Two Novel Oxidases Using a High-Throughput Activity Screen. Chembiochem 2022; 23:e202100510. [PMID: 34709726 PMCID: PMC9299179 DOI: 10.1002/cbic.202100510] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/24/2021] [Revised: 10/27/2021] [Indexed: 12/17/2022]
Abstract
Discovery of novel enzymes is a challenging task, yet a crucial one, due to their increasing relevance as chemical catalysts and biotechnological tools. In our work we present a high-throughput screening approach to discovering novel activities. A screen of 96 putative oxidases with 23 substrates led to the discovery of two new enzymes. The first enzyme, N-acetyl-D-hexosamine oxidase (EC 1.1.3.29) from Ralstonia solanacearum, is a vanillyl alcohol oxidase-like flavoprotein displaying the highest activity with N-acetylglucosamine and N-acetylgalactosamine. Before our discovery of the enzyme, its activity was an orphan one: experimentally characterized but lacking a link to an amino acid sequence. The second enzyme, from an uncultured marine euryarchaeote, is a long-chain alcohol oxidase (LCAO, EC 1.1.3.20) active with a range of fatty alcohols, with 1-dodecanol being the preferred substrate. The enzyme displays no sequence similarity to previously characterised LCAOs, and thus is a completely novel representative of a protein with such activity.
Affiliation(s)
- Elzbieta Rembeza
- Department of Biology and Biological Engineering, Chalmers University of Technology, 412 96 Gothenburg, Sweden
- Alessandro Boverio
- Molecular Enzymology Group, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands
- Marco W. Fraaije
- Molecular Enzymology Group, University of Groningen, Nijenborgh 4, 9747 AG Groningen, The Netherlands
- Martin K. M. Engqvist
- Department of Biology and Biological Engineering, Chalmers University of Technology, 412 96 Gothenburg, Sweden
|