1
Mi Y, Marcu SB, Tabirca S, Yallapragada VV. PS-GO parametric protein search engine. Comput Struct Biotechnol J 2024; 23:1499-1509. [PMID: 38633387 PMCID: PMC11021831 DOI: 10.1016/j.csbj.2024.04.003] [Received: 01/11/2024] [Revised: 04/01/2024] [Accepted: 04/01/2024] [Indexed: 04/19/2024] Open
Abstract
With the explosive growth of protein-related data, we are confronted with a critical scientific inquiry: How can we effectively retrieve, compare, and profoundly comprehend these protein structures to maximize the utilization of such data resources? PS-GO, a parametric protein search engine, has been specifically designed and developed to maximize the utilization of the rapidly growing volume of protein-related data. This innovative tool addresses the critical need for effective retrieval, comparison, and deep understanding of protein structures. By integrating computational biology, bioinformatics, and data science, PS-GO is capable of managing large-scale data and accurately predicting and comparing protein structures and functions. The engine is built upon the concept of parametric protein design, a computer-aided method that adjusts and optimizes protein structures and sequences to achieve desired biological functions and structural stability. PS-GO utilizes key parameters such as amino acid sequence, side chain angle, and solvent accessibility, which have a significant influence on protein structure and function. Additionally, PS-GO leverages computable parameters, derived computationally, which are crucial for understanding and predicting protein behavior. The development of PS-GO underscores the potential of parametric protein design in a variety of applications, including enhancing enzyme activity, improving antibody affinity, and designing novel functional proteins. This advancement not only provides a robust theoretical foundation for the field of protein engineering and biotechnology but also offers practical guidelines for future progress in this domain.
Affiliation(s)
- Yanlin Mi
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- SFI Centre for Research Training in Artificial Intelligence, University College Cork, Cork, Ireland
- Stefan-Bogdan Marcu
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- Sabin Tabirca
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- Faculty of Mathematics and Informatics, Transylvania University of Brasov, Brasov, Romania
- Venkata V.B. Yallapragada
- Centre for Advanced Photonics and Process Analytics, Munster Technological University, Cork, Ireland
2
Hu L, Li X, Li C, Wang L, Han L, Ni W, Zhou P, Hu S. Characterization of a novel multifunctional glycoside hydrolase family in the metagenome-assembled genomes of horse gut. Gene 2024; 927:148758. [PMID: 38977109 DOI: 10.1016/j.gene.2024.148758] [Received: 02/19/2024] [Revised: 05/29/2024] [Accepted: 07/05/2024] [Indexed: 07/10/2024]
Abstract
The gut microbiota is a treasure trove of carbohydrate-active enzymes (CAZymes). To explore novel and efficient CAZymes, we analyzed the 4,142 metagenome-assembled genomes (MAGs) of the horse gut microbiota and found that the MAG117.bin13 genome (Bacteroides fragilis) contains the highest number of polysaccharide utilisation loci (PULs), indicating its high capability for carbohydrate degradation. Bioinformatics analysis indicates that the PUL regions of the MAG117.bin13 genome encode many hypothetical proteins, which are important sources for exploring novel CAZymes. Interestingly, we discovered a hypothetical protein (595 amino acids) that exhibits potential CAZyme activity and has low similarity to known CAZymes; we named it BfLac2275. We purified the protein using prokaryotic expression technology and studied its enzymatic function. The hydrolysis experiment on polysaccharide substrates showed that the BfLac2275 protein can degrade α-lactose (156.94 U/mg), maltose (92.59 U/mg), raffinose (86.81 U/mg), and hyaluronic acid (5.71 U/mg). The enzyme activity is optimal at pH 5.0 and 30 ℃, indicating that the hypothetical protein BfLac2275 is a novel and multifunctional CAZyme among the glycoside hydrolases (GHs). These properties indicate that BfLac2275 has broad application prospects in many fields, such as plant polysaccharide decomposition, the food industry, animal feed additives, and enzyme preparations. This study not only serves as a reference for exploring novel CAZymes encoded by gut microbiota but also provides an example for further studying the functional annotation of hypothetical genes in metagenome-assembled genomes.
Affiliation(s)
- Lingling Hu
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China
- Xiaoyue Li
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China
- Cunyuan Li
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China
- Limin Wang
- State Key Laboratory of Sheep Genetic Improvement and Healthy Production, Xinjiang Academy of Agricultural and Reclamation Science, Xinjiang 832003, China
- Lin Han
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China
- Wei Ni
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China.
- Ping Zhou
- State Key Laboratory of Sheep Genetic Improvement and Healthy Production, Xinjiang Academy of Agricultural and Reclamation Science, Xinjiang 832003, China.
- Shengwei Hu
- College of Life Sciences, Shihezi University, Shihezi, Xinjiang 832003, China.
3
Gong X, Zhang J, Gan Q, Teng Y, Hou J, Lyu Y, Liu Z, Wu Z, Dai R, Zou Y, Wang X, Zhu D, Zhu H, Liu T, Yan Y. Advancing microbial production through artificial intelligence-aided biology. Biotechnol Adv 2024; 74:108399. [PMID: 38925317 DOI: 10.1016/j.biotechadv.2024.108399] [Received: 01/03/2024] [Revised: 05/20/2024] [Accepted: 06/23/2024] [Indexed: 06/28/2024]
Abstract
Microbial cell factories (MCFs) have been leveraged to construct sustainable platforms for value-added compound production. To optimize metabolism and reach optimal productivity, synthetic biology has developed various genetic devices to engineer microbial systems by gene editing, high-throughput protein engineering, and dynamic regulation. However, current synthetic biology methodologies still rely heavily on manual design, laborious testing, and exhaustive analysis. The emerging interdisciplinary field of artificial intelligence (AI) and biology has become pivotal in addressing the remaining challenges. AI-aided microbial production harnesses the power of processing, learning, and predicting vast amounts of biological data within seconds, providing outputs with high probability. With well-trained AI models, the conventional Design-Build-Test (DBT) cycle has been transformed into a multidimensional Design-Build-Test-Learn-Predict (DBTLP) workflow, leading to significantly improved operational efficiency and reduced labor consumption. Here, we comprehensively review the main components and recent advances in AI-aided microbial production, focusing on genome annotation, AI-aided protein engineering, artificial functional protein design, and AI-enabled pathway prediction. Finally, we discuss the challenges of integrating novel AI techniques into biology and propose the potential of large language models (LLMs) in advancing microbial production.
Affiliation(s)
- Xinyu Gong
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jianli Zhang
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Qi Gan
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Yuxi Teng
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Jixin Hou
- School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Yanjun Lyu
- Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Zhengliang Liu
- School of Computing, The University of Georgia, Athens, GA 30602, USA
- Zihao Wu
- School of Computing, The University of Georgia, Athens, GA 30602, USA
- Runpeng Dai
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Yusong Zou
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA
- Xianqiao Wang
- School of ECAM, College of Engineering, University of Georgia, Athens, GA 30602, USA
- Dajiang Zhu
- Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington 76019, USA
- Hongtu Zhu
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Tianming Liu
- School of Computing, The University of Georgia, Athens, GA 30602, USA
- Yajun Yan
- School of Chemical, Materials, and Biomedical Engineering, College of Engineering, The University of Georgia, Athens, GA 30602, USA.
4
Pham DT, Tran TD. Drivergene.net: A Cytoscape app for the identification of driver nodes of large-scale complex networks and case studies in discovery of drug target genes. Comput Biol Med 2024; 179:108888. [PMID: 39047507 DOI: 10.1016/j.compbiomed.2024.108888] [Received: 02/23/2024] [Revised: 06/15/2024] [Accepted: 07/11/2024] [Indexed: 07/27/2024]
Abstract
There are no tools to identify driver nodes of large-scale networks under the competition-based controllability approach. This study proposed a novel method for this computation on large-scale networks and implemented it in a new Cytoscape plug-in app called Drivergene.net. Experiments with the software on large-scale biomolecular networks have shown outstanding speed and computing power. Interestingly, 86.67% of the top 10 driver nodes found on these networks are anticancer drug target genes that reside mostly in the innermost K-cores of the networks. Finally, we compared the method with those of five other researchers and confirmed that the proposed method outperforms the other methods in identifying anticancer drug target genes. Taken together, Drivergene.net is a reliable tool that efficiently detects not only drug target genes from biomolecular networks but also driver nodes of large-scale complex networks. Drivergene.net, with a user manual and example datasets, is available at https://github.com/tinhpd/Drivergene.git.
Affiliation(s)
- Duc-Tinh Pham
- Complex Systems and Bioinformatics Lab, Hanoi University of Industry, 298 Cau Dien Street, Bac Tu Liem District, Hanoi, Viet Nam; Graduate University of Science and Technology, Academy of Science and Technology Viet Nam, 18 Hoang Quoc Viet Street, Cau Giay District, Hanoi, Viet Nam
- Tien-Dzung Tran
- Complex Systems and Bioinformatics Lab, Hanoi University of Industry, 298 Cau Dien Street, Bac Tu Liem District, Hanoi, Viet Nam; Faculty of Information and Communication Technology, Hanoi University of Industry, 298 Cau Dien Street, Bac Tu Liem District, Hanoi, Viet Nam.
5
He B, Bu M, Lin Q, Fu Z, Xie J, Fan W, Li J, Li R, Hua W, Liu W, Cui P. CLAIR: An integrated lipid database across multiple crop species. Plant Commun 2024; 5:100855. [PMID: 38431773 PMCID: PMC11287182 DOI: 10.1016/j.xplc.2024.100855] [Received: 09/25/2023] [Revised: 11/07/2023] [Accepted: 02/27/2024] [Indexed: 03/05/2024]
Affiliation(s)
- Bing He
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Mengjia Bu
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Qiang Lin
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Zhengwei Fu
- Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture and Rural Affairs, Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Wuhan 430062, China
- Junhua Xie
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China; State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, Henan University, Kaifeng 475004, China
- Wei Fan
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China
- Jianyang Li
- College of Agronomy, Qingdao Agricultural University, Qingdao 266109, China
- Ruonan Li
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China; State Key Laboratory of Crop Stress Adaptation and Improvement, School of Life Sciences, Henan University, Kaifeng 475004, China
- Wei Hua
- Key Laboratory of Biology and Genetic Improvement of Oil Crops, Ministry of Agriculture and Rural Affairs, Oil Crops Research Institute of the Chinese Academy of Agricultural Sciences, Wuhan 430062, China.
- Wanfei Liu
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China.
- Peng Cui
- Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Area, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518120, China.
6
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
7
Ahmed MH, Samia NSN, Singh G, Gupta V, Mishal MFM, Hossain A, Suman KH, Raza A, Dutta AK, Labony MA, Sultana J, Faysal EH, Alnasser SM, Alam P, Azam F. An immuno-informatics approach for annotation of hypothetical proteins and multi-epitope vaccine designed against the Mpox virus. J Biomol Struct Dyn 2024; 42:5288-5307. [PMID: 37519185 DOI: 10.1080/07391102.2023.2239921] [Received: 04/02/2023] [Accepted: 06/09/2023] [Indexed: 08/01/2023]
Abstract
A worrying new outbreak of Monkeypox (Mpox) in humans is caused by the Mpox virus (MpoxV). The pathogen has roughly 28 hypothetical proteins of unknown structure, function, and pathogenicity. Using reliable bioinformatics tools, we attempted to analyze the MpoxV genome, identify the role of hypothetical proteins (HPs), and design a potential candidate vaccine. Out of 28, we identified seven hypothetical proteins using multi-server validation with high confidence for the occurrence of conserved domains. Their physical, chemical, and functional characterizations, including molecular weight, theoretical isoelectric point, 3D structures, GRAVY value, subcellular localization, functional motifs, antigenicity, and virulence factors, were performed. We predicted possible cytotoxic T cell (CTL), helper T cell (HTL) and linear and conformational B cell epitopes, which were combined in a 219 amino acid multiepitope vaccine with human β defensin as a linker. This multi-epitopic vaccine was structurally modelled and docked with toll-like receptor-3 (TLR-3). The dynamical stability of the vaccine-TLR-3 docked complexes exhibited stable interactions based on RMSD and RMSF tests. Additionally, the modelled vaccine was cloned in-silico in an E. coli host to check the appropriate expression of the final vaccine built. Our results might conform to an immunogenic and safe vaccine, which would require further experimental validation. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Md Hridoy Ahmed
- Department of Genetic Engineering and Biotechnology, University of Chittagong, Chittagong, Bangladesh
- Nure Sharaf Nower Samia
- Department of Life Sciences (DLS), School of Environment and Life Sciences (SELS), Independent University, Dhaka, Bangladesh
- Gagandeep Singh
- Kusuma School of Biological Sciences, Indian Institute of Technology, New Delhi, India
- Section of Microbiology, Central Ayurveda Research Institute, Jhansi CCRAS, Ministry of Ayush, India
- Vandana Gupta
- Department of Microbiology, Ram Lal Anand College, University of Delhi, New Delhi, India
- Alomgir Hossain
- Department of Genetic Engineering and Biotechnology, University of Rajshahi, Rajshahi, Bangladesh
- Adnan Raza
- Bioscience department, COMSATS University of Islamabad, Islamabad, Pakistan
- Amit Kumar Dutta
- Department of Microbiology, University of Rajshahi, Rajshahi, Bangladesh
- Moriom Akhter Labony
- Department of Genetic Engineering and Biotechnology, University of Chittagong, Chittagong, Bangladesh
- Jakia Sultana
- Department of Botany, University of Rajshahi, Rajshahi, Bangladesh
- Sulaiman Mohammed Alnasser
- Department of Pharmacology and Toxicology, Unaizah College of Pharmacy, Qassim University, Buraydah, Saudi Arabia
- Prawez Alam
- Department of Pharmacognosy, College of Pharmacy, Prince Sattam Bin Abdulaziz University, Al Kharj, Saudi Arabia
- Faizul Azam
- Department of Pharmaceutical Chemistry and Pharmacognosy, Unaizah College of Pharmacy, Qassim University, Buraydah, Saudi Arabia
8
Peyretaillade E, Akossi RF, Tournayre J, Delbac F, Wawrzyniak I. How to overcome constraints imposed by microsporidian genome features to ensure gene prediction? J Eukaryot Microbiol 2024:e13038. [PMID: 38934348 DOI: 10.1111/jeu.13038] [Received: 03/18/2024] [Revised: 06/03/2024] [Accepted: 06/10/2024] [Indexed: 06/28/2024]
Abstract
Since the advent of sequencing techniques and due to their continuous evolution, it has become easier and less expensive to obtain the complete genome sequence of any organism. Nevertheless, to elucidate all biological processes governing organism development, quality annotation is essential. In genome annotation, predicting gene structure is one of the most important and captivating challenges for computational biology. This aspect of annotation requires continual optimization, particularly for genomes as unusual as those of microsporidia. Indeed, this group of fungal-related parasites exhibits specific features (highly reduced gene sizes, sequences with high rate of evolution) linked to their evolution as intracellular parasites, requiring the implementation of specific annotation approaches to consider all these features. This review aimed to outline these characteristics and to assess the increasingly efficient approaches and tools that have enhanced the accuracy of gene prediction for microsporidia, both in terms of sensitivity and specificity. Subsequently, a final part will be dedicated to postgenomic approaches aimed at reinforcing the annotation data generated by prediction software. These approaches include the characterization of other understudied genes, such as those encoding regulatory noncoding RNAs or very small proteins, which also play crucial roles in the life cycle of these microorganisms.
Affiliation(s)
- Reginal F Akossi
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Jérémy Tournayre
- INRAE, UMR Herbivores, Université Clermont Auvergne, VetAgro Sup, Saint-Genès-Champanelle, France
- Frédéric Delbac
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
- Ivan Wawrzyniak
- LMGE, CNRS, Université Clermont Auvergne, Clermont-Ferrand, France
9
Hsieh YE, Tandon K, Verbruggen H, Nikoloski Z. Comparative analysis of metabolic models of microbial communities reconstructed from automated tools and consensus approaches. NPJ Syst Biol Appl 2024; 10:54. [PMID: 38783065 PMCID: PMC11116368 DOI: 10.1038/s41540-024-00384-y] [Received: 01/06/2024] [Accepted: 05/13/2024] [Indexed: 05/25/2024] Open
Abstract
Genome-scale metabolic models (GEMs) of microbial communities offer valuable insights into the functional capabilities of their members and facilitate the exploration of microbial interactions. These models are generated using different automated reconstruction tools, each relying on different biochemical databases that may affect the conclusions drawn from the in silico analysis. One way to address this problem is to employ a consensus reconstruction method that combines the outcomes of different reconstruction tools. Here, we conducted a comparative analysis of community models reconstructed from three automated tools, i.e. CarveMe, gapseq, and KBase, alongside a consensus approach, utilizing metagenomics data from two marine bacterial communities. Our analysis revealed that these reconstruction approaches, while based on the same genomes, resulted in GEMs with varying numbers of genes and reactions as well as metabolic functionalities, attributed to the different databases employed. Further, our results indicated that the set of exchanged metabolites was more influenced by the reconstruction approach rather than the specific bacterial community investigated. This observation suggests a potential bias in predicting metabolite interactions using community GEMs. We also showed that consensus models encompassed a larger number of reactions and metabolites while concurrently reducing the presence of dead-end metabolites. Therefore, the usage of consensus models allows making full and unbiased use from aggregating genes from the different reconstructions in assessing the functional potential of microbial communities.
Affiliation(s)
- Yunli Eric Hsieh
- Bioinformatics Department, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany
- Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
- School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
- Kshitij Tandon
- School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
- Heroen Verbruggen
- School of BioSciences, The University of Melbourne, Parkville, VIC, Australia
- Zoran Nikoloski
- Bioinformatics Department, Institute of Biochemistry and Biology, University of Potsdam, Potsdam, Germany.
- Systems Biology and Mathematical Modeling Group, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany.
10
Hogg BN, Schnepel C, Finnigan JD, Charnock SJ, Hayes MA, Turner NJ. The Impact of Metagenomics on Biocatalysis. Angew Chem Int Ed Engl 2024; 63:e202402316. [PMID: 38494442 DOI: 10.1002/anie.202402316] [Received: 02/01/2024] [Revised: 03/11/2024] [Accepted: 03/12/2024] [Indexed: 03/19/2024]
Abstract
In the ever-growing demand for sustainable ways to produce high-value small molecules, biocatalysis has come to the forefront of greener routes to these chemicals. As such, the need to constantly find and optimise suitable biocatalysts for specific transformations has never been greater. Metagenome mining has been shown to rapidly expand the toolkit of promiscuous enzymes needed for new transformations, without requiring protein engineering steps. If protein engineering is needed, the metagenomic candidate can often provide a better starting point for engineering than a previously discovered enzyme on the open database or from literature, for instance. In this review, we highlight where metagenomics has made substantial impact on the area of biocatalysis in recent years. We review the discovery of enzymes in previously unexplored or 'hidden' sequence space, leading to the characterisation of enzymes with enhanced properties that originate from natural selection pressures in native environments.
Affiliation(s)
- Bethany N Hogg
- Department of Chemistry, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, M1 7DN, UK
- Christian Schnepel
- School of Engineering Sciences in Chemistry, Biotechnology and Health, Department of Industrial Biotechnology, KTH Royal Institute of Technology, AlbaNova University Center, 11421, Stockholm, SE
- James D Finnigan
- Prozomix, Building 4, West End Ind. Estate, Haltwhistle, NE49 9HA, UK
- Simon J Charnock
- Prozomix, Building 4, West End Ind. Estate, Haltwhistle, NE49 9HA, UK
- Martin A Hayes
- Compound Synthesis and Management, Discovery Sciences, Biopharmaceuticals R&D, AstraZeneca, Mölndal 431 50, Gothenburg, SE
- Nicholas J Turner
- Department of Chemistry, University of Manchester, Manchester Institute of Biotechnology, 131 Princess Street, Manchester, M1 7DN, UK
11
Harrigan WL, Ferrell BD, Wommack KE, Polson SW, Schreiber ZD, Belcaid M. Improvements in viral gene annotation using large language models and soft alignments. BMC Bioinformatics 2024; 25:165. [PMID: 38664627 PMCID: PMC11046836 DOI: 10.1186/s12859-024-05779-6] [Received: 08/14/2023] [Accepted: 04/12/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.
Affiliation(s)
- William L Harrigan
- Hawai'i Institute of Marine Biology, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA
- Barbra D Ferrell
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- K Eric Wommack
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Shawn W Polson
- Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19713, USA
- Zachary D Schreiber
- Department of Plant & Soil Sciences, University of Delaware, Newark, DE, 19713, USA
- Mahdi Belcaid
- Department of Computer Science, University of Hawai'i at Mānoa, Honolulu, HI, 96822, USA.
12
Tripp A, Braun M, Wieser F, Oberdorfer G, Lechner H. Click, Compute, Create: A Review of Web-based Tools for Enzyme Engineering. Chembiochem 2024:e202400092. [PMID: 38634409 DOI: 10.1002/cbic.202400092] [Received: 01/31/2024] [Revised: 04/14/2024] [Accepted: 04/15/2024] [Indexed: 04/19/2024]
Abstract
Enzyme engineering, though pivotal across various biotechnological domains, is often time-consuming and labor-intensive. This review offers an overview of supportive in silico methodologies for this demanding endeavor. We start with methods for predicting protein structures, classifying enzyme activity, and discovering new enzymes, and continue with tools used to increase the thermostability and production yields of selected targets. Subsequently, we discuss computational methods to modulate both the activity and the selectivity of enzymes. Last, we present recent approaches based on cutting-edge machine learning methods to redesign enzymes. With the exception of the last chapter, there is a strong focus on methods easily accessible via web interfaces or simple Python scripts, and therefore readily usable by a diverse and broad community.
Affiliation(s)
- Adrian Tripp
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Markus Braun
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Florian Wieser
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
- Gustav Oberdorfer
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
  - BioTechMed, Graz, Austria
- Horst Lechner
  - Institute of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010, Graz, Austria
  - BioTechMed, Graz, Austria
|
13
|
Mandwal A, Bishop SL, Castellanos M, Westlund A, Chaconas G, Davidsen J, Lewis IA. MINNO: An Open Source Software for Refining Metabolic Networks and Investigating Complex Network Activity Using Empirical Metabolomics Data. Anal Chem 2024; 96:3382-3388. [PMID: 38359900 PMCID: PMC10902815 DOI: 10.1021/acs.analchem.3c04501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2023] [Revised: 12/18/2023] [Accepted: 01/19/2024] [Indexed: 02/17/2024]
Abstract
Metabolomics is a powerful tool for uncovering biochemical diversity in a wide range of organisms. Metabolic network modeling is commonly used to frame metabolomics data in the context of a broader biological system. However, network modeling of poorly characterized nonmodel organisms remains challenging due to gene homology mismatches, which lead to network architecture errors. To address this, we developed the Metabolic Interactive Nodular Network for Omics (MINNO), a web-based mapping tool that uses empirical metabolomics data to refine metabolic networks. MINNO allows users to create, modify, and interact with metabolic pathway visualizations for thousands of organisms, in both individual and multispecies contexts. Herein, we illustrate the use of MINNO in elucidating the metabolic networks of understudied species, such as those of the Borrelia genus, which cause Lyme and relapsing fever diseases. Using a hybrid genomics-metabolomics modeling approach, we constructed species-specific metabolic networks for three Borrelia species. Using these empirically refined networks, we were able to metabolically differentiate these species via their nucleotide metabolism, which cannot be predicted from genomic networks. Additionally, using MINNO, we identified 18 missing reactions from the KEGG database, of which nine were supported by the primary literature. These examples illustrate the use of metabolomics for the empirical refining of genetically constructed networks and show how MINNO can be used to study nonmodel organisms.
Affiliation(s)
- Ayush Mandwal
  - Department of Physics and Astronomy, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- Stephanie L. Bishop
  - Alberta Centre for Advanced Diagnostics, Department of Biological Sciences, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- Mildred Castellanos
  - Department of Biochemistry and Molecular Biology, Cumming School of Medicine, Snyder Institute for Chronic Diseases, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- Anika Westlund
  - Alberta Centre for Advanced Diagnostics, Department of Biological Sciences, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- George Chaconas
  - Department of Biochemistry and Molecular Biology, Cumming School of Medicine, Snyder Institute for Chronic Diseases, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
  - Department of Microbiology, Immunology and Infectious Diseases, Cumming School of Medicine, Snyder Institute for Chronic Diseases, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- Jörn Davidsen
  - Department of Physics and Astronomy, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
  - Hotchkiss Brain Institute, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
- Ian A. Lewis
  - Alberta Centre for Advanced Diagnostics, Department of Biological Sciences, University of Calgary, 2500 University Dr NW, Calgary T2N 1N4, Alberta, Canada
|
14
|
Kumar B, Lorusso E, Fosso B, Pesole G. A comprehensive overview of microbiome data in the light of machine learning applications: categorization, accessibility, and future directions. Front Microbiol 2024; 15:1343572. [PMID: 38419630 PMCID: PMC10900530 DOI: 10.3389/fmicb.2024.1343572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 01/29/2024] [Indexed: 03/02/2024] Open
Abstract
Metagenomics, metabolomics, and metaproteomics have significantly advanced our knowledge of microbial communities by providing culture-independent insights into their composition and functional potential. However, a critical challenge in this field is the lack of standard and comprehensive metadata associated with raw data, which hinders robust data stratification and the consideration of confounding factors. In this comprehensive review, we categorize publicly available microbiome data into five types: shotgun sequencing, amplicon sequencing, metatranscriptomic, metabolomic, and metaproteomic data. We explore the importance of metadata for data reuse and address the challenges in collecting standardized metadata. We also assess the limitations in metadata collection of existing public repositories that collect metagenomic data. This review emphasizes the vital role of metadata in interpreting and comparing datasets and highlights the need for standardized metadata protocols to fully leverage the potential of metagenomic data. Furthermore, we explore future directions for implementing Machine Learning (ML) in metadata retrieval, offering promising avenues for a deeper understanding of microbial communities and their ecological roles. Leveraging these tools will enhance our insights into microbial functional capabilities and ecological dynamics in diverse ecosystems. Finally, we emphasize the crucial role of metadata in the development of ML models.
Affiliation(s)
- Bablu Kumar
  - Università degli Studi di Milano, Milan, Italy
  - Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- Erika Lorusso
  - Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
  - National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
- Bruno Fosso
  - Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- Graziano Pesole
  - Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
  - National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
|
15
|
Lewis IA. Boundary flux analysis: an emerging strategy for investigating metabolic pathway activity in large cohorts. Curr Opin Biotechnol 2024; 85:103027. [PMID: 38061263 DOI: 10.1016/j.copbio.2023.103027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 11/02/2023] [Accepted: 11/15/2023] [Indexed: 02/09/2024]
Abstract
Many biological phenotypes are rooted in metabolic pathway activity rather than the concentrations of individual metabolites. Despite this, most metabolomics studies only capture steady-state metabolism - not metabolic flux. Although sophisticated metabolic flux analysis strategies have been developed, these methods are technically challenging and difficult to implement in large-cohort studies. Recently, a new boundary flux analysis (BFA) approach has emerged that captures large-scale metabolic flux phenotypes by quantifying changes in metabolite levels in the media of cultured cells. This approach is advantageous because it is relatively easy to implement yet captures complex metabolic flux phenotypes. We describe the opportunities and challenges of BFA and illustrate how it can be harnessed to investigate a wide transect of biological phenomena.
Affiliation(s)
- Ian A Lewis
  - Alberta Centre for Advanced Diagnostics, Department of Biological Sciences, University of Calgary, 2500 University Drive NW, Calgary, Alberta T2N 1N4, Canada
|
16
|
Atallah C, James K, Ou Z, Skelton J, Markham D, Burridge MS, Finnigan J, Charnock S, Wipat A. A method for the systematic selection of enzyme panel candidates by solving the maximum diversity problem. Biosystems 2024; 236:105105. [PMID: 38160995 DOI: 10.1016/j.biosystems.2023.105105] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/15/2023] [Revised: 12/05/2023] [Accepted: 12/15/2023] [Indexed: 01/03/2024]
Abstract
Enzymes are being increasingly exploited for their potential as industrial biocatalysts. Establishing a portfolio of useful biocatalysts from a large and diverse protein family is challenging, and a systematic method for candidate selection promises to aid in this task. Moreover, accurate enzyme functional annotation can only be confidently guaranteed through experimental characterisation in the laboratory. The selection of catalytically diverse enzyme panels for experimental characterisation is also an important step for shedding light on the currently unannotated proteins in enzyme families. Current selection methods often lack efficiency and scalability, and are usually non-systematic. We present a novel algorithm for the automatic selection of subsets from enzyme families. A tabu search algorithm solving the maximum diversity problem for sequence identity was designed and implemented, and applied to three diverse enzyme families. We show that this approach automatically selects panels of enzymes that contain high richness and relative abundance of the known catalytic functions, and outperforms other methods such as k-medoids.
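To make the maximum diversity problem concrete, the following is a minimal sketch of a swap-based tabu search that picks `k` items maximizing the sum of their pairwise distances. It is a generic textbook-style version, not the algorithm from the paper: the function names, the `tenure` and `iterations` parameters, and the toy distance matrix are assumptions. For enzyme panels, the distance could plausibly be taken as one minus pairwise sequence identity.

```python
import itertools
import random

def diversity(dist, subset):
    """Sum of pairwise distances within a subset of item indices."""
    return sum(dist[i][j] for i, j in itertools.combinations(sorted(subset), 2))

def tabu_max_diversity(dist, k, iterations=200, tenure=5, seed=0):
    """Swap-based tabu search for the maximum diversity problem (sketch)."""
    rng = random.Random(seed)
    n = len(dist)
    current = set(rng.sample(range(n), k))
    best, best_score = set(current), diversity(dist, current)
    tabu_until = {}  # item index -> last iteration at which it is still tabu
    for it in range(iterations):
        candidates = []
        for out in current:
            for inn in set(range(n)) - current:
                trial = (current - {out}) | {inn}
                score = diversity(dist, trial)
                is_tabu = (tabu_until.get(out, -1) >= it
                           or tabu_until.get(inn, -1) >= it)
                # Aspiration: a tabu move is allowed if it beats the best so far.
                if not is_tabu or score > best_score:
                    candidates.append((score, out, inn))
        if not candidates:
            continue  # every move is tabu; wait for tenures to expire
        score, out, inn = max(candidates)
        current = (current - {out}) | {inn}
        tabu_until[out] = it + tenure
        tabu_until[inn] = it + tenure
        if score > best_score:
            best, best_score = set(current), score
    return best, best_score

# Toy example: items are points on a line, distance is absolute difference.
positions = [0, 1, 2, 10, 11, 20]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
panel, panel_score = tabu_max_diversity(toy_dist, k=3)
```

Unlike plain hill climbing, the search accepts the best non-tabu swap even when it lowers the score, which lets it leave local optima; the tabu list keeps it from immediately undoing recent swaps.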
Affiliation(s)
- Katherine James
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Zhen Ou
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- James Skelton
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- David Markham
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Matt S Burridge
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Anil Wipat
  - School of Computing, Newcastle University, Newcastle upon Tyne, UK
|
17
|
Kinateder T, Mayer C, Nazet J, Sterner R. Improving enzyme functional annotation by integrating in vitro and in silico approaches: The example of histidinol phosphate phosphatases. Protein Sci 2024; 33:e4899. [PMID: 38284491 PMCID: PMC10804674 DOI: 10.1002/pro.4899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Revised: 12/13/2023] [Accepted: 01/01/2024] [Indexed: 01/30/2024]
Abstract
Advances in sequencing technologies have led to a rapid growth of public protein sequence databases, whereby the fraction of proteins with experimentally verified function continuously decreases. This problem is currently addressed by automated functional annotations with computational tools, which however lack the accuracy of experimental approaches and are susceptible to error propagation. Here, we present an approach that combines the efficiency of functional annotation by in silico methods with the rigor of enzyme characterization in vitro. First, a thorough experimental analysis of a representative enzyme of a group of homologues is performed which includes a focused alanine scan of the active site to determine a fingerprint of function-determining residues. In a second step, this fingerprint is used in combination with a sequence similarity network to identify putative isofunctional enzymes among the homologues. Using this approach in a proof-of-principle study, homologues of the histidinol phosphate phosphatase (HolPase) from Pseudomonas aeruginosa, many of which were annotated as phosphoserine phosphatases, were predicted to be HolPases. This functional annotation of the homologues was verified by in vitro testing of several representatives and an analysis of the occurrence of annotated HolPases in the corresponding phylogenetic groups. Moreover, the application of the same approach to the homologues of the HolPase from the archaeon Nitrosopumilus maritimus, which is not related to the HolPase from P. aeruginosa and was newly discovered in the course of this work, led to the annotation of the putative HolPase from various archaeal species.
Affiliation(s)
- Thomas Kinateder
  - Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Carina Mayer
  - Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Julian Nazet
  - Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
- Reinhard Sterner
  - Institute of Biophysics and Physical Biochemistry & Regensburg Center for Biochemistry, University of Regensburg, Regensburg, Germany
|
18
|
Fannjiang C, Listgarten J. Is Novelty Predictable? Cold Spring Harb Perspect Biol 2024; 16:a041469. [PMID: 38052497 PMCID: PMC10835614 DOI: 10.1101/cshperspect.a041469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2023]
Abstract
Machine learning-based design has gained traction in the sciences, most notably in the design of small molecules, materials, and proteins, with societal applications ranging from drug development and plastic degradation to carbon sequestration. When designing objects to achieve novel property values with machine learning, one faces a fundamental challenge: how to push past the frontier of current knowledge, distilled from the training data into the model, in a manner that rationally controls the risk of failure. If one trusts learned models too much in extrapolation, one is likely to design rubbish. In contrast, if one does not extrapolate, one cannot find novelty. Herein, we ponder how one might strike a useful balance between these two extremes. We focus in particular on designing proteins with novel property values, although much of our discussion is relevant to machine learning-based design more broadly.
Affiliation(s)
- Clara Fannjiang
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
- Jennifer Listgarten
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA
|
19
|
Muralidharan HS, Fox NY, Pop M. The impact of transitive annotation on the training of taxonomic classifiers. Front Microbiol 2024; 14:1240957. [PMID: 38235435 PMCID: PMC10792039 DOI: 10.3389/fmicb.2023.1240957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Accepted: 11/03/2023] [Indexed: 01/19/2024] Open
Abstract
Introduction: A common task in the analysis of microbial communities involves assigning taxonomic labels to the sequences derived from organisms found in the communities. Frequently, such labels are assigned using machine learning algorithms that are trained to recognize individual taxonomic groups based on training data sets that comprise sequences with known taxonomic labels. Ideally, the training data should rely on labels that are experimentally verified, since formal taxonomic labels require knowledge of physical and biochemical properties of organisms that cannot be directly inferred from sequence alone. However, the labels associated with sequences in biological databases are most commonly computational predictions which themselves may rely on computationally generated data, a process commonly referred to as "transitive annotation." Methods: In this manuscript we explore the implications of training a machine learning classifier (the Ribosomal Database Project's Bayesian classifier in our case) on data that itself has been computationally generated. We generate new training examples based on 16S rRNA data from a metagenomic experiment, and evaluate the extent to which the taxonomic labels predicted by the classifier change after re-training. Results: We demonstrate that even a few computationally generated training data points can significantly skew the output of the classifier to the point where entire regions of the taxonomic space can be disturbed. Discussion and conclusions: We conclude with a discussion of key factors that affect the resilience of classifiers to transitively annotated training data, and propose best practices to avoid the artifacts described in our paper.
Affiliation(s)
- Harihara Subrahmaniam Muralidharan
  - Department of Computer Science, University of Maryland, College Park, MD, United States
  - Center for Bioinformatics and Computational Biology (CBCB), University of Maryland, College Park, MD, United States
- Noam Y. Fox
  - Department of Computer Science, University of Maryland, College Park, MD, United States
- Mihai Pop
  - Department of Computer Science, University of Maryland, College Park, MD, United States
  - Center for Bioinformatics and Computational Biology (CBCB), University of Maryland, College Park, MD, United States
|
20
|
Zhang Y, Takaki Y, Yoshida-Takashima Y, Hiraoka S, Kurosawa K, Nunoura T, Takai K. A sequential one-pot approach for rapid and convenient characterization of putative restriction-modification systems. mSystems 2023; 8:e0081723. [PMID: 37843256 PMCID: PMC10734518 DOI: 10.1128/msystems.00817-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Accepted: 09/05/2023] [Indexed: 10/17/2023] Open
Abstract
IMPORTANCE: The elucidation of the molecular basis of virus-host coevolutionary interactions is boosted with state-of-the-art sequencing technologies. However, the sequence-only information is often insufficient to output a conclusive argument without biochemical characterizations. We proposed a 1-day and one-pot approach to confirm the exact function of putative restriction-modification (R-M) genes that presumably mediate microbial coevolution. The experiments mainly focused on a series of putative R-M enzymes from a deep-sea virus and its host bacterium. The results quickly unveiled unambiguous substrate specificities, superior catalytic performance, and unique sequence preferences for two new restriction enzymes (capable of cleaving DNA) and two new methyltransferases (capable of modifying DNA with methyl groups). The reality of the functional R-M system reinforced a model of mutually beneficial interactions with the virus in the deep-sea microbial ecosystem. The cell culture-independent approach also holds great potential for exploring novel and biotechnologically significant R-M enzymes from microbial dark matter.
Affiliation(s)
- Yi Zhang
  - SUGAR Program, X-star, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Yoshihiro Takaki
  - SUGAR Program, X-star, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Yukari Yoshida-Takashima
  - SUGAR Program, X-star, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Satoshi Hiraoka
  - Research Center for Bioscience and Nanoscience (CeBN), MRU, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Kanako Kurosawa
  - SUGAR Program, X-star, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Takuro Nunoura
  - Research Center for Bioscience and Nanoscience (CeBN), MRU, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
- Ken Takai
  - SUGAR Program, X-star, Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokosuka, Japan
|
21
|
Dimonaco NJ, Clare A, Kenobi K, Aubrey W, Creevey CJ. StORF-Reporter: finding genes between genes. Nucleic Acids Res 2023; 51:11504-11517. [PMID: 37897345 PMCID: PMC10682499 DOI: 10.1093/nar/gkad814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2022] [Revised: 09/04/2023] [Accepted: 09/27/2023] [Indexed: 10/30/2023] Open
Abstract
Large regions of prokaryotic genomes are currently without any annotation, in part due to well-established limitations of annotation tools. For example, it is routine for genes using alternative start codons to be misreported or completely omitted. Therefore, we present StORF-Reporter, a tool that takes an annotated genome and returns regions that may contain missing CDS genes from unannotated regions. StORF-Reporter consists of two parts. The first begins with the extraction of unannotated regions from an annotated genome. Next, Stop-ORFs (StORFs) are identified in these unannotated regions. StORFs are open reading frames that are delimited by stop codons and thus can capture those genes most often missing in genome annotations. We show this methodology recovers genes missing from canonical genome annotations. We inspect the results of the genomes of model organisms, the pangenome of Escherichia coli, and a set of 5109 prokaryotic genomes of 247 genera from the Ensembl Bacteria database. StORF-Reporter extended the core, soft-core and accessory gene collections, identified novel gene families and extended families into additional genera. The high levels of sequence conservation observed between genera suggest that many of these StORFs are likely to be functional genes that should now be considered for inclusion in canonical annotations.
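The stop-to-stop definition of an ORF described above can be sketched in a few lines. This is an illustrative simplification rather than the StORF-Reporter code: the function name `find_storfs`, the `min_len` threshold, and the choice to scan only the three forward frames of a single region are assumptions, and the real tool may well handle strands, overlaps, and filtering differently.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_storfs(region, min_len=30):
    """Return (frame, start, end) for stop-to-stop ORFs in a DNA string.

    `start` is the index of the first base of the upstream stop codon and
    `end` is one past the last base of the downstream stop codon, so
    region[start:end] includes both delimiting stops. Forward frames only.
    """
    region = region.upper()
    found = []
    for frame in range(3):
        previous_stop = None
        # Step through the region codon by codon in this reading frame.
        for pos in range(frame, len(region) - 2, 3):
            if region[pos:pos + 3] in STOP_CODONS:
                if previous_stop is not None:
                    start, end = previous_stop, pos + 3
                    if end - start >= min_len:
                        found.append((frame, start, end))
                previous_stop = pos
    return found

# Toy unannotated region: ten GCT codons flanked by two in-frame stop codons.
toy_region = "TAA" + "GCT" * 10 + "TAG" + "CC"
toy_storfs = find_storfs(toy_region)
```

Because the scan never looks for a start codon, a gene that uses an alternative start still falls inside a reported stop-to-stop interval, which is the property the abstract highlights.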
Affiliation(s)
- Nicholas J Dimonaco
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, Wales, UK
  - Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
  - Department of Medicine, McMaster University, Hamilton, ON, Canada
  - Farncombe Family Digestive Health Research Institute, McMaster University, Hamilton, ON, Canada
  - School of Biological Sciences, Queen’s University Belfast, Belfast BT7 1NN, Northern Ireland, UK
- Amanda Clare
  - Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
- Kim Kenobi
  - Department of Mathematics, Aberystwyth University, Aberystwyth SY23 3BZ, Wales, UK
- Wayne Aubrey
  - Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, Wales, UK
- Christopher J Creevey
  - School of Biological Sciences, Queen’s University Belfast, Belfast BT7 1NN, Northern Ireland, UK
|
22
|
Pathira Kankanamge LS, Ruffner LA, Touch MM, Pina M, Beuning PJ, Ondrechen MJ. Functional annotation of haloacid dehalogenase superfamily structural genomics proteins. Biochem J 2023; 480:1553-1569. [PMID: 37747786 DOI: 10.1042/bcj20230057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2023] [Revised: 09/20/2023] [Accepted: 09/25/2023] [Indexed: 09/26/2023]
Abstract
Haloacid dehalogenases (HAD) are members of a large superfamily that includes many Structural Genomics proteins with poorly characterized functionality. This superfamily consists of multiple types of enzymes that can act as sugar phosphatases, haloacid dehalogenases, phosphonoacetaldehyde hydrolases, ATPases, or phosphate monoesterases. Here, we report on predicted functional annotations and experimental testing by direct biochemical assay for Structural Genomics proteins from the HAD superfamily. To characterize the functions of HAD superfamily members, nine representative HAD proteins and 21 structural genomics proteins are analyzed. Using techniques based on computed chemical and electrostatic properties of individual amino acids, the functions of five structural genomics proteins from the HAD superfamily are predicted and validated by biochemical assays. A dehalogenase-like hydrolase, RSc1362 (Uniprot Q8XZN3, PDB 3UMB) is predicted to be a dehalogenase and dehalogenase activity is confirmed experimentally. Four proteins predicted to be sugar phosphatases are characterized as follows: a sugar phosphatase from Thermophilus volcanium (Uniprot Q978Y6) with trehalose-6-phosphate phosphatase and fructose-6-phosphate phosphatase activity; haloacid dehalogenase-like hydrolase from Bacteroides thetaiotaomicron (Uniprot Q8A2F3; PDB 3NIW) with fructose-6-phosphate phosphatase and sucrose-6-phosphate phosphatase activity; putative phosphatase from Eubacterium rectale (Uniprot D0VWU2; PDB 3DAO) as a sucrose-6-phosphate phosphatase; and hypothetical protein from Geobacillus kaustophilus (Uniprot Q5L139; PDB 2PQ0) as a fructose-6-phosphate phosphatase. Most of these sugar phosphatases showed some substrate promiscuity.
Affiliation(s)
- Lydia A Ruffner
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, U.S.A
- Mong Mary Touch
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, U.S.A
- Manuel Pina
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, U.S.A
- Penny J Beuning
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, U.S.A
- Mary Jo Ondrechen
  - Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115, U.S.A
|
23
|
Taniguchi T, Okuno M, Shinoda T, Kobayashi F, Takahashi K, Yuasa H, Nakamura Y, Tanaka H, Kajitani R, Itoh T. GINGER: an integrated method for high-accuracy prediction of gene structure in higher eukaryotes at the gene and exon level. DNA Res 2023; 30:dsad017. [PMID: 37478310 PMCID: PMC10439787 DOI: 10.1093/dnares/dsad017] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/03/2023] [Revised: 07/10/2023] [Accepted: 07/20/2023] [Indexed: 07/23/2023] Open
Abstract
The prediction of gene structure within the genome sequence is the starting point of genome analysis, and its accuracy has a significant impact on the quality of subsequent analyses. Gene structure prediction is roughly divided into RNA-Seq-based methods, ab initio-based methods, homology-based methods, and the integration of individual prediction methods. Integrated methods are mainstream in recent genome projects because they improve prediction accuracy by combining or taking the best individual prediction findings; however, adequate prediction accuracy for eukaryotic species has not yet been achieved. Therefore, we developed an integrated tool, GINGER, that solves various issues related to gene structure prediction in higher eukaryotes. By handling artefacts in alignments of RNA and protein sequences, reconstructing gene structures via dynamic programming with appropriately weighted and scored exon/intron/intergenic regions, and applying different prediction processes and filtering criteria to multi-exon and single-exon genes, we achieved a significant improvement in accuracy compared to the existing integration methods. The feature of GINGER is its high prediction accuracy at the gene and exon levels, which is pronounced for species with more complex gene architectures. GINGER is implemented using Nextflow, which allows for the efficient and effective use of computing resources.
Affiliation(s)
- Takeaki Taniguchi
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
  - Bioproduction Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Toyohira-Ku, Sapporo, 062-8517, Japan
- Miki Okuno
  - Division of Microbiology, Department of Infectious Medicine, Kurume University School of Medicine, Fukuoka 830-0011, Japan
- Takahiro Shinoda
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Fumiya Kobayashi
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Kazuki Takahashi
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Hideaki Yuasa
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Yuta Nakamura
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Hiroyuki Tanaka
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Rei Kajitani
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
- Takehiko Itoh
  - School of Life Science and Technology, Tokyo Institute of Technology, Meguro-ku, Tokyo, 152-8550, Japan
|
24
|
Maatouk M, Merhej V, Pontarotti P, Ibrahim A, Rolain JM, Bittar F. Metallo-Beta-Lactamase-like Encoding Genes in Candidate Phyla Radiation: Widespread and Highly Divergent Proteins with Potential Multifunctionality. Microorganisms 2023; 11:1933. [PMID: 37630493 PMCID: PMC10459063 DOI: 10.3390/microorganisms11081933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2023] [Revised: 07/22/2023] [Accepted: 07/27/2023] [Indexed: 08/27/2023] Open
Abstract
The Candidate Phyla Radiation (CPR) was found to harbor a vast repertoire of genes encoding for enzymes with potential antibiotic resistance activity. Among these, as many as 3349 genes were predicted in silico to contain a metallo-beta-lactamase-like (MBL-like) fold. These proteins were subject to an in silico functional characterization by comparing their protein profiles (presence/absence of conserved protein domains) to other MBLs, including 24 already expressed in vitro, along with those of the beta-lactamase database (BLDB) (n = 761). The sequence similarity network (SSN) was then used to predict the functional clusters of CPR MBL-like sequences. Our findings showed that CPR MBL-like sequences were longer and more diverse than bacterial MBL sequences, with a high content of functional domains. Most CPR MBL-like sequences did not show any SSN connectivity with expressed MBLs, indicating the presence of many potential, yet unidentified, functions in CPR. In conclusion, CPR was shown to have many protein functions and a large sequence variability of MBL-like folds, exceeding all known MBLs. Further experimental and evolutionary studies of this superfamily of hydrolyzing enzymes are necessary to illustrate their functional annotation, origin, and expansion for adaptation or specialization within a given niche or compared to a specific substrate.
Affiliation(s)
- Mohamad Maatouk
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
| | - Vicky Merhej
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
| | - Pierre Pontarotti
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
- Centre National de la Recherche Scientifique (CNRS-SNC5039), 13009 Marseille, France
| | - Ahmad Ibrahim
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
| | - Jean-Marc Rolain
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
| | - Fadi Bittar
- Microbes, Evolution, Phylogénie et Infection (MEPHI), Institut de Recherche pour le Développement (IRD), Assistance Publique-Hôpitaux de Marseille (AP-HM), Aix-Marseille University, 13005 Marseille, France; (M.M.); (P.P.); (A.I.); (J.-M.R.)
- Institut Hospitalo-Universitaire (IHU) Méditerranée Infection, 13005 Marseille, France
| |
|
25
|
Mandwal A, Bishop SL, Castellanos M, Westlund A, Chaconas G, Lewis I, Davidsen J. Metabolic Interactive Nodular Network for Omics (MINNO): Refining and investigating metabolic networks based on empirical metabolomics data. bioRxiv 2023:2023.07.14.548964. [PMID: 37503268 PMCID: PMC10370097 DOI: 10.1101/2023.07.14.548964] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Metabolomics is a powerful tool for uncovering biochemical diversity in a wide range of organisms, and metabolic network modeling is commonly used to frame results in the context of a broader homeostatic system. However, network modeling of poorly characterized, non-model organisms remains challenging due to gene homology mismatches. To address this challenge, we developed Metabolic Interactive Nodular Network for Omics (MINNO), a web-based mapping tool that takes in empirical metabolomics data to refine metabolic networks for both model and unusual organisms. MINNO allows users to create and modify interactive metabolic pathway visualizations for thousands of organisms, in both individual and multi-species contexts. Herein, we demonstrate an important application of MINNO in elucidating the metabolic networks of understudied species, such as those of the Borrelia genus, which cause Lyme disease and relapsing fever. Using a hybrid genomics-metabolomics modeling approach, we constructed species-specific metabolic networks for three Borrelia species. Using these empirically refined networks, we were able to metabolically differentiate these genetically similar species via their nucleotide and nicotinate metabolic pathways that cannot be predicted from genomic networks. These examples illustrate the use of metabolomics for the empirical refining of genetically constructed networks and show how MINNO can be used to study non-model organisms.
Affiliation(s)
- Ayush Mandwal
- Department of Physics and Astronomy, University of Calgary, Calgary, AB, Canada
| | - Stephanie L. Bishop
- Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
| | - Mildred Castellanos
- Department of Biochemistry and Molecular Biology, Cumming School of Medicine, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
| | - Anika Westlund
- Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
| | - George Chaconas
- Department of Biochemistry and Molecular Biology, Cumming School of Medicine, Snyder Institute for Chronic Diseases, University of Calgary, Calgary, AB, Canada
| | - Ian Lewis
- Department of Biological Sciences, University of Calgary, Calgary, AB, Canada
| | - Jörn Davidsen
- Department of Physics and Astronomy, University of Calgary, Calgary, AB, Canada
- Hotchkiss Brain Institute, University of Calgary, Calgary, AB, Canada
| |
|
26
|
Oberg N, Zallot R, Gerlt JA. EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools. J Mol Biol 2023; 435:168018. [PMID: 37356897 PMCID: PMC10291204 DOI: 10.1016/j.jmb.2023.168018] [Citation(s) in RCA: 63] [Impact Index Per Article: 63.0] [Received: 11/23/2022] [Revised: 02/04/2023] [Accepted: 02/13/2023] [Indexed: 02/19/2023]
Abstract
The Enzyme Function Initiative (EFI) provides a web resource with "genomic enzymology" web tools to leverage the protein (UniProt) and genome (European Nucleotide Archive; ENA; https://www.ebi.ac.uk/ena/) databases to assist the assignment of in vitro enzymatic activities and in vivo metabolic functions to uncharacterized enzymes (https://efi.igb.illinois.edu/). The tools enable (1) exploration of sequence-function space in enzyme families using sequence similarity networks (SSNs; EFI-EST), (2) easy access to genome context for bacterial, archaeal, and fungal proteins in the SSN clusters so that isofunctional families can be identified and their functions inferred from genome context (EFI-GNT); and (3) determination of the abundance of SSN clusters in NIH Human Metagenome Project metagenomes using chemically guided functional profiling (EFI-CGFP). We describe enhancements that enable SSNs to be generated from taxonomy categories, allowing higher resolution analyses of sequence-function space; we provide examples of the generation of taxonomy category-specific SSNs.
Affiliation(s)
- Nils Oberg
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States
| | - Rémi Zallot
- Department of Chemistry, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK
| | - John A Gerlt
- Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States; Department of Biochemistry, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States; Department of Chemistry, University of Illinois at Urbana-Champaign, 1206 West Gregory Drive, Urbana, IL 61801, United States.
| |
|
27
|
Robben M, Nasr MS, Das A, Veerla JP, Huber M, Jaworski J, Weidanz J, Luber J. Comparison of the Strengths and Weaknesses of Machine Learning Algorithms and Feature Selection on KEGG Database Microbial Gene Pathway Annotation and Its Effects on Reconstructed Network Topology. J Comput Biol 2023; 30:766-782. [PMID: 37437088 DOI: 10.1089/cmb.2022.0370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/14/2023] Open
Abstract
The development of tools for the annotation of genes from newly sequenced species has not evolved much from homologous alignment to prior annotated species. While the quality of gene annotations continues to decline as we sequence and assemble more evolutionarily distant gut microbiome species, machine learning presents a high-quality alternative to traditional techniques. In this study, we investigate the relative performance of common classical and nonclassical machine learning algorithms in the problem of gene annotation using human microbiome-associated species genes from the KEGG database. The majority of the ensemble, clustering, and deep learning algorithms that we investigated showed higher prediction accuracy than CD-Hit in predicting partial KEGG function. Motif-based machine-learning methods of annotation in new species were faster and had higher precision-recall than methods of homologous alignment or orthologous gene clustering. Gradient boosted ensemble methods and neural networks also predicted higher connectivity in reconstructed KEGG pathways, finding twice as many new pathway interactions as BLAST alignment. The use of motif-based machine-learning algorithms in annotation software will allow researchers to develop powerful tools to interact with bacterial microbiomes in ways previously unachievable through homologous sequence alignment alone.
Affiliation(s)
- Michael Robben
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Mohammad Sadegh Nasr
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Avishek Das
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Jai Prakash Veerla
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Manfred Huber
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Justyn Jaworski
- Department of Bioengineering, University of Texas at Arlington, Arlington, Texas, USA
| | - Jon Weidanz
- Department of Kinesiology, University of Texas at Arlington, Arlington, Texas, USA
| | - Jacob Luber
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, Texas, USA
| |
|
28
|
Diene SM, Pontarotti P, Azza S, Armstrong N, Pinault L, Chabrière E, Colson P, Rolain JM, Raoult D. Origin, Diversity, and Multiple Roles of Enzymes with Metallo-β-Lactamase Fold from Different Organisms. Cells 2023; 12:1752. [PMID: 37443786 PMCID: PMC10340364 DOI: 10.3390/cells12131752] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 05/11/2023] [Revised: 06/23/2023] [Accepted: 06/28/2023] [Indexed: 07/15/2023] Open
Abstract
β-lactamase enzymes have generated significant interest due to their ability to confer resistance to the most commonly used family of antibiotics in human medicine. Among these enzymes, the class B β-lactamases are members of a superfamily of metallo-β-lactamase (MβL) fold proteins which are characterised by conserved motifs (i.e., HxHxDH) and are not limited to bacteria. Indeed, as the result of several barriers, including low sequence similarity, default protein annotation, or untested enzymatic activity, MβL fold proteins have long been unexplored in other organisms. However, thanks to search approaches which are more sensitive than classical BLAST analysis, such as the use of common ancestors to identify distant homologous sequences, we are now able to highlight their presence in different organisms including Bacteria, Archaea, Nanoarchaeota, Asgard, Humans, Giant viruses, and Candidate Phyla Radiation (CPR). These MβL fold proteins are multifunctional enzymes with diverse enzymatic or non-enzymatic activities, of which at least thirteen have been reported, such as β-lactamase, ribonuclease, nuclease, glyoxalase, lactonase, phytase, ascorbic acid degradation, anti-cancer drug degradation, or membrane transport. In this review, we (i) discuss the existence of MβL fold enzymes in the different domains of life, (ii) present more suitable approaches to better investigating their homologous sequences in unsuspected sources, and (iii) report described MβL fold enzymes with demonstrated enzymatic or non-enzymatic activities.
Affiliation(s)
- Seydina M. Diene
- MEPHI, IRD, AP-HM, IHU-Méditerranée Infection, Aix Marseille University, 13005 Marseille, France
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
| | - Pierre Pontarotti
- MEPHI, IRD, AP-HM, IHU-Méditerranée Infection, Aix Marseille University, 13005 Marseille, France
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
- CNRS SNC5039, 13005 Marseille, France
| | - Saïd Azza
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
- Assistance Publique-Hôpitaux de Marseille (AP-HM), IHU-Méditerranée Infection, 13005 Marseille, France
| | - Nicholas Armstrong
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
- Assistance Publique-Hôpitaux de Marseille (AP-HM), IHU-Méditerranée Infection, 13005 Marseille, France
| | - Lucile Pinault
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
- Assistance Publique-Hôpitaux de Marseille (AP-HM), IHU-Méditerranée Infection, 13005 Marseille, France
| | - Eric Chabrière
- MEPHI, IRD, AP-HM, IHU-Méditerranée Infection, Aix Marseille University, 13005 Marseille, France
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
| | - Philippe Colson
- MEPHI, IRD, AP-HM, IHU-Méditerranée Infection, Aix Marseille University, 13005 Marseille, France
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
| | - Jean-Marc Rolain
- MEPHI, IRD, AP-HM, IHU-Méditerranée Infection, Aix Marseille University, 13005 Marseille, France
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
| | - Didier Raoult
- IHU-Méditerranée Infection, 13005 Marseille, France; (S.A.)
| |
|
29
|
Spiers AJ, Dorfmueller HC, Jerdan R, McGregor J, Nicoll A, Steel K, Cameron S. Bioinformatics characterization of BcsA-like orphan proteins suggest they form a novel family of pseudomonad cyclic-β-glucan synthases. PLoS One 2023; 18:e0286540. [PMID: 37267309 PMCID: PMC10237404 DOI: 10.1371/journal.pone.0286540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 05/18/2023] [Indexed: 06/04/2023] Open
Abstract
Bacteria produce a variety of polysaccharides with functional roles in cell surface coating, surface and host interactions, and biofilms. We have identified an 'Orphan' bacterial cellulose synthase catalytic subunit (BcsA)-like protein found in four model pseudomonads, P. aeruginosa PA01, P. fluorescens SBW25, P. putida KT2440 and P. syringae pv. tomato DC3000. Pairwise alignments indicated that the Orphan and BcsA proteins shared less than 41% sequence identity, suggesting they may not have the same structural folds or function. We identified 112 Orphans among soil and plant-associated pseudomonads as well as in phytopathogenic and human opportunistic pathogenic strains. The wide distribution of these highly conserved proteins suggests they form a novel family of synthases producing a different polysaccharide. In silico analysis, including sequence comparisons, secondary structure and topology predictions, and protein structural modelling, revealed a two-domain transmembrane ovoid-like structure for the Orphan protein with a periplasmic glycosyl hydrolase family GH17 domain linked via a transmembrane region to a cytoplasmic glycosyltransferase family GT2 domain. We suggest the GT2 domain synthesises β-(1,3)-glucan that is transferred to the GH17 domain where it is cleaved and cyclised to produce cyclic-β-(1,3)-glucan (CβG). Our structural models are consistent with enzymatic characterisation and recent molecular simulations of the PaPA01 and PpKT2440 GH17 domains. They also provide a functional explanation linking PaPAK and PaPA14 Orphan (also known as NdvB) transposon mutants with CβG production and biofilm-associated antibiotic resistance. Importantly, cyclic glucans are also involved in osmoregulation, plant infection and induced systemic suppression, and our findings suggest this novel family of CβG synthases may provide a similar range of adaptive responses for pseudomonads.
Affiliation(s)
- Andrew J. Spiers
- School of Applied Sciences, Abertay University, Dundee, United Kingdom
| | - Helge C. Dorfmueller
- Division of Molecular Microbiology, School of Life Sciences, University of Dundee, Dundee, United Kingdom
| | - Robyn Jerdan
- School of Applied Sciences, Abertay University, Dundee, United Kingdom
| | - Jessica McGregor
- Nuffield Research Placement Students, School of Applied Sciences, Abertay University, Dundee, United Kingdom
| | - Abbie Nicoll
- Nuffield Research Placement Students, School of Applied Sciences, Abertay University, Dundee, United Kingdom
| | - Kenzie Steel
- Nuffield Research Placement Students, School of Applied Sciences, Abertay University, Dundee, United Kingdom
| | - Scott Cameron
- School of Applied Sciences, Abertay University, Dundee, United Kingdom
| |
|
30
|
Schroer WF, Kepner HE, Uchimiya M, Mejia C, Rodriguez LT, Reisch CR, Moran MA. Functional annotation and importance of marine bacterial transporters of plankton exometabolites. ISME Communications 2023; 3:37. [PMID: 37185952 PMCID: PMC10130141 DOI: 10.1038/s43705-023-00244-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 04/01/2023] [Accepted: 04/14/2023] [Indexed: 05/17/2023]
Abstract
Metabolite exchange within marine microbial communities transfers carbon and other major elements through global cycles and forms the basis of microbial interactions. Yet lack of gene annotations and concern about the quality of existing ones remain major impediments to revealing currencies of carbon flux. We employed an arrayed mutant library of the marine bacterium Ruegeria pomeroyi DSS-3 to experimentally annotate substrates of organic compound transporter systems, using mutant growth and compound drawdown analyses to link transporters to their cognate substrates. Mutant experiments verified substrates for thirteen R. pomeroyi transporters. Four were previously hypothesized based on gene expression data (taurine, glucose/xylose, isethionate, and cadaverine/putrescine/spermidine); five were previously hypothesized based on homology to experimentally annotated transporters in other bacteria (citrate, glycerol, N-acetylglucosamine, fumarate/malate/succinate, and dimethylsulfoniopropionate); and four had no previous annotations (thymidine, carnitine, cysteate, and 3-hydroxybutyrate). These bring the total number of experimentally-verified organic carbon influx transporters to 18 of 126 in the R. pomeroyi genome. In a longitudinal study of a coastal phytoplankton bloom, expression patterns of the experimentally annotated transporters linked them to different stages of the bloom, and also led to the hypothesis that citrate and 3-hydroxybutyrate were among the most highly available bacterial substrates. Improved functional annotation of the gatekeepers of organic carbon uptake is critical for deciphering carbon flux and fate in microbial ecosystems.
Affiliation(s)
- William F Schroer
- Department of Marine Sciences, University of Georgia, Athens, GA, 30602, USA
| | - Hannah E Kepner
- Department of Marine Sciences, University of Georgia, Athens, GA, 30602, USA
- College of Fisheries and Ocean Sciences, University of Alaska Fairbanks, Fairbanks, AK, 99775, USA
| | - Mario Uchimiya
- Complex Carbohydrate Research Center, University of Georgia, Athens, GA, 30602, USA
| | - Catalina Mejia
- Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, 32611, USA
| | | | - Christopher R Reisch
- Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, 32611, USA
| | - Mary Ann Moran
- Department of Marine Sciences, University of Georgia, Athens, GA, 30602, USA.
| |
|
31
|
Shan X, Goyal A, Gregor R, Cordero OX. Annotation-free discovery of functional groups in microbial communities. Nat Ecol Evol 2023; 7:716-724. [PMID: 36997739 DOI: 10.1038/s41559-023-02021-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 08/09/2022] [Accepted: 02/16/2023] [Indexed: 04/01/2023]
Abstract
Recent studies have shown that microbial communities are composed of groups of functionally cohesive taxa whose abundance is more stable and better-associated with metabolic fluxes than that of any individual taxon. However, identifying these functional groups in a manner that is independent of error-prone functional gene annotations remains a major open problem. Here we tackle this structure-function problem by developing a novel unsupervised approach that coarse-grains taxa into functional groups, solely on the basis of the patterns of statistical variation in species abundances and functional read-outs. We demonstrate the power of this approach on three distinct datasets. On data of replicate microcosms with heterotrophic soil bacteria, our unsupervised algorithm recovered experimentally validated functional groups that divide metabolic labour and remain stable despite large variation in species composition. When leveraged against the ocean microbiome data, our approach discovered a functional group that combines aerobic and anaerobic ammonia oxidizers whose summed abundance tracks closely with nitrate concentrations in the water column. Finally, we show that our framework can enable the detection of species groups that are probably responsible for the production or consumption of metabolites abundant in animal gut microbiomes, serving as a hypothesis-generating tool for mechanistic studies. Overall, this work advances our understanding of structure-function relationships in complex microbiomes and provides a powerful approach to discover functional groups in an objective and systematic manner.
Affiliation(s)
- Xiaoyu Shan
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Akshit Goyal
- Physics of Living Systems, Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Rachel Gregor
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Otto X Cordero
- Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA.
| |
|
32
|
Mishra A, Singh L, Singh D. Unboxing the black box-one step forward to understand the soil microbiome: A systematic review. Microbial Ecology 2023; 85:669-683. [PMID: 35112151 PMCID: PMC9957845 DOI: 10.1007/s00248-022-01962-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/03/2021] [Accepted: 01/10/2022] [Indexed: 06/14/2023]
Abstract
Soil is one of the most important assets of the planet Earth, responsible for maintaining biodiversity and managing the ecosystem services of both managed and natural ecosystems. It encompasses a large proportion of microscopic biodiversity, including prokaryotes and microscopic eukaryotes. The soil microbiome is critical in managing soil functions, but its activities have received little recognition in a few systems, such as desert land and forest ecosystems. The soil microbiome is highly dependent on abiotic and biotic factors like pH, carbon content, soil structure, texture, and vegetation, but it can vary notably with ecosystems and their respective inhabitants. Thus, unboxing this black box is essential to comprehend the basic components adding to the soil systems and supported ecosystem services. Recent advancements in the field of molecular microbial ecology have delivered commanding tools to examine this genetic trove of soil biodiversity. The objective of this review is to provide a critical evaluation of the work on the soil microbiome, especially since the advent of NGS techniques. The review also focuses on advances in our understanding of soil communities, their interactions, and functional capabilities, along with understanding their role in maneuvering the biogeochemical cycle, while underlining and tapping the unprecedented metagenomics data to infer the ecological attributes of the yet undiscovered soil microbiome. This review focuses on key research directions that could shape the future of basic and applied research into the soil microbiome. This review has led us to understand that it is difficult to generalize that the soil microbiome plays a substantiated role in shaping the soil networks, and that it is indeed a vital resource for sustaining ecosystem functioning. Exploring the soil microbiome will help in unlocking its roles in various soil networks. It could be resourceful in exploring and forecasting its impacts on soil systems and for dealing with alleviating problems like rapid climate change.
Affiliation(s)
- Apurva Mishra
- Academy of Scientific and Innovative Research [AcSIR], Ghaziabad, 201002, India
- Environmental Biotechnology and Genomics Division, CSIR-National Environmental Engineering Research Institute, Nehru Marg, Nagpur, 440020, Maharashtra, India
| | - Lal Singh
- Environmental Biotechnology and Genomics Division, CSIR-National Environmental Engineering Research Institute, Nehru Marg, Nagpur, 440020, Maharashtra, India
| | - Dharmesh Singh
- Institute for Medical Microbiology, Immunology and Hygiene, Technical University of Munich, Trogerstrasse 30, 81675, Munich, Bavaria, Germany.
| |
|
33
|
Derry A, Altman RB. COLLAPSE: A representation learning framework for identification and characterization of protein structural sites. Protein Sci 2023; 32:e4541. [PMID: 36519247 PMCID: PMC9847082 DOI: 10.1002/pro.4541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2022] [Revised: 12/02/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
The identification and characterization of the structural sites which contribute to protein function are crucial for understanding biological mechanisms, evaluating disease risk, and developing targeted therapies. However, the quantity of known protein structures is rapidly outpacing our ability to functionally annotate them. Existing methods for function prediction either do not operate on local sites, suffer from high false positive or false negative rates, or require large site-specific training datasets, necessitating the development of new computational methods for annotating functional sites at scale. We present COLLAPSE (Compressed Latents Learned from Aligned Protein Structural Environments), a framework for learning deep representations of protein sites. COLLAPSE operates directly on the 3D positions of atoms surrounding a site and uses evolutionary relationships between homologous proteins as a self-supervision signal, enabling learned embeddings to implicitly capture structure-function relationships within each site. Our representations generalize across disparate tasks in a transfer learning context, achieving state-of-the-art performance on standardized benchmarks (protein-protein interactions and mutation stability) and on the prediction of functional sites from the Prosite database. We use COLLAPSE to search for similar sites across large protein datasets and to annotate proteins based on a database of known functional sites. These methods demonstrate that COLLAPSE is computationally efficient, tunable, and interpretable, providing a general-purpose platform for computational protein analysis.
Affiliation(s)
- Alexander Derry
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
| | - Russ B. Altman
- Department of Biomedical Data Science, Stanford University, Stanford, California, USA
- Departments of Bioengineering, Genetics, and Medicine, Stanford University, Stanford, California, USA
| |
|
34
|
Joshi P, Banerjee S, Hu X, Khade PM, Friedberg I. GOThresher: a program to remove annotation biases from protein function annotation datasets. Bioinformatics 2023; 39:6998200. [PMID: 36688705 DOI: 10.1093/bioinformatics/btad048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2022] [Revised: 11/30/2022] [Accepted: 01/20/2023] [Indexed: 01/24/2023] Open
Abstract
MOTIVATION: Advances in sequencing technologies have led to a surge in genomic data, although the functions of many gene products coded by these genes remain unknown. While in-depth, targeted experiments that determine the functions of these gene products are crucial and routinely performed, they fail to keep up with the inflow of novel genomic data. In an attempt to address this gap, high-throughput experiments are being conducted in which a large number of genes are investigated in a single study. The annotations generated as a result of these experiments are generally biased towards a small subset of less informative Gene Ontology (GO) terms. Identifying and removing biases from protein function annotation databases is important since biases impact our understanding of protein function by providing a poor picture of the annotation landscape. Additionally, as machine learning methods for predicting protein function are becoming increasingly prevalent, it is essential that they are trained on unbiased datasets. Therefore, it is not only crucial to be aware of biases, but also to judiciously remove them from annotation datasets. RESULTS: We introduce GOThresher, a Python tool that identifies and removes biases in function annotations from protein function annotation databases. AVAILABILITY AND IMPLEMENTATION: GOThresher is written in Python and released via PyPI (https://pypi.org/project/gothresher/) and on the Bioconda Anaconda channel (https://anaconda.org/bioconda/gothresher). The source code is hosted on GitHub (https://github.com/FriedbergLab/GOThresher) and distributed under the GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Parnal Joshi
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
| | - Sagnik Banerjee
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Department of Statistics, Iowa State University, Ames, IA 50011, USA
| | - Xiao Hu
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
| | - Pranav M Khade
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA; Roy J. Carver Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, IA 50011, USA
| | - Iddo Friedberg
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, IA 50011, USA.,Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
| |
Collapse
|
35
|
Kress A, Poch O, Lecompte O, Thompson JD. Real or fake? Measuring the impact of protein annotation errors on estimates of domain gain and loss events. Frontiers in Bioinformatics 2023; 3:1178926. [PMID: 37151482 PMCID: PMC10158824 DOI: 10.3389/fbinf.2023.1178926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Accepted: 04/05/2023] [Indexed: 05/09/2023] Open
Abstract
Protein annotation errors can have significant consequences in a wide range of fields, ranging from protein structure and function prediction to biomedical research, drug discovery, and biotechnology. By comparing the domains of different proteins, scientists can identify common domains, classify proteins based on their domain architecture, and highlight proteins that have evolved differently in one or more species or clades. However, genome-wide identification of different protein domain architectures involves a complex error-prone pipeline that includes genome sequencing, prediction of gene exon/intron structures, and inference of protein sequences and domain annotations. Here we developed an automated fact-checking approach to distinguish true domain loss/gain events from false events caused by errors that occur during the annotation process. Using genome-wide ortholog sets and taking advantage of the high-quality human and Saccharomyces cerevisiae genome annotations, we analyzed the domain gain and loss events in the predicted proteomes of 9 non-human primates (NHP) and 20 non-S. cerevisiae fungi (NSF) as annotated in the UniProt and InterPro databases. Our approach allowed us to quantify the impact of errors on estimates of protein domain gains and losses, and we show that domain losses are over-estimated ten-fold and three-fold in the NHP and NSF proteins respectively. This is in line with previous studies of gene-level losses, where issues with genome sequencing or gene annotation led to genes being falsely inferred as absent. In addition, we show that inconsistent protein domain annotations are a major factor contributing to the false events. For the first time, to our knowledge, we show that domain gains are also over-estimated by three-fold and two-fold respectively in NHP and NSF proteins.
Based on our more accurate estimates, we infer that true domain losses and gains in NHP with respect to humans are observed at similar rates, while domain gains in the more divergent NSF are observed twice as frequently as domain losses with respect to S. cerevisiae. This study highlights the need to critically examine the scientific validity of protein annotations, and represents a significant step toward scalable computational fact-checking methods that may one day mitigate the propagation of wrong information in protein databases.
|
36
|
Yokoi Y, Kawabuchi Y, Zulmajdi AA, Tanaka R, Shibata T, Muraoka T, Mori T. Cell-Penetrating Peptide-Peptide Nucleic Acid Conjugates as a Tool for Protein Functional Elucidation in the Native Bacterium. Molecules (Basel, Switzerland) 2022; 27:molecules27248944. [PMID: 36558072 PMCID: PMC9788395 DOI: 10.3390/molecules27248944] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/03/2022] [Revised: 12/12/2022] [Accepted: 12/12/2022] [Indexed: 12/23/2022]
Abstract
Approximately 30% or more of the total proteins annotated from sequenced bacterial genomes are annotated as hypothetical or uncharacterized proteins. However, elucidation of the function of these proteins is hindered by the lack of simple and rapid screening methods, particularly with novel or hard-to-transform bacteria. In this report, we employed cell-penetrating peptide (CPP)-peptide nucleic acid (PNA) conjugates to elucidate the function of such uncharacterized proteins in vivo within the native bacterium. Paenibacillus, a hard-to-transform bacterial genus, was used as a model. Two hypothetical genes showing amino acid sequence similarity to ι-carrageenases, termed cgiA and cgiB, were identified from the draft genome of Paenibacillus sp. strain YYML68, and CPP-PNA probes targeting the mRNA of the acyl carrier protein gene, acpP, and the two ι-carrageenase candidate genes were synthesized. Upon direct incubation with CPP-PNA targeting the mRNA of the acpP gene, we successfully observed growth inhibition of strain YYML68 in a concentration-dependent manner. Similarly, the functions of both candidate ι-carrageenases were also inhibited using our CPP-PNA probes, allowing for the confirmation and characterization of these hypothetical proteins. In summary, we believe that CPP-PNA conjugates can serve as a simple and efficient alternative approach to characterize proteins in the native bacterium.
Affiliation(s)
- Yasuhito Yokoi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Yugo Kawabuchi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Abdullah Adham Zulmajdi
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Reiji Tanaka
  - Department of Life Sciences, Graduate School of Bioresources, Mie University, 1577 Kurima-machiya-cho, Tsu-shi 514-8507, Mie, Japan
- Toshiyuki Shibata
  - Department of Life Sciences, Graduate School of Bioresources, Mie University, 1577 Kurima-machiya-cho, Tsu-shi 514-8507, Mie, Japan
- Takahiro Muraoka
  - Department of Applied Chemistry, Graduate School of Engineering, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
- Tetsushi Mori
  - Department of Biotechnology and Life Science, Tokyo University of Agriculture and Technology, 2-24-16 Naka-cho, Koganei-shi 184-8588, Tokyo, Japan
  - Correspondence:
|
37
|
Tsvik L, Steiner B, Herzog P, Haltrich D, Sützl L. Flavin Mononucleotide-Dependent L-Lactate Dehydrogenases: Expanding the Toolbox of Enzymes for L-Lactate Biosensors. ACS Omega 2022; 7:41480-41492. [PMID: 36406534 PMCID: PMC9670274 DOI: 10.1021/acsomega.2c05257] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/16/2022] [Accepted: 10/19/2022] [Indexed: 06/16/2023]
Abstract
The development of L-lactate biosensors has been hampered in recent years by the limited availability and knowledge of a wider and more diverse range of L-lactate-oxidizing enzymes that can be used as bioelements in these sensors. For decades, L-lactate oxidase of Aerococcus viridans (AvLOx) has been used almost exclusively in the field of L-lactate biosensor development and has achieved something like a monopoly status as a biocatalyst for these applications. Studies on other L-lactate-oxidizing enzymes are sparse and are often missing biochemical data. In this work, we made use of the vast amount of sequence information that is currently available in protein databases to investigate the naturally occurring diversity of L-lactate-utilizing enzymes of the flavin mononucleotide (FMN)-dependent α-hydroxy acid oxidoreductase (HAOx) family. We identified the HAOx sequence space specific for L-lactate oxidation and additionally discovered a not-yet-described class of soluble and FMN-dependent L-lactate dehydrogenases, which are promising for the construction of second-generation biosensors or other biotechnological applications. Our work paves the way for new studies on α-hydroxy acid biosensors and proves that there is more to the HAOx family than AvLOx.
Affiliation(s)
- Lidiia Tsvik
  - Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
- Beate Steiner
  - DirectSens Biosensors GmbH, Am Rosenbühel 38, 3400 Klosterneuburg, Austria
- Peter Herzog
  - DirectSens Biosensors GmbH, Am Rosenbühel 38, 3400 Klosterneuburg, Austria
- Dietmar Haltrich
  - Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
- Leander Sützl
  - Laboratory of Food Biotechnology, Department of Food Science and Technology, University of Natural Resources and Life Sciences, Muthgasse 11, A-1190 Wien, Vienna, Austria
|
38
|
Wackett LP. Toward a molecular understanding of fluoride stress in a model Pseudomonas strain. Environ Microbiol 2022; 24:4981-4983. [PMID: 35848109 PMCID: PMC9795876 DOI: 10.1111/1462-2920.16114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2022] [Accepted: 06/21/2022] [Indexed: 12/30/2022]
Affiliation(s)
- Lawrence P. Wackett
  - Department of Biochemistry, Molecular Biology and Biophysics and BioTechnology Institute, University of Minnesota, St. Paul, Minnesota, USA
|
39
|
Goudey B, Geard N, Verspoor K, Zobel J. Propagation, detection and correction of errors using the sequence database network. Brief Bioinform 2022; 23:6764545. [PMID: 36266246 PMCID: PMC9677457 DOI: 10.1093/bib/bbac416] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/10/2022] [Revised: 07/31/2022] [Accepted: 08/28/2022] [Indexed: 12/14/2022] Open
Abstract
Nucleotide and protein sequences stored in public databases are the cornerstone of many bioinformatics analyses. The records containing these sequences are prone to a wide range of errors, including incorrect functional annotation, sequence contamination and taxonomic misclassification. One source of information that can help to detect errors is the strong interdependency between records. Novel sequences in one database draw their annotations from existing records, may generate new records in multiple other locations and will have varying degrees of similarity with existing records across a range of attributes. A network perspective of these relationships between sequence records, within and across databases, offers new opportunities to detect, or even correct, erroneous entries and more broadly to make inferences about record quality. Here, we describe this novel perspective of sequence database records as a rich network, which we call the sequence database network, and illustrate the opportunities this perspective offers for quantification of database quality and detection of spurious entries. We provide an overview of the relevant databases and describe how the interdependencies between sequence records across these databases can be exploited by network analyses. We review the process of sequence annotation and provide a classification of sources of error, highlighting propagation as a major source. We illustrate the value of a network perspective through three case studies that use network analysis to detect errors, and explore the quality and quantity of critical relationships that would inform such network analyses. This systematic description of a network perspective of sequence database records provides a novel direction to combat the proliferation of errors within these critical bioinformatics resources.
Affiliation(s)
- Benjamin Goudey
  - Corresponding author. Benjamin Goudey, School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Nicholas Geard
  - School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
- Karin Verspoor
  - School of Computing Technologies, RMIT University Melbourne, Victoria, 3000
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne Parkville, Victoria, 3010
|
40
|
Rahman MA, Heme UH, Parvez MAK. In silico functional annotation of hypothetical proteins from the Bacillus paralicheniformis strain Bac84 reveals proteins with biotechnological potentials and adaptational functions to extreme environments. PLoS One 2022; 17:e0276085. [PMID: 36228026 PMCID: PMC9560612 DOI: 10.1371/journal.pone.0276085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Accepted: 09/28/2022] [Indexed: 11/26/2022] Open
Abstract
Members of the Bacillus genus are industrial cell factories due to their capacity to secrete significant quantities of biomolecules with industrial applications. The Bacillus paralicheniformis strain Bac84 was isolated from the Red Sea and it shares a close evolutionary relationship with Bacillus licheniformis. However, a significant number of proteins in its genome are annotated as functionally uncharacterized hypothetical proteins. Investigating these proteins' functions may help us better understand how bacteria survive extreme environmental conditions and to find novel targets for biotechnological applications. Therefore, the purpose of our research was to functionally annotate the hypothetical proteins from the genome of B. paralicheniformis strain Bac84. We employed a structured in-silico approach incorporating numerous bioinformatics tools and databases for functional annotation, physicochemical characterization, subcellular localization, protein-protein interactions, and three-dimensional structure determination. Sequences of 414 hypothetical proteins were evaluated and we were able to successfully attribute a function to 37 hypothetical proteins. Moreover, we performed receiver operating characteristic analysis to assess the performance of various tools used in this present study. We identified 12 proteins having significant adaptational roles to unfavorable environments such as sporulation, formation of biofilm, motility, regulation of transcription, etc. Additionally, 8 proteins were predicted with biotechnological potentials such as coenzyme A biosynthesis, phenylalanine biosynthesis, rare-sugars biosynthesis, antibiotic biosynthesis, bioremediation, and others. Evaluation of the performance of the tools showed an accuracy of 98% which represented the rationality of the tools used. This work shows that this annotation strategy will make the functional characterization of unknown proteins easier and can find the target for further investigation. 
The knowledge of these hypothetical proteins' potential functions aids B. paralicheniformis strain Bac84 in effectively creating a new biotechnological target. In addition, the results may also facilitate a better understanding of the survival mechanisms in harsh environmental conditions.
Affiliation(s)
- Md. Atikur Rahman
  - Institute of Microbiology, Friedrich Schiller University Jena, Thuringia, Germany
- Uzma Habiba Heme
  - Faculty of Biological Sciences, Friedrich Schiller University Jena, Thuringia, Germany
|
41
|
Abdullah-Zawawi MR, Govender N, Harun S, Muhammad NAN, Zainal Z, Mohamed-Hussein ZA. Multi-Omics Approaches and Resources for Systems-Level Gene Function Prediction in the Plant Kingdom. Plants (Basel, Switzerland) 2022; 11:2614. [PMID: 36235479 PMCID: PMC9573505 DOI: 10.3390/plants11192614] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/29/2022] [Revised: 09/05/2022] [Accepted: 09/13/2022] [Indexed: 06/16/2023]
Abstract
In higher plants, the complexity of a system and the components within and among species are rapidly dissected by omics technologies. Multi-omics datasets are integrated to infer and enable a comprehensive understanding of the life processes of organisms of interest. Further, growing open-source datasets coupled with the emergence of high-performance computing and development of computational tools for biological sciences have assisted in silico functional prediction of unknown genes, proteins and metabolites, otherwise known as uncharacterized. The systems biology approach includes data collection and filtration, system modelling, experimentation and the establishment of new hypotheses for experimental validation. Informatics technologies add meaningful sense to the output generated by complex bioinformatics algorithms, which are now freely available in a user-friendly graphical user interface. These resources accentuate gene function prediction at a relatively minimal cost and effort. Herein, we present a comprehensive view of relevant approaches available for system-level gene function prediction in the plant kingdom. Together, the most recent applications and sought-after principles for gene mining are discussed to benefit the plant research community. A realistic tabulation of plant genomic resources is included for a less laborious and accurate candidate gene discovery in basic plant research and improvement strategies.
Affiliation(s)
- Muhammad-Redha Abdullah-Zawawi
  - UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur 56000, Malaysia
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Nisha Govender
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Sarahani Harun
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Nor Azlan Nor Muhammad
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Zamri Zainal
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
  - Faculty of Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
- Zeti-Azura Mohamed-Hussein
  - Institute of System Biology (INBIOSIS), Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
  - Faculty of Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi 43600, Malaysia
|
42
|
Rhee KY, Jansen RS, Grundner C. Activity-based annotation: the emergence of systems biochemistry. Trends Biochem Sci 2022; 47:785-794. [PMID: 35430135 PMCID: PMC9378515 DOI: 10.1016/j.tibs.2022.03.017] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/11/2022] [Revised: 03/10/2022] [Accepted: 03/22/2022] [Indexed: 01/21/2023]
Abstract
Current tools to annotate protein function have failed to keep pace with the speed of DNA sequencing and exponentially growing number of proteins of unknown function (PUFs). A major contributing factor to this mismatch is the historical lack of high-throughput methods to experimentally determine biochemical activity. Activity-based methods, such as activity-based metabolite and protein profiling, are emerging as new approaches for unbiased, global, biochemical annotation of protein function. In this review, we highlight recent experimental, activity-based approaches that offer new opportunities to determine protein function in a biologically agnostic and systems-level manner.
Affiliation(s)
- Kyu Y Rhee
  - Department of Medicine, Weill Cornell Medical College, New York, NY, USA.
- Robert S Jansen
  - Department of Microbiology, Radboud University, Nijmegen, The Netherlands.
- Christoph Grundner
  - Center for Global Infectious Disease Research, Seattle Children's Research Institute, Seattle, WA, USA; Department of Global Health, University of Washington, Seattle, WA, USA; Department of Pediatrics, University of Washington, Seattle, WA, USA.
|
43
|
Kruse LH, Weigle AT, Irfan M, Martínez-Gómez J, Chobirko JD, Schaffer JE, Bennett AA, Specht CD, Jez JM, Shukla D, Moghe GD. Orthology-based analysis helps map evolutionary diversification and predict substrate class use of BAHD acyltransferases. The Plant Journal: for Cell and Molecular Biology 2022; 111:1453-1468. [PMID: 35816116 DOI: 10.1111/tpj.15902] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 03/02/2022] [Revised: 06/15/2022] [Accepted: 07/05/2022] [Indexed: 06/15/2023]
Abstract
Large enzyme families catalyze metabolic diversification by virtue of their ability to use diverse chemical scaffolds. How enzyme families attain such functional diversity is not clear. Furthermore, duplication and promiscuity in such enzyme families limit their functional prediction, which has produced a burgeoning set of incompletely annotated genes in plant genomes. Here, we address these challenges using BAHD acyltransferases as a model. This fast-evolving family expanded drastically in land plants, increasing from one to five copies in algae to approximately 100 copies in diploid angiosperm genomes. Compilation of >160 published activities helped visualize the chemical space occupied by this family and define eight different classes based on structural similarities between acceptor substrates. Using orthologous groups (OGs) across 52 sequenced plant genomes, we developed a method to predict BAHD acceptor substrate class utilization as well as origins of individual BAHD OGs in plant evolution. This method was validated using six novel and 28 previously characterized enzymes and helped improve putative substrate class predictions for BAHDs in the tomato genome. Our results also revealed that while cuticular wax and lignin biosynthetic activities were more ancient, anthocyanin acylation activity was fixed in BAHDs later, near the origin of angiosperms. The OG-based analysis enabled identification of signature motifs in anthocyanin-acylating BAHDs, whose importance was validated via molecular dynamics simulations, site-directed mutagenesis and kinetic assays. Our results not only describe how BAHDs contributed to evolution of multiple chemical phenotypes in the plant world but also propose a biocuration-enabled approach for improved functional annotation of plant enzyme families.
Affiliation(s)
- Lars H Kruse
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
- Austin T Weigle
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
- Mohammad Irfan
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
- Jesús Martínez-Gómez
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
  - L.H. Bailey Hortorium, Cornell University, Ithaca, New York, 14853, USA
- Jason D Chobirko
  - Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
- Jason E Schaffer
  - Department of Biology, Washington University in St. Louis, St. Louis, Missouri, 63130, USA
- Alexandra A Bennett
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
- Chelsea D Specht
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
  - L.H. Bailey Hortorium, Cornell University, Ithaca, New York, 14853, USA
- Joseph M Jez
  - Department of Biology, Washington University in St. Louis, St. Louis, Missouri, 63130, USA
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61801, USA
- Gaurav D Moghe
  - Plant Biology Section, School of Integrative Plant Sciences, Cornell University, Ithaca, New York, 14853, USA
|
44
|
Yuvaraj I, Chaudhary SK, Jeyakanthan J, Sekar K. Structure of the hypothetical protein TTHA1873 from Thermus thermophilus. Acta Crystallogr F Struct Biol Commun 2022; 78:338-346. [PMID: 36048084 PMCID: PMC9435673 DOI: 10.1107/s2053230x22008457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 08/23/2022] [Indexed: 11/10/2022] Open
Abstract
The crystal structure of an uncharacterized hypothetical protein, TTHA1873 from Thermus thermophilus, has been determined by X-ray crystallography to a resolution of 1.78 Å using the single-wavelength anomalous dispersion method. The protein crystallized as a dimer in two space groups: P4₃2₁2 and P6₁22. Structural analysis of the hypothetical protein revealed that the overall fold of TTHA1873 has a β-sandwich jelly-roll topology with nine β-strands. TTHA1873 is a dimeric metal-binding protein that binds to two Ca²⁺ ions per chain, with one on the surface and the other stabilizing the dimeric interface of the two chains. A structural homology search indicates that the protein has moderate structural similarity to one domain of cell-surface proteins or agglutinin receptor proteins. Red blood cells showed visible agglutination at high concentrations of the hypothetical protein.
Affiliation(s)
- I. Yuvaraj
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
- Santosh Kumar Chaudhary
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
- J. Jeyakanthan
  - Structural Biology and Bio Computing Laboratory, Department of Bioinformatics, Alagappa University, Karaikudi 630 004, India
- K. Sekar
  - Department of Computational and Data Sciences, Indian Institute of Science, Bangalore 560 012, India
|
45
|
Innovative Hybrid-Alignment Annotation Method for Bioinformatics Identification and Functional Verification of a Novel Nitric Oxide Synthase in Trichomonas vaginalis. Biology 2022; 11:biology11081210. [PMID: 36009837 PMCID: PMC9404748 DOI: 10.3390/biology11081210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2022] [Revised: 08/06/2022] [Accepted: 08/08/2022] [Indexed: 11/17/2022]
Abstract
Simple Summary: Both the annotation and identification of genes in pathogenic parasites remain challenging. As a survival factor, nitric oxide (NO) has been proven to be synthesized in Trichomonas vaginalis (TV). However, nitric oxide synthase (NOS) has not yet been annotated in the TV genome. By aligning whole coding sequences of TV against a thousand sequences of known proteins from other organisms via the Smith–Waterman and Needleman–Wunsch algorithms, we developed a witness-to-suspect strategy to identify incorrectly annotated genes in TV. A novel NOS of TV (TV NOS) with a high witness-to-suspect ratio, which was originally annotated as a hydrogenase in the NCBI database, was successfully identified. We then performed in silico modeling of the protein structure and the molecular docking of all cofactors (NADPH, tetrahydrobiopterin (BH4), heme and flavin adenine dinucleotide (FAD)), cloned the gene, expressed and purified the protein, and ultimately performed mass spectrometry analysis and enzymatic activity assays. We clearly showed that although the predicted structure of TV NOS is not similar to that of NOS proteins of other species, all cofactor-binding motifs can interact with their ligands with high affinities. Most importantly, the purified protein is a functional NOS, as it has a high enzymatic activity for generating NO in vitro. This study provides an innovative approach to identify incorrectly annotated genes.
Abstract: Both the annotation and identification of genes in pathogenic parasites are still challenging. Although, as a survival factor, nitric oxide (NO) has been proven to be synthesized in Trichomonas vaginalis (TV), nitric oxide synthase (NOS) has not yet been annotated in the TV genome. We developed a witness-to-suspect strategy to identify incorrectly annotated genes in TV via the Smith–Waterman and Needleman–Wunsch algorithms through in-depth and repeated alignment of whole coding sequences of TV against thousands of sequences of known proteins from other organisms. A novel NOS of TV (TV NOS), which was annotated as hydrogenase in the NCBI database, was successfully identified; this TV NOS had a high witness-to-suspect ratio and contained all the NOS cofactor-binding motifs (NADPH, tetrahydrobiopterin (BH4), heme and flavin adenine dinucleotide (FAD) motifs). To confirm this identification, we performed in silico modeling of the protein structure and cofactor docking, cloned the gene, expressed and purified the protein, performed mass spectrometry analysis, and ultimately performed an assay to measure enzymatic activity. Our data showed that although the predicted structure of the TV NOS protein was not similar to the structure of NOSs of other species, all cofactor-binding motifs could interact with their ligands with high affinities. We clearly showed that the purified protein had high enzymatic activity for generating NO in vitro. This study provides an innovative approach to identify incorrectly annotated genes in TV and highlights a novel NOS that might serve as a virulence factor of TV.
|
46
|
de Crécy-Lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:6663924. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of ‘big data’. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3–4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Affiliation(s)
- Valérie de Crécy-lagard
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
- Jill Babor
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Ian Blaby
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Crysten Blaby-Haas
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
- Alan J Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Stephen K Burley
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
- Stacey Cleveland
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Lucy J Colwell
- Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
- Ana Conesa
- Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
- Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
- Antoine Danchin
- School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
- Anita de Waard
- Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
- Adam Deutschbauer
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Raquel Dias
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Yousong Ding
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
- Gang Fang
- NYU-Shanghai, Shanghai 200120, China
- Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
- John Gerlt
- Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Joshua Goldford
- Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Mark Gorelik
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Christopher Henry
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
- Geoffrey Hutinet
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Marshall Jaroch
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Aron Marchler-Bauer
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Claire McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
- Gaurav D Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
- Paul Monaghan
- Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
- Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
- Christopher J Mungall
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Darren A Natale
- Georgetown University Medical Center, Washington, DC 20007, USA
- William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
- Seán O’Donoghue
- School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
- Christine Orengo
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
- Colbie Reed
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
- Dmitri Rodionov
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
- Irina A Rodionova
- Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
- Jeffrey D Rudolf
- Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
- Lana Saleh
- New England Biolabs, Ipswich, MA 01938, USA
- Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
- Francoise Thibaud-Nissen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
- Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
- Peter Uetz
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
- David Vallenet
- LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
- Erica Watson Carter
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
- Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
- Elisha M Wood-Charlson
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Jin Xu
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
47
Grimplet J. Genomic and Bioinformatic Resources for Perennial Fruit Species. Curr Genomics 2022; 23:217-233. [PMID: 36777875 PMCID: PMC9875543 DOI: 10.2174/1389202923666220428102632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Revised: 03/12/2022] [Accepted: 03/12/2022] [Indexed: 11/22/2022] Open
Abstract
In the post-genomic era, data management and the development of bioinformatic tools are critical for the adequate exploitation of genomics data. In this review, we address the current situation for the subset of crops represented by the perennial fruit species. The agronomical singularity of these species compared to plant and crop model species poses significant challenges for the implementation of good practices generally not addressed in other species. Studies are usually performed over several years in non-controlled environments, usage of rootstock is common, and breeders rely heavily on vegetative propagation. A reference genome is now available for all the major species as well as many members of the economically important genera for breeding purposes. Development of pangenomes for these species is beginning to gain momentum, which will require a substantial effort in terms of bioinformatic tool development. The available tools for genome annotation and functional analysis are also presented.
Affiliation(s)
- Jérôme Grimplet
- Centro de Investigación y Tecnología Agroalimentaria de Aragón (CITA), Unidad de Hortofruticultura, Gobierno de Aragón, Avda. Montañana, Zaragoza, Spain; Instituto Agroalimentario de Aragón–IA2 (CITA-Universidad de Zaragoza), Calle Miguel Servet, Zaragoza, Spain. Corresponding author; Tel: +34976713635
48
Escudeiro P, Henry CS, Dias RP. Functional characterization of prokaryotic dark matter: the road so far and what lies ahead. Current Research in Microbial Sciences 2022; 3:100159. [PMID: 36561390 PMCID: PMC9764257 DOI: 10.1016/j.crmicr.2022.100159] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/11/2022] [Revised: 07/18/2022] [Accepted: 08/05/2022] [Indexed: 12/25/2022] Open
Abstract
Eight hundred thousand to one trillion prokaryotic species may inhabit our planet. Yet, fewer than two hundred thousand prokaryotic species have been described. This uncharted fraction of microbial diversity, and its undisclosed coding potential, is known as the "microbial dark matter" (MDM). Next-generation sequencing has made it possible to collect a massive amount of genome sequence data, leading to unprecedented advances in the field of genomics. Still, harnessing new functional information from the genomes of uncultured prokaryotes is often limited by standard classification methods. These methods often rely on sequence similarity searches against reference genomes from cultured species. This hinders the discovery of unique genetic elements that are missing from the cultivated realm. It also contributes to the accumulation of prokaryotic gene products of unknown function among public sequence data repositories, highlighting the need for new approaches for sequencing data analysis and classification. Increasing evidence indicates that these proteins of unknown function might be a treasure trove of biotechnological potential. Here, we outline the challenges, opportunities, and the potential hidden within the functional dark matter (FDM) of prokaryotes. We also discuss the pitfalls surrounding molecular and computational approaches currently used to probe these uncharted waters, and discuss future opportunities for research and applications.
Affiliation(s)
- Pedro Escudeiro
- BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- Christopher S. Henry
- Argonne National Laboratory, Lemont, Illinois, USA; University of Chicago, Chicago, Illinois, USA
- Ricardo P.M. Dias
- BioISI - Instituto de Biosistemas e Ciências Integrativas, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal; iXLab - Innovation for National Biological Resilience, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal. Corresponding author.
49
Cremers G, Jetten MSM, Op den Camp HJM, Lücker S. Metascan: METabolic Analysis, SCreening and ANnotation of Metagenomes. Frontiers in Bioinformatics 2022; 2:861505. [PMID: 36304333 PMCID: PMC9580885 DOI: 10.3389/fbinf.2022.861505] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/24/2022] [Accepted: 05/30/2022] [Indexed: 12/03/2022] Open
Abstract
Large-scale next-generation metagenomic sequencing of complex environmental samples paves the way for detailed analysis of nutrient cycles in ecosystems. For such an analysis, large-scale unequivocal annotation is a prerequisite, which, however, is increasingly hampered by growing databases and analysis time. To this end, we created a hidden Markov model (HMM) database by clustering proteins according to their KEGG indexing. HMM profiles for key genes of specific metabolic pathways and nutrient cycles were organized in subsets to be able to analyze each important elemental cycle separately. An important motivation behind the clustered database was to enable a high degree of resolution for annotation, while decreasing database size and analysis time. Here, we present Metascan, a new tool that can fully annotate and analyze deeply sequenced samples with an average analysis time of 11 min per genome for a publicly available dataset containing 2,537 genomes, and 1.1 min per genome for nutrient cycle analysis of the same sample. Metascan easily detected general proteins like cytochromes and ferredoxins, and additional pmoCAB operons were identified that were overlooked in previous analyses. For a mock community, the BEACON (F1) score was 0.72–0.93 compared to the information in NCBI GenBank. In combination with the accompanying database, Metascan provides a fast and useful annotation and analysis tool, as demonstrated by our proof-of-principle analysis of a complex mock community metagenome.
50
Podrzaj L, Burtscher J, Domig KJ. Comparative Genomics Provides Insights Into Genetic Diversity of Clostridium tyrobutyricum and Potential Implications for Late Blowing Defects in Cheese. Front Microbiol 2022; 13:889551. [PMID: 35722315 PMCID: PMC9201417 DOI: 10.3389/fmicb.2022.889551] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/04/2022] [Accepted: 05/16/2022] [Indexed: 11/24/2022] Open
Abstract
Clostridium tyrobutyricum has been recognized as the main cause of late blowing defects (LBD) in cheese leading to considerable economic losses for the dairy industry. Although differences in spoilage ability among strains of this species have been acknowledged, potential links to the genetic diversity and functional traits remain unknown. In the present study, we aimed to investigate and characterize genomic variation, pan-genomic diversity and key traits of C. tyrobutyricum by comparing the genomes of 28 strains. A comparative genomics analysis revealed an “open” pangenome comprising 9,748 genes and a core genome of 1,179 genes shared by all test strains. Among those core genes, the majority of genes encode proteins related to translation, ribosomal structure and biogenesis, energy production and conversion, and amino acid metabolism. A large part of the accessory genome is composed of sets of unique, strain-specific genes ranging from about 5 to more than 980 genes. Furthermore, functional analysis revealed several strain-specific genes related to replication, recombination and repair, cell wall, membrane and envelope biogenesis, and defense mechanisms that might facilitate survival under stressful environmental conditions. Phylogenomic analysis divided strains into two clades: clade I contained human, mud, and silage isolates, whereas clade II comprised cheese and milk isolates. Notably, these two groups of isolates showed differences in certain hypothetical proteins, transcriptional regulators and ABC transporters involved in resistance to oxidative stress. To the best of our knowledge, this is the first study to provide comparative genomics of C. tyrobutyricum strains related to LBD. Importantly, the findings presented in this study highlight the broad genetic diversity of C. tyrobutyricum, which might help us understand the diversity in spoilage potential of C. tyrobutyricum in cheese and provide some clues for further exploring the gene modules responsible for the spoilage ability of this species.
Affiliation(s)
- Lucija Podrzaj
- Department of Food Science and Technology, Institute of Food Science, University of Natural Resources and Life Sciences, Vienna, Austria
- Johanna Burtscher
- Department of Food Science and Technology, Institute of Food Science, University of Natural Resources and Life Sciences, Vienna, Austria
- Konrad J Domig
- Department of Food Science and Technology, Institute of Food Science, University of Natural Resources and Life Sciences, Vienna, Austria