1
Galperin MY, Vera Alvarez R, Karamycheva S, Makarova KS, Wolf YI, Landsman D, Koonin EV. COG database update 2024. Nucleic Acids Res 2024:gkae983. [PMID: 39494517 DOI: 10.1093/nar/gkae983] [Received: 09/13/2024] [Revised: 10/08/2024] [Accepted: 10/14/2024] [Indexed: 11/05/2024] Open
Abstract
The Clusters of Orthologous Genes (COG) database, originally created in 1997, has been updated to reflect the constantly growing collection of completely sequenced prokaryotic genomes. This update increased the genome coverage from 1309 to 2296 species, including 2103 bacteria and 193 archaea, in most cases with a single representative genome per genus. This set covers all genera of bacteria and archaea that included organisms with 'complete genomes' as per NCBI databases in November 2023. The number of COGs has been expanded from 4877 to 4981, primarily by including protein families involved in bacterial protein secretion. Accordingly, COG pathways and functional groups now include secretion systems of types II through X, as well as Flp/Tad and type IV pili. These groupings allow straightforward identification and examination of the prokaryotic lineages that encompass (or lack) a particular secretion system. Other developments include improved annotations for the rRNA and tRNA modification proteins, multi-domain signal transduction proteins, and some previously uncharacterized protein families. The new version of COGs is available at https://www.ncbi.nlm.nih.gov/research/COG, as well as on the NCBI FTP site https://ftp.ncbi.nlm.nih.gov/pub/COG/, which also provides archived data from previous COG releases.
Affiliation(s)
- Michael Y Galperin
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Roberto Vera Alvarez
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Svetlana Karamycheva
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Kira S Makarova
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Yuri I Wolf
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- David Landsman
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
- Eugene V Koonin
  - Computational Biology Branch, Division of Intramural Research, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
2
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. bioRxiv 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for over 450 enzymes of unknown function from the model bacterium Escherichia coli using the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome.
Article Summary: Many proteins in any genome, ranging from 30 to 70%, lack an assigned function. This knowledge gap limits the full use of the vast available genomic data. Machine learning has shown promise in transferring functional knowledge from proteins of known functions to similar ones, but largely fails to predict novel functions not seen in its training data. Understanding these failures can guide the development of better machine-learning methods to help experts make accurate functional predictions for uncharacterized proteins.
3
Selim KA, Alva V. PII-like signaling proteins: a new paradigm in orchestrating cellular homeostasis. Curr Opin Microbiol 2024; 79:102453. [PMID: 38678827 DOI: 10.1016/j.mib.2024.102453] [Received: 10/03/2023] [Revised: 02/19/2024] [Accepted: 02/20/2024] [Indexed: 05/01/2024]
Abstract
Members of the PII superfamily are versatile, multitasking signaling proteins ubiquitously found in all domains of life. They adeptly monitor and synchronize the cell's carbon, nitrogen, energy, redox, and diurnal states, primarily by binding interdependently to adenyl-nucleotides, including charged nucleotides (ATP, ADP, and AMP) and second messengers such as cyclic adenosine monophosphate (cAMP), cyclic di-adenosine monophosphate (c-di-AMP), and S-adenosylmethionine-AMP (SAM-AMP). These proteins also undergo a variety of posttranslational modifications, such as phosphorylation, adenylation, uridylation, carboxylation, and disulfide bond formation, which further provide cues on the metabolic state of the cell. Serving as precise metabolic sensors, PII superfamily proteins transmit this information to diverse cellular targets, establishing dynamic regulatory assemblies that fine-tune cellular homeostasis. Recently discovered, PII-like proteins are emerging families of signaling proteins that, while related to canonical PII proteins, have evolved to fulfill a diverse range of cellular functions, many of which remain elusive. In this review, we focus on the evolution of PII-like proteins and summarize the molecular mechanisms governing the assembly dynamics of PII complexes, with a special emphasis on the PII-like protein SbtB.
Affiliation(s)
- Khaled A Selim
  - Microbiology / Molecular Physiology of Prokaryotes, Institute of Biology II, University of Freiburg, Schänzlestraße 1, 79104 Freiburg, Germany; Protein Evolution Department, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
- Vikram Alva
  - Protein Evolution Department, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany
4
Ibrahim IH. Metalloproteins and metalloproteomics in health and disease. Advances in Protein Chemistry and Structural Biology 2024; 141:123-176. [PMID: 38960472 DOI: 10.1016/bs.apcsb.2023.12.013] [Indexed: 07/05/2024]
Abstract
Metalloproteins represent more than one third of the human proteome, with huge variation in physiological functions and pathological implications, depending on the metal or metals involved and the tissue context. Their functions range from catalysis, bioenergetics, and redox reactions to DNA repair, cell proliferation, signaling, transport of vital elements, and immunity. Human metalloproteomic studies have revealed that many families of metalloproteins, along with individual metalloproteins, are dysregulated under several clinical conditions. Several kinds of interaction between redox-active and redox-inert metalloproteins are also observed in health and disease. Metalloprotein profiling shows distinct alterations in neurodegenerative diseases, cancer, inflammation, infection, and diabetes mellitus, among other diseases. This makes metalloproteins, either individually or as families, promising targets for several therapeutic approaches. Inhibitors and activators of metalloenzymes, metal chelators, and artificial metalloproteins could be versatile in the diagnosis and treatment of several diseases, in addition to other biomedical and industrial applications.
Affiliation(s)
- Iman Hassan Ibrahim
  - Department of Biochemistry and Molecular Biology, Faculty of Pharmacy (Girls), Al-Azhar University, Cairo, Egypt
5
Reed CJ, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, Hutinet G, de Crécy-Lagard V. Beyond blast: enabling microbiologists to better extract literature, taxonomic distributions and gene neighbourhood information for protein families. Microb Genom 2024; 10:001183. [PMID: 38323604 PMCID: PMC10926702 DOI: 10.1099/mgen.0.001183] [Received: 10/03/2023] [Accepted: 01/08/2024] [Indexed: 02/08/2024] Open
Abstract
Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most information published on members of a protein family in the least amount of time. To complement this workflow, web-based platforms allowing for the exploration of protein family members across sequenced genomes or for the analysis of gene neighbourhood information were reviewed for their versatility and ease of use. Recommendations that can be used for experimentalist users, as well as educators, are provided and integrated within a customized, publicly accessible Wiki.
Affiliation(s)
- Colbie J. Reed
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Rémi Denise
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - APC Microbiome Ireland, University College Cork, Cork, Ireland
- Jacob Hourihan
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Jill Babor
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Marshall Jaroch
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Maria Martinelli
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - Burnett School of Biomedical Sciences, University of Central Florida, Orlando, FL, USA
- Valérie de Crécy-Lagard
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - Department of Biology, Haverford College, Haverford, PA, USA
  - UF Genetics Institute, University of Florida, Gainesville, FL, USA
6
Reed CJ, Denise R, Hourihan J, Babor J, Jaroch M, Martinelli M, Hutinet G, de Crécy-Lagard V. Beyond Blast: Enabling Microbiologists to Better Extract Literature, Taxonomic Distributions and Gene Neighborhood Information for Protein Families. bioRxiv 2024:2023.05.03.539116. [PMID: 37205517 PMCID: PMC10187207 DOI: 10.1101/2023.05.03.539116] [Indexed: 05/21/2023]
Abstract
Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most published information on members of a protein family in the least amount of time. To complement this workflow, web-based platforms allowing for the exploration of protein family members across sequenced genomes or for the analysis of gene neighborhood information were reviewed for their versatility and ease of use. Recommendations that can be used for experimentalist users, as well as educators, are provided and integrated within a customized, publicly accessible Wiki.
Affiliation(s)
- Colbie J. Reed
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Rémi Denise
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Jacob Hourihan
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Jill Babor
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Marshall Jaroch
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Maria Martinelli
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
- Geoffrey Hutinet
  - Department of Biology, Haverford College, 370 Lancaster Avenue, Haverford, PA 19041, USA
- Valérie de Crécy-Lagard
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL 32611, USA
  - Department of Biology, Haverford College, 370 Lancaster Avenue, Haverford, PA 19041, USA
  - University of Florida Genetics Institute, Gainesville, FL 32610, USA
7
de Crécy-Lagard V, Swairjo MA. On the necessity to include multiple types of evidence when predicting molecular function of proteins. bioRxiv 2023:2023.12.18.571875. [PMID: 38187591 PMCID: PMC10769224 DOI: 10.1101/2023.12.18.571875] [Indexed: 01/09/2024]
Abstract
Machine learning-based platforms are currently revolutionizing many fields of molecular biology, including structure prediction for monomers or complexes, predicting the consequences of mutations, and predicting the functions of proteins. However, these platforms use training sets based on currently available knowledge and, in essence, are not built to discover novelty. Hence, claims of discovering novel functions for protein families using artificial intelligence should be carefully dissected, as the dangers of overprediction are real, as we show in a detailed analysis of the prediction made by Kim et al. [1] on the function of the YciO protein in the model organism Escherichia coli.
8
Kanesaki Y, Ogura M. RNA-seq analysis identified glucose-responsive genes and YqfO as a global regulator in Bacillus subtilis. BMC Res Notes 2021; 14:450. [PMID: 34906218 PMCID: PMC8670212 DOI: 10.1186/s13104-021-05869-1] [Received: 09/06/2021] [Accepted: 11/30/2021] [Indexed: 11/17/2022] Open
Abstract
Objective: We observed that the addition of glucose enhanced the expression of sigX and sigM, encoding extra-cytoplasmic function sigma factors in Bacillus subtilis. Several regulatory factors were identified for this phenomenon, including YqfO, CshA (RNA helicase), and YlxR (nucleoid-associated protein). Subsequently, the relationships among these regulators were analyzed. Among them, YqfO is conserved in many bacterial genomes and may function as a metal ion insertase or metal chaperone, but has been poorly characterized. Thus, to further characterize YqfO, we performed RNA sequencing (RNA-seq) analysis of YqfO in addition to CshA and YlxR.
Results: We first performed comparative RNA-seq to detect the glucose-responsive genes. Next, to determine the regulatory effects of YqfO in addition to CshA and YlxR, three pairs of comparative RNA-seq analyses were performed (yqfO/wt, cshA/wt, and ylxR/wt). We observed relatively large regulons (approximately 420, 780, and 180 for YqfO, CshA, and YlxR, respectively) and significant overlaps, indicating close relationships among the three regulators. This study is the first to reveal that YqfO functions as a global regulator in B. subtilis.
Supplementary Information: The online version contains supplementary material available at 10.1186/s13104-021-05869-1.
Affiliation(s)
- Yu Kanesaki
  - Research Institute of Green Science and Technology, Shizuoka University, 836 Ohya, Suruga-ku, Shizuoka, 422-8529, Japan
- Mitsuo Ogura
  - Institute of Oceanic Research and Development, Tokai University, 3-20-1 Orido Shimizu-ku, Shizuoka, 424-8610, Japan