1
Lara P, Gama-Castro S, Salgado H, Rioualen C, Tierrafría VH, Muñiz-Rascado LJ, Bonavides-Martínez C, Collado-Vides J. Flexible gold standards for transcription factor regulatory interactions in Escherichia coli K-12: architecture of evidence types. Front Genet 2024; 15:1353553. [PMID: 38505828 PMCID: PMC10949920 DOI: 10.3389/fgene.2024.1353553] [Received: 12/10/2023] [Accepted: 02/09/2024] [Indexed: 03/21/2024]
Abstract
Post-genomic implementations have expanded the experimental strategies to identify elements involved in the regulation of transcription initiation. Here, we present for the first time a detailed analysis of the sources of knowledge supporting the collection of transcriptional regulatory interactions (RIs) of Escherichia coli K-12. An RI groups the transcription factor, its effect (positive or negative), and the regulated target: a promoter, a gene, or a transcription unit. We improved the evidence codes so that specific methods are incorporated and classified into independent groups. On this basis we updated the computation of confidence levels (weak, strong, or confirmed) for the collection of RIs. These updates enabled us to map the RI set to the current collection of high-throughput (HT) TF-binding datasets from ChIP-seq, ChIP-exo, gSELEX and DAP-seq in RegulonDB, thereby enriching the evidence of close to one-quarter (1329) of the current total of 5446 RIs. Based on the new computational capabilities of our improved annotation of evidence sources, we can now analyze the internal architecture of evidence, its categories (experimental, classical, HT, computational), and confidence levels. This analysis shows that the joint contribution of HT and computational methods increases the overall fraction of reliable RIs (the sum of confirmed and strong evidence) from 49% to 71%. Thus, the current collection has 3912 reliable RIs, of which 2718 (70%) have classical evidence and can be used to benchmark novel HT methods. Users can selectively exclude the method they want to benchmark, or keep, for instance, only the confirmed interactions. The recovery of regulatory sites in RegulonDB by the different HT methods ranges from 33% by ChIP-exo to 76% by ChIP-seq, although, as discussed, many potential confounding factors limit the interpretation of these figures.
The collection of improvements reported here provides a solid foundation to incorporate new methods and data, and to further integrate the diverse sources of knowledge of the different components of the transcriptional regulatory network. There is no other genomic database that offers this comprehensive high-quality architecture of knowledge supporting a corpus of transcriptional regulatory interactions.
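The proportions quoted in this abstract can be recomputed from its own counts. The following is only a minimal arithmetic sketch: the four counts are taken from the abstract, while the variable names and the `percent` helper are illustrative assumptions and not part of RegulonDB.

```python
# Counts quoted in the abstract above; the names are illustrative only.
TOTAL_RIS = 5446               # current total of regulatory interactions
HT_SUPPORTED_RIS = 1329        # RIs whose evidence was enriched by HT datasets
RELIABLE_RIS = 3912            # RIs with confirmed or strong evidence
RELIABLE_CLASSICAL_RIS = 2718  # reliable RIs that have classical evidence


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


ht_share = percent(HT_SUPPORTED_RIS, TOTAL_RIS)
reliable_share = percent(RELIABLE_RIS, TOTAL_RIS)
classical_share = percent(RELIABLE_CLASSICAL_RIS, RELIABLE_RIS)

print(f"RIs with HT support: {ht_share}% of all RIs")
print(f"Reliable RIs: {reliable_share}% of all RIs")
print(f"Reliable RIs with classical evidence: {classical_share}% of reliable RIs")
```

Under these counts the HT-supported share is about 24.4% (close to one-quarter), the reliable share about 71.8%, and the classical share of reliable RIs about 69.5%, in line with the 71% and 70% stated in the abstract once rounding is taken into account.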
Affiliation(s)
- Paloma Lara
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Claire Rioualen
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Víctor H. Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Luis J. Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- César Bonavides-Martínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad S/N, Cuernavaca, Mexico
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Center for Genomic Regulation, The Barcelona Institute of Science and Technology, Universitat Pompeu Fabra, Barcelona, Spain
2
Salgado H, Gama-Castro S, Lara P, Mejia-Almonte C, Alarcón-Carranza G, López-Almazo AG, Betancourt-Figueroa F, Peña-Loredo P, Alquicira-Hernández S, Ledezma-Tejeida D, Arizmendi-Zagal L, Mendez-Hernandez F, Diaz-Gomez AK, Ochoa-Praxedis E, Muñiz-Rascado LJ, García-Sotelo JS, Flores-Gallegos FA, Gómez L, Bonavides-Martínez C, del Moral-Chávez VM, Hernández-Alvarez AJ, Santos-Zavaleta A, Capella-Gutierrez S, Gelpi JL, Collado-Vides J. RegulonDB v12.0: a comprehensive resource of transcriptional regulation in E. coli K-12. Nucleic Acids Res 2024; 52:D255-D264. [PMID: 37971353 PMCID: PMC10767902 DOI: 10.1093/nar/gkad1072] [Received: 09/15/2023] [Revised: 10/25/2023] [Accepted: 11/02/2023] [Indexed: 11/19/2023]
Abstract
RegulonDB is a database that contains the most comprehensive corpus of knowledge of the regulation of transcription initiation of Escherichia coli K-12, including data from both classical molecular biology and high-throughput methodologies. Here, we describe biological advances since our last NAR paper of 2019. We explain the changes to satisfy FAIR requirements. We also present a full reconstruction of the RegulonDB computational infrastructure, which has significantly improved data storage, retrieval and accessibility and thus supports a more intuitive and user-friendly experience. The integration of graphical tools provides clear visual representations of genetic regulation data, facilitating data interpretation and knowledge integration. RegulonDB version 12.0 can be accessed at https://regulondb.ccg.unam.mx.
Affiliation(s)
- Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Paloma Lara
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Citlalli Mejia-Almonte
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Gabriel Alarcón-Carranza
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Andrés G López-Almazo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Felipe Betancourt-Figueroa
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Pablo Peña-Loredo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Daniela Ledezma-Tejeida
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Lizeth Arizmendi-Zagal
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Francisco Mendez-Hernandez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Ana K Diaz-Gomez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Elizabeth Ochoa-Praxedis
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Luis J Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Jair S García-Sotelo
- Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Querétaro 76230, Querétaro, Mexico
- Fanny A Flores-Gallegos
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Laura Gómez
- Instituto Nacional de Medicina Genómica, Periférico Sur 4809, Arenal Tepepan, Tlalpan, 14610 Ciudad de México, Mexico
- Escuela de Medicina, Tecnológico de Monterrey, Campus Ciudad de México, CDMX 14380, México
- César Bonavides-Martínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Víctor M del Moral-Chávez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Alberto Santos-Zavaleta
- Instituto de Energías Renovables, Universidad Nacional Autónoma de México, Temixco, Morelos 62580, México
- Josep Lluis Gelpi
- Department of Biochemistry and Molecular Biomedicine. Univ. of Barcelona. Av. Diagonal 643, 08028, Barcelona, Spain
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra (UPF), Dr. Aiguader 88, Barcelona, 08003, Barcelona, Spain
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
- Centre for Genomic Regulation (CRG), Universitat Pompeu Fabra (UPF), Dr. Aiguader 88, Barcelona, 08003, Barcelona, Spain
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall. Boston, MA 02215, USA
3
Karp PD, Paley S, Caspi R, Kothari A, Krummenacker M, Midford PE, Moore LR, Subhraveti P, Gama-Castro S, Tierrafria VH, Lara P, Muñiz-Rascado L, Bonavides-Martinez C, Santos-Zavaleta A, Mackie A, Sun G, Ahn-Horst TA, Choi H, Covert MW, Collado-Vides J, Paulsen I. The EcoCyc Database (2023). EcoSal Plus 2023; 11:eesp00022023. [PMID: 37220074 DOI: 10.1128/ecosalplus.esp-0002-2023] [Received: 03/13/2023] [Accepted: 04/04/2023] [Indexed: 01/28/2024]
Abstract
EcoCyc is a bioinformatics database available online at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene product, metabolite, reaction, operon, and metabolic pathway. The database also includes information on the regulation of gene expression, E. coli gene essentiality, and nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for the analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc and can be executed online. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. Data generated from a whole-cell model that is parameterized from the latest data on EcoCyc are also available. This review outlines the data content of EcoCyc and the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Suzanne Paley
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Ron Caspi
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Anamika Kothari
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Markus Krummenacker
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Peter E Midford
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Lisa R Moore
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Pallavi Subhraveti
- Bioinformatics Research Group, SRI International, Menlo Park, California, USA
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Victor H Tierrafria
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Paloma Lara
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- César Bonavides-Martinez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Amanda Mackie
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, New South Wales, Australia
- Gwanggyu Sun
- Department of Bioengineering, Stanford University, Stanford, California, USA
- Travis A Ahn-Horst
- Department of Bioengineering, Stanford University, Stanford, California, USA
- Heejo Choi
- Department of Bioengineering, Stanford University, Stanford, California, USA
- Markus W Covert
- Department of Bioengineering, Stanford University, Stanford, California, USA
- Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Ian Paulsen
- School of Natural Sciences, Macquarie University, Sydney, New South Wales, Australia
4
Lara P, Gama-Castro S, Salgado H, Rioualen C, Tierrafría VH, Muñiz-Rascado LJ, Bonavides-Martínez C, Collado-Vides J. A Gold Standard for Transcription Factor Regulatory Interactions in Escherichia coli K-12: Architecture of Evidence Types. bioRxiv 2023:2023.02.25.530038. [PMID: 37163020 PMCID: PMC10168212 DOI: 10.1101/2023.02.25.530038] [Indexed: 05/11/2023]
Abstract
Post-genomic implementations have expanded the experimental strategies to identify elements involved in the regulation of transcription initiation. As new methodologies emerge, a natural step is to compare their results with those from established methodologies, such as the classic methods of molecular biology used to characterize transcription factor binding sites, promoters, or transcription units. In the case of Escherichia coli K-12, the best-studied microorganism, for the last 30 years we have continuously gathered such knowledge from original scientific publications, and have organized it in two databases, RegulonDB and EcoCyc. Furthermore, since RegulonDB version 11.0 (1), we offer comprehensive datasets of binding sites from chromatin immunoprecipitation combined with sequencing (ChIP-seq), ChIP combined with exonuclease digestion and next-generation sequencing (ChIP-exo), genomic SELEX screening (gSELEX), and DNA affinity purification sequencing (DAP-seq) HT technologies, as well as additional datasets for transcription start sites, transcription units and RNA sequencing (RNA-seq) expression profiles. Here, we present for the first time an analysis of the sources of knowledge supporting the collection of transcriptional regulatory interactions (RIs) of E. coli K-12. An RI is formed by the transcription factor, its positive or negative effect on a promoter, a gene or transcription unit. We improved the evidence codes so that the specific methods are described, and we classified them into seven independent groups. This is the basis for our updated computation of confidence levels, weak, strong, or confirmed, for the collection of RIs. We compare the confidence levels of the RI collection before and after adding HT evidence illustrating how knowledge will change as more HT data and methods appear in the future. 
Users can generate subsets that filter out the method they want to benchmark, thus avoiding circularity, or keep, for instance, only the confirmed interactions. The comparison of different HT methods with the available datasets indicates that ChIP-seq recovers the highest fraction (>70%) of binding sites present in RegulonDB, followed by gSELEX, DAP-seq and ChIP-exo. There is no other genomic database that offers this comprehensive high-quality anatomy of evidence supporting a corpus of transcriptional regulatory interactions.
5
Collado-Vides J. Regulatory promoter architectures in the hands of thermodynamic modelling. Nat Rev Genet 2023; 24:349. [PMID: 36747003 DOI: 10.1038/s41576-023-00578-w] [Indexed: 02/08/2023]
Affiliation(s)
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Morelos, Mexico
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
- Center for Genomic Regulation (CRG), Universitat Pompeu Fabra, Barcelona, Spain
6
Tierrafría VH, Rioualen C, Salgado H, Lara P, Gama-Castro S, Lally P, Gómez-Romero L, Peña-Loredo P, López-Almazo AG, Alarcón-Carranza G, Betancourt-Figueroa F, Alquicira-Hernández S, Polanco-Morelos JE, García-Sotelo J, Gaytan-Nuñez E, Méndez-Cruz CF, Muñiz LJ, Bonavides-Martínez C, Moreno-Hagelsieb G, Galagan JE, Wade JT, Collado-Vides J. RegulonDB 11.0: Comprehensive high-throughput datasets on transcriptional regulation in Escherichia coli K-12. Microb Genom 2022; 8. [PMID: 35584008 PMCID: PMC9465075 DOI: 10.1099/mgen.0.000833] [Indexed: 01/23/2023]
Abstract
Genomics has set the basis for a variety of methodologies that produce high-throughput datasets identifying the different players that define gene regulation, particularly regulation of transcription initiation and operon organization. These datasets are available in public repositories, such as the Gene Expression Omnibus, or ArrayExpress. However, accessing and navigating such a wealth of data is not straightforward. No resource currently exists that offers all available high and low-throughput data on transcriptional regulation in Escherichia coli K-12 to easily use both as whole datasets, or as individual interactions and regulatory elements. RegulonDB (https://regulondb.ccg.unam.mx) began gathering high-throughput dataset collections in 2009, starting with transcription start sites, then adding ChIP-seq and gSELEX in 2012, with up to 99 different experimental high-throughput datasets available in 2019. In this paper we present a radical upgrade to more than 2000 high-throughput datasets, processed to facilitate their comparison, introducing up-to-date collections of transcription termination sites, transcription units, as well as transcription factor binding interactions derived from ChIP-seq, ChIP-exo, gSELEX and DAP-seq experiments, besides expression profiles derived from RNA-seq experiments. For ChIP-seq experiments we offer both the data as presented by the authors, as well as data uniformly processed in-house, enhancing their comparability, as well as the traceability of the methods and reproducibility of the results. Furthermore, we have expanded the tools available for browsing and visualization across and within datasets. We include comparisons against previously existing knowledge in RegulonDB from classic experiments, a nucleotide-resolution genome viewer, and an interface that enables users to browse datasets by querying their metadata. 
A particular effort was made to automatically extract detailed experimental growth conditions by implementing an assisted curation strategy that applies natural language processing and machine learning. We provide summaries with the total number of interactions found in each experiment, as well as tools to identify common results among different experiments. This is a long-awaited resource to make use of such a wealth of knowledge and advance our understanding of the biology of the model bacterium E. coli K-12.
Affiliation(s)
- Víctor H Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215, USA
- Claire Rioualen
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Paloma Lara
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Patrick Lally
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215, USA
- Laura Gómez-Romero
- Instituto Nacional de Medicina Genómica, INMEGEN, Periférico Sur 4809, Arenal Tepepan, Tlalpan 14610, CDMX, Mexico
- Pablo Peña-Loredo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Andrés G López-Almazo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Gabriel Alarcón-Carranza
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Felipe Betancourt-Figueroa
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Shirley Alquicira-Hernández
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- J Enrique Polanco-Morelos
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Jair García-Sotelo
- Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Querétaro 76230, Querétaro, Mexico
- Estefani Gaytan-Nuñez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Carlos-Francisco Méndez-Cruz
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Luis J Muñiz
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- César Bonavides-Martínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Gabriel Moreno-Hagelsieb
- Department of Biology, Wilfrid Laurier University, 75 University Ave W, Waterloo, ON N2L 3C5, Canada
- James E Galagan
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215, USA
- Joseph T Wade
- Wadsworth Center, New York State Department of Health, Albany, NY, USA
- Department of Biomedical Sciences, University at Albany, SUNY, Albany, NY, USA
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n, Cuernavaca 62210, Morelos, Mexico
- Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Boston, MA 02215, USA
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Universitat Pompeu Fabra (UPF), Barcelona, Spain
7
Collado-Vides J, Gaudet P, de Lorenzo V. Missing Links Between Gene Function and Physiology in Genomics. Front Physiol 2022; 13:815874. [PMID: 35295568 PMCID: PMC8918662 DOI: 10.3389/fphys.2022.815874] [Received: 11/15/2021] [Accepted: 01/28/2022] [Indexed: 11/25/2022]
Abstract
Knowledge of biological organisms at the molecular level that has been gathered is now organized into databases, often within ontological frameworks. To enable computational comparisons of annotations across different genomes and organisms, controlled vocabularies have been essential, as is the case in the functional annotation classifications used for bacteria, such as MultiFun and the more widely used Gene Ontology. The function of individual gene products as well as the processes in which collections of them participate constitute a wealth of classes that describe the biological role of gene products in a large number of organisms in the three kingdoms of life. In this contribution, we highlight from a qualitative perspective some limitations of these frameworks and discuss challenges that need to be addressed to bridge the gap between annotation as currently captured by ontologies and databases and our understanding of the basic principles in the organization and functioning of organisms; we illustrate these challenges with some examples in bacteria. We hope that raising awareness of these issues will encourage users of Gene Ontology and similar ontologies to be careful about data interpretation and lead to improved data representation.
Affiliation(s)
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Centre for Genomic Regulation, The Barcelona Institute of Science and Technology, Universitat Pompeu Fabra, Barcelona, Spain
- *Correspondence: Julio Collado-Vides
- Pascale Gaudet
- SIB Swiss Institute of Bioinformatics, Swiss-Prot Group, Geneva, Switzerland
- Víctor de Lorenzo
- Department of Systems Biology, Centro Nacional de Biotecnología CSIC, Universidad Autónoma de Madrid, Madrid, Spain
8
Femerling G, Gama-Castro S, Lara P, Ledezma-Tejeida D, Tierrafría VH, Muñiz-Rascado L, Bonavides-Martínez C, Collado-Vides J. Sensory Systems and Transcriptional Regulation in Escherichia coli. Front Bioeng Biotechnol 2022; 10:823240. [PMID: 35237580 PMCID: PMC8882922 DOI: 10.3389/fbioe.2022.823240] [Received: 11/26/2021] [Accepted: 01/18/2022] [Indexed: 11/13/2022]
Abstract
In free-living bacteria, the ability to regulate gene expression is at the core of adapting and interacting with the environment. For these systems to have a logic, a signal must trigger a genetic change that helps the cell to deal with what implies its presence in the environment; briefly, the response is expected to include a feedback to the signal. Thus, it makes sense to think of genetic sensory mechanisms of gene regulation. Escherichia coli K-12 is the bacterium model for which the largest number of regulatory systems and its sensing capabilities have been studied in detail at the molecular level. In this special issue focused on biomolecular sensing systems, we offer an overview of the transcriptional regulatory corpus of knowledge for E. coli that has been gathered in our database, RegulonDB, from the perspective of sensing regulatory systems. Thus, we start with the beginning of the information flux, which is the signal’s chemical or physical elements detected by the cell as changes in the environment; these signals are internally transduced to transcription factors and alter their conformation. Signals transduced to effectors bind allosterically to transcription factors, and this defines the dominant sensing mechanism in E. coli. We offer an updated list of the repertoire of known allosteric effectors, as well as a list of the currently known different mechanisms of this sensing capability. Our previous definition of elementary genetic sensory-response units, GENSOR units for short, that integrate signals, transport, gene regulation, and the biochemical response of the regulated gene products of a given transcriptional factor fit perfectly with the purpose of this overview. We summarize the functional heterogeneity of their response, based on our updated collection of GENSORs, and we use them to identify the expected feedback as part of their response. Finally, we address the question of multiple sensing in the regulatory network of E. coli. 
This overview introduces the architecture of sensing and regulation of native components in E. coli K-12, which might be a source of inspiration for bioengineering applications.
Affiliation(s)
- Georgette Femerling
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Paloma Lara
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Víctor H. Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Luis Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Universitat Pompeu Fabra (UPF), Barcelona, Spain
- *Correspondence: Julio Collado-Vides
9
Díaz-Rodríguez M, Lithgow-Serrano O, Guadarrama-García F, Tierrafría VH, Gama-Castro S, Solano-Lira H, Salgado H, Rinaldi F, Méndez-Cruz CF, Collado-Vides J. Lisen&Curate: A platform to facilitate gathering textual evidence for curation of regulation of transcription initiation in bacteria. Biochim Biophys Acta Gene Regul Mech 2021; 1864:194753. [PMID: 34461312 PMCID: PMC10155859 DOI: 10.1016/j.bbagrm.2021.194753] [Received: 02/08/2021] [Revised: 07/12/2021] [Accepted: 08/25/2021] [Indexed: 10/20/2022]
Abstract
The number of published papers in biomedical research makes it nearly impossible for a researcher to keep up to date. This is where manually curated databases contribute by facilitating access to knowledge. However, the structure required by databases strongly limits the type of valuable information that can be incorporated. Here, we present Lisen&Curate, a curation system that facilitates linking sentences or parts of sentences (both considered sources) in articles with their corresponding curated objects, so that rich additional information about these objects is easily available to users. These sources will be offered both within RegulonDB and in a new database, L-Regulon. To show the relevance of our work, two senior curators performed a curation of 31 articles on the regulation of transcription initiation of E. coli using Lisen&Curate. As a result, 194 objects were curated and 781 sources were recorded. We also found that these sources are useful for developing automatic approaches to detect objects in articles by observing word frequency patterns and by carrying out an open information extraction task. Sources may help to elaborate a controlled vocabulary of experimental methods. Finally, we discuss our ecosystem of interconnected applications, RegulonDB, L-Regulon, and Lisen&Curate, to facilitate access to knowledge on regulation of transcription initiation in bacteria. We see our proposal as the starting point to change the way experimentalists connect a piece of knowledge with its evidence using RegulonDB.
Affiliation(s)
- Martín Díaz-Rodríguez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Oscar Lithgow-Serrano
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico; Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Polo universitario Lugano-Campus Est, Via la Santa 1, CH-6962 Lugano, Switzerland
- Francisco Guadarrama-García
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Víctor H Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Hilda Solano-Lira
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico
- Fabio Rinaldi
- Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Polo universitario Lugano-Campus Est, Via la Santa 1, CH-6962 Lugano, Switzerland; Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Carlos-Francisco Méndez-Cruz
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico.
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Avenida Universidad s/n Col. Chamilpa, 62210 Cuernavaca, Mor., Mexico; Department of Biomedical Engineering, Boston University, 44 Cummington Mall Room 403, 02215 Boston, MA, USA; Center for Genomic Regulation (CRG), Dr. Aiguader 88, 08003, Barcelona, Spain
10
Keseler IM, Gama-Castro S, Mackie A, Billington R, Bonavides-Martínez C, Caspi R, Kothari A, Krummenacker M, Midford PE, Muñiz-Rascado L, Ong WK, Paley S, Santos-Zavaleta A, Subhraveti P, Tierrafría VH, Wolfe AJ, Collado-Vides J, Paulsen IT, Karp PD. The EcoCyc Database in 2021. Front Microbiol 2021; 12:711077. [PMID: 34394059 PMCID: PMC8357350 DOI: 10.3389/fmicb.2021.711077] [Citation(s) in RCA: 99] [Impact Index Per Article: 33.0] [Received: 05/17/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022] Open
Abstract
The EcoCyc model-organism database collects and summarizes experimental data for Escherichia coli K-12. EcoCyc is regularly updated by the manual curation of individual database entries, such as genes, proteins, and metabolic pathways, and by the programmatic addition of results from select high-throughput analyses. Updates to the Pathway Tools software that supports EcoCyc and to the web interface that enables user access have continuously improved its usability and expanded its functionality. This article highlights recent improvements to the curated data in the areas of metabolism, transport, DNA repair, and regulation of gene expression. New and revised data analysis and visualization tools include an interactive metabolic network explorer, a circular genome viewer, and various improvements to the speed and usability of existing tools.
Affiliation(s)
- Ingrid M. Keseler
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Amanda Mackie
- Department of Molecular Sciences, Macquarie University, Sydney, NSW, Australia
- Richard Billington
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Ron Caspi
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Anamika Kothari
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Markus Krummenacker
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Peter E. Midford
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Luis Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Wai Kit Ong
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Suzanne Paley
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Alberto Santos-Zavaleta
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Instituto de Energías Renovables, Universidad Nacional Autónoma de México, Temixco, México
- Pallavi Subhraveti
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
- Víctor H. Tierrafría
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Alan J. Wolfe
- Department of Microbiology and Immunology, Stritch School of Medicine, Loyola University Chicago, Maywood, IL, United States
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Ian T. Paulsen
- Department of Molecular Sciences, Macquarie University, Sydney, NSW, Australia
- Peter D. Karp
- Bioinformatics Research Group, Artificial Intelligence Center, SRI International, Menlo Park, CA, United States
11
Méndez-Cruz CF, Blanchet A, Godínez A, Arroyo-Fernández I, Gama-Castro S, Martínez-Luna SB, González-Colín C, Collado-Vides J. Knowledge extraction for assisted curation of summaries of bacterial transcription factor properties. Database (Oxford) 2020; 2020:6029376. [PMID: 33306798 PMCID: PMC7731926 DOI: 10.1093/database/baaa109] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2020] [Revised: 11/18/2020] [Accepted: 11/26/2020] [Indexed: 11/21/2022]
Abstract
Transcription factors (TFs) play a main role in transcriptional regulation of bacteria, as they regulate transcription of the genetic information encoded in DNA. Thus, the curation of the properties of these regulatory proteins is essential for a better understanding of transcriptional regulation. However, traditional manual curation of article collections to compile descriptions of TF properties takes significant time and effort due to the overwhelming amount of biomedical literature, which increases every day. The development of automatic approaches for knowledge extraction to assist curation is therefore critical. Here, we show an effective approach for knowledge extraction to assist curation of summaries describing bacterial TF properties based on an automatic text summarization strategy. We were able to recover automatically a median 77% of the knowledge contained in manual summaries describing properties of 177 TFs of Escherichia coli K-12 by processing 5961 scientific articles. For 71% of the TFs, our approach extracted new knowledge that can be used to expand manual descriptions. Furthermore, as we trained our predictive model with manual summaries of E. coli, we also generated summaries for 185 TFs of Salmonella enterica serovar Typhimurium from 3498 articles. According to the manual curation of 10 of these Salmonella typhimurium summaries, 96% of their sentences contained relevant knowledge. Our results demonstrate the feasibility to assist manual curation to expand manual summaries with new knowledge automatically extracted and to create new summaries of bacteria for which these curation efforts do not exist. Database URL: The automatic summaries of the TFs of E. coli and Salmonella and the automatic summarizer are available in GitHub (https://github.com/laigen-unam/tf-properties-summarizer.git).
Affiliation(s)
- Carlos-Francisco Méndez-Cruz
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Antonio Blanchet
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Alan Godínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Ignacio Arroyo-Fernández
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico; División de Posgrado, Universidad Tecnológica de la Mixteca, Carretera a Acatlima Km. 2.5, Huajuapan de León, 69000, Oaxaca, Mexico
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Sara Berenice Martínez-Luna
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Cristian González-Colín
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Av. Universidad s/n, Colonia Chamilpa, Cuernavaca 62100, Morelos, Mexico; Department of Biomedical Engineering, Boston University, 44 Cummington Mall, Room 403, Boston, 02215 MA, USA
12
Mejía-Almonte C, Busby SJW, Wade JT, van Helden J, Arkin AP, Stormo GD, Eilbeck K, Palsson BO, Galagan JE, Collado-Vides J. Redefining fundamental concepts of transcription initiation in bacteria. Nat Rev Genet 2020; 21:699-714. [PMID: 32665585 PMCID: PMC7990032 DOI: 10.1038/s41576-020-0254-8] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Accepted: 05/29/2020] [Indexed: 12/15/2022]
Abstract
Despite enormous progress in understanding the fundamentals of bacterial gene regulation, our knowledge remains limited when compared with the number of bacterial genomes and regulatory systems to be discovered. Derived from a small number of initial studies, classic definitions for concepts of gene regulation have evolved as the number of characterized promoters has increased. Together with discoveries made using new technologies, this knowledge has led to revised generalizations and principles. In this Expert Recommendation, we suggest precise, updated definitions that support a logical, consistent conceptual framework of bacterial gene regulation, focusing on transcription initiation. The resulting concepts can be formalized by ontologies for computational modelling, laying the foundation for improved bioinformatics tools, knowledge-based resources and scientific communication. Thus, this work will help researchers construct better predictive models, with different formalisms, that will be useful in engineering, synthetic biology, microbiology and genetics.
Affiliation(s)
- Citlalli Mejía-Almonte
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Morelos, Cuernavaca, México
- Joseph T Wade
- Division of Genetics, Wadsworth Center, New York State Department of Health, Albany, NY, USA
- Jacques van Helden
- Aix-Marseille University, INSERM UMR S 1090, Theory and Approaches of Genome Complexity (TAGC), Marseille, France
- CNRS, Institut Français de Bioinformatique, IFB-core, UMS 3601, Evry, France
- Adam P Arkin
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA, USA
- Gary D Stormo
- Department of Genetics, Washington University School of Medicine, St Louis, MO, USA
- Karen Eilbeck
- Department of Biomedical Informatics, University of Utah School of Medicine, Salt Lake City, UT, USA
- Bernhard O Palsson
- Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA
- James E Galagan
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
- Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Morelos, Cuernavaca, México.
- Department of Biomedical Engineering, Boston University, Boston, MA, USA.
13
Santos-Zavaleta A, Salgado H, Gama-Castro S, Sánchez-Pérez M, Gómez-Romero L, Ledezma-Tejeida D, García-Sotelo JS, Alquicira-Hernández K, Muñiz-Rascado LJ, Peña-Loredo P, Ishida-Gutiérrez C, Velázquez-Ramírez DA, Del Moral-Chávez V, Bonavides-Martínez C, Méndez-Cruz CF, Galagan J, Collado-Vides J. RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12. Nucleic Acids Res 2019; 47:D212-D220. [PMID: 30395280 PMCID: PMC6324031 DOI: 10.1093/nar/gky1077] [Citation(s) in RCA: 217] [Impact Index Per Article: 54.3] [Received: 09/18/2018] [Accepted: 10/19/2018] [Indexed: 01/31/2023] Open
Abstract
RegulonDB, first published 20 years ago, is a comprehensive electronic resource about regulation of transcription initiation of Escherichia coli K-12 with decades of knowledge from classic molecular biology experiments, and recently also from high-throughput genomic methodologies. We curated the literature to keep RegulonDB up to date, and initiated curation of ChIP and gSELEX experiments. We estimate that current knowledge describes between 10% and 30% of the expected total number of transcription factor-gene regulatory interactions in E. coli. RegulonDB provides datasets for interactions for which there is no evidence that they affect expression, as well as expression datasets. We developed a proof-of-concept pipeline to merge binding and expression evidence to identify regulatory interactions. These datasets can be visualized in the RegulonDB JBrowse. We developed the Microbial Conditions Ontology with a controlled vocabulary for the minimal properties to reproduce an experiment, which contributes to integrating data from high-throughput and classic literature. At a higher level of integration, we report Genetic Sensory-Response Units for 200 transcription factors, including their regulation at the metabolic level, and include summaries for 70 of them. Finally, we summarize our research with natural language processing strategies to enhance our biocuration work.
Affiliation(s)
- Alberto Santos-Zavaleta
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Heladia Salgado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Socorro Gama-Castro
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Mishael Sánchez-Pérez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Laura Gómez-Romero
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Daniela Ledezma-Tejeida
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Kevin Alquicira-Hernández
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Luis José Muñiz-Rascado
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Pablo Peña-Loredo
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Cecilia Ishida-Gutiérrez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- David A Velázquez-Ramírez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- Víctor Del Moral-Chávez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- César Bonavides-Martínez
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
- James Galagan
- Department of Biomedical Engineering, Boston University, Boston, MA, USA
- Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México; Department of Biomedical Engineering, Boston University, Boston, MA, USA
14
Ledezma-Tejeida D, Altamirano-Pacheco L, Fajardo V, Collado-Vides J. Limits to a classic paradigm: most transcription factors in E. coli regulate genes involved in multiple biological processes. Nucleic Acids Res 2019; 47:6656-6667. [PMID: 31194874 PMCID: PMC6649764 DOI: 10.1093/nar/gkz525] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 04/19/2019] [Revised: 05/29/2019] [Accepted: 06/04/2019] [Indexed: 01/12/2023] Open
Abstract
Transcription factors (TFs) are important drivers of cellular decision-making. When bacteria encounter a change in the environment, TFs alter the expression of a defined set of genes in order to adequately respond. It is commonly assumed that genes regulated by the same TF are involved in the same biological process. Examples of this are methods that rely on coregulation to infer function of not-yet-annotated genes. We have previously shown that only 21% of TFs involved in metabolism regulate functionally homogeneous genes, based on the proximity of the gene products’ catalyzed reactions in the metabolic network. Here, we provide more evidence to support the claim that a 1-TF/1-process relationship is not a general property. We show that the observed functional heterogeneity of regulons is not a result of the quality of the annotation of regulatory interactions, nor the absence of protein–metabolite interactions, and that it is also present when function is defined by Gene Ontology terms. Furthermore, the observed functional heterogeneity is different from the one expected by chance, supporting the notion that it is a biological property. To further explore the relationship between transcriptional regulation and metabolism, we analyzed five other types of regulatory groups and identified complex regulons (i.e. genes regulated by the same combination of TFs) as the most functionally homogeneous, and this is supported by coexpression data. Whether higher levels of related functions exist beyond metabolism and current functional annotations remains an open question.
Affiliation(s)
- Daniela Ledezma-Tejeida
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico; Department of Biology, Institute of Molecular Systems Biology, ETH Zürich, Zurich, Switzerland
- Luis Altamirano-Pacheco
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Vicente Fajardo
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico; Department of Biomedical Engineering, Boston University, Boston, MA, USA
15
Lithgow-Serrano O, Gama-Castro S, Ishida-Gutiérrez C, Mejía-Almonte C, Tierrafría VH, Martínez-Luna S, Santos-Zavaleta A, Velázquez-Ramírez D, Collado-Vides J. Similarity corpus on microbial transcriptional regulation. J Biomed Semantics 2019; 10:8. [PMID: 31118102 PMCID: PMC6532127 DOI: 10.1186/s13326-019-0200-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/09/2018] [Accepted: 04/16/2019] [Indexed: 12/02/2022] Open
Abstract
Background The ability to express the same meaning in different ways is a well-known property of natural language. This amazing property is the source of major difficulties in natural language processing. Given the constant increase in published literature, its curation and information extraction would strongly benefit from efficient automatic processes, for which corpora of sentences evaluated by experts are a valuable resource. Results Given our interest in applying such approaches to the benefit of curation of the biomedical literature, specifically that about gene regulation in microbial organisms, we decided to build a corpus with graded textual similarity evaluated by curators and that was designed specifically oriented to our purposes. Based on the predefined statistical power of future analyses, we defined features of the design, including sampling, selection criteria, balance, and size, among others. A non-fully crossed study design was applied. Each pair of sentences was evaluated by 3 annotators from a total of 7; the scale used in the semantic similarity assessment task within the Semantic Evaluation workshop (SEMEVAL) was adapted to our goals in four successive iterative sessions with clear improvements in the agreed guidelines and interrater reliability results. Alternatives for such a corpus evaluation have been widely discussed. Conclusions To the best of our knowledge, this is the first similarity corpus—a dataset of pairs of sentences for which human experts rate the semantic similarity of each pair—in this domain of knowledge. We have initiated its incorporation in our research towards high-throughput curation strategies based on natural language processing.
Affiliation(s)
- Oscar Lithgow-Serrano
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México; Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM), Mexico City, México.
- Socorro Gama-Castro
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Cecilia Ishida-Gutiérrez
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Citlalli Mejía-Almonte
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Víctor H Tierrafría
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Sara Martínez-Luna
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Alberto Santos-Zavaleta
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- David Velázquez-Ramírez
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México
- Julio Collado-Vides
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM). A.P., 565-A Cuernavaca, Morelos, 62100, México; Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
16
Affiliation(s)
- Oscar Lithgow-Serrano
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM), Morelos, México
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Universidad Nacional Autónoma de México (UNAM), Ciudad de México, México
- Julio Collado-Vides
- Computational Genomics, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México (UNAM), Morelos, México
- Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
17
Santos-Zavaleta A, Pérez-Rueda E, Sánchez-Pérez M, Velázquez-Ramírez DA, Collado-Vides J. Tracing the phylogenetic history of the Crl regulon through the Bacteria and Archaea genomes. BMC Genomics 2019; 20:299. [PMID: 30991941 PMCID: PMC6469107 DOI: 10.1186/s12864-019-5619-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/14/2019] [Accepted: 03/18/2019] [Indexed: 02/08/2023] Open
Abstract
Background Crl, identified for curli production, is a small transcription factor that stimulates the association of the σS factor (RpoS) with the RNA polymerase core through direct and specific interactions, increasing the transcription rate of genes during the transition from exponential to stationary phase at low temperatures, using indole as an effector molecule. The lack of a comprehensive collection of information on the Crl regulon makes it difficult to identify a dominant function of Crl and to generate any hypotheses concerning its taxonomical distribution in archaeal and bacterial organisms. Results In this work, based on a systematic literature review, we identified the first comprehensive dataset of 86 genes under the control of Crl in the bacterium Escherichia coli K-12; those genes correspond to 40% of the σS regulon in this bacterium. Based on an analysis of orthologs in 18 archaeal and 69 bacterial taxonomical divisions and using E. coli K-12 as a framework, we suggest three main events that resulted in this regulon’s actual form: (i) in a first step, rpoS, a gene widely distributed in bacteria and archaea cellular domains, was recruited to regulate genes involved in ancient metabolic processes, such as those associated with glycolysis and the tricarboxylic acid cycle; (ii) in a second step, the regulon recruited those genes involved in metabolic processes, which are mainly taxonomically constrained to Proteobacteria, with some secondary losses, such as those genes involved in responses to stress or starvation and cell adhesion, among others; and (iii) in a later step, Crl might have been recruited in Enterobacteriaceae, because its taxonomical pattern is constrained to this bacterial order; however, further analyses are necessary.
Conclusions Therefore, we suggest that the regulon Crl is highly flexible for phenotypic adaptation, probably as a consequence of the diverse growth environments associated with all organisms in which members of this regulatory network are present. Electronic supplementary material The online version of this article (10.1186/s12864-019-5619-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- A Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 62210, Cuernavaca, Morelos, Mexico.
- E Pérez-Rueda
- Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Sede Mérida, Universidad Nacional Autónoma de México, Unidad Académica de Ciencias y Tecnología, 97302, Mérida, Yucatán, Mexico; Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago, Chile.
- M Sánchez-Pérez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 62210, Cuernavaca, Morelos, Mexico
- D A Velázquez-Ramírez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 62210, Cuernavaca, Morelos, Mexico
- J Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, 62210, Cuernavaca, Morelos, Mexico
18
Salgado H, Martínez-Flores I, Bustamante VH, Alquicira-Hernández K, García-Sotelo JS, García-Alonso D, Collado-Vides J. Using RegulonDB, the Escherichia coli K-12 Gene Regulatory Transcriptional Network Database. Curr Protoc Bioinformatics 2018; 61:1.32.1-1.32.30. [PMID: 30040192 DOI: 10.1002/cpbi.43] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/09/2022]
Abstract
In RegulonDB, for over 25 years, we have been gathering knowledge by manual curation from original scientific literature on the regulation of transcription initiation and genome organization in transcription units of the Escherichia coli K-12 genome. This unit describes six basic protocols that can serve as a guiding introduction to the main content of the current version (v9.4) of this electronic resource. These protocols include general navigation as well as searching for specific objects such as genes, gene products, transcription units, promoters, transcription factors, coexpression, and genetic sensory response units or GENSOR Units. In these protocols, the user will find an initial introduction to the concepts pertinent to the protocol, the content obtained when performing the given navigation, and the necessary resources for carrying out the protocol. This easy-to-follow presentation should help anyone interested in quickly seeing all that is currently offered in RegulonDB, including position weight matrices of transcription factors, coexpression values based on published microarrays, and the GENSOR Units unique to RegulonDB that offer regulatory mechanisms in the context of their signals and metabolic consequences. © 2018 by John Wiley & Sons, Inc.
Affiliation(s)
- Heladia Salgado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Irma Martínez-Flores
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Víctor H Bustamante
  - Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Kevin Alquicira-Hernández
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Jair S García-Sotelo
  - Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Santiago de Querétaro, Querétaro, México
- Delfino García-Alonso
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
19
Rioualen C, Charbonnier-Khamvongsa L, Collado-Vides J, van Helden J. Integrating Bacterial ChIP-seq and RNA-seq Data With SnakeChunks. Curr Protoc Bioinformatics 2019; 66:e72. [PMID: 30786165 DOI: 10.1002/cpbi.72] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/07/2022]
Abstract
Next-generation sequencing (NGS) is becoming a routine approach in most domains of the life sciences. To ensure reproducibility of results, there is a crucial need to improve the automation of NGS data processing and enable forthcoming studies relying on big datasets. Although user-friendly interfaces now exist, there remains a strong need for accessible solutions that allow experimental biologists to analyze and explore their results in an autonomous and flexible way. The protocols here describe a modular system that enables a user to compose and fine-tune workflows based on SnakeChunks, a library of rules for the Snakemake workflow engine. They are illustrated using a study combining ChIP-seq and RNA-seq to identify target genes of the global transcription factor FNR in Escherichia coli, which has the advantage that results can be compared with the most up-to-date collection of existing knowledge about transcriptional regulation in this model organism, extracted from the RegulonDB database. © 2019 by John Wiley & Sons, Inc.
Affiliation(s)
- Claire Rioualen
  - Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France; Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México
- Lucie Charbonnier-Khamvongsa
  - Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, México; Department of Biomedical Engineering, Boston University, Boston, Massachusetts
- Jacques van Helden
  - Aix-Marseille University, INSERM, Laboratory of Theory and Approaches of Genome Complexity (TAGC), Marseille, France; Institut Français de Bioinformatique (IFB), UMS 3601-CNRS, Université Paris-Saclay, Orsay, France
20
Karp PD, Ong WK, Paley S, Billington R, Caspi R, Fulcher C, Kothari A, Krummenacker M, Latendresse M, Midford PE, Subhraveti P, Gama-Castro S, Muñiz-Rascado L, Bonavides-Martinez C, Santos-Zavaleta A, Mackie A, Collado-Vides J, Keseler IM, Paulsen I. The EcoCyc Database. EcoSal Plus 2018; 8:10.1128/ecosalplus.ESP-0006-2018. [PMID: 30406744 PMCID: PMC6504970 DOI: 10.1128/ecosalplus.esp-0006-2018] [Citation(s) in RCA: 53] [Impact Index Per Article: 8.8] [Received: 06/05/2018] [Indexed: 01/28/2023]
Abstract
EcoCyc is a bioinformatics database available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene product, metabolite, reaction, operon, and metabolic pathway. The database also includes information on E. coli gene essentiality and on nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc and can be executed via EcoCyc.org. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. This review outlines the data content of EcoCyc and of the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Wai Kit Ong
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Suzanne Paley
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Ron Caspi
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Carol Fulcher
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Anamika Kothari
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Mario Latendresse
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Peter E Midford
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Socorro Gama-Castro
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Luis Muñiz-Rascado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- César Bonavides-Martinez
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Alberto Santos-Zavaleta
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Amanda Mackie
  - Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
- Ingrid M Keseler
  - Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
- Ian Paulsen
  - Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
21
Tierrafría VH, Mejía-Almonte C, Camacho-Zaragoza JM, Salgado H, Alquicira K, Ishida C, Gama-Castro S, Collado-Vides J. MCO: towards an ontology and unified vocabulary for a framework-based annotation of microbial growth conditions. Bioinformatics 2018; 35:856-864. [PMID: 30137210 PMCID: PMC7963087 DOI: 10.1093/bioinformatics/bty689] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/12/2018] [Revised: 06/22/2018] [Accepted: 08/16/2018] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION: A major component in increasing our understanding of the biology of an organism is the mapping of its genotypic potential into its phenotypic expression profiles. This mapping is executed by the machinery of gene regulation, which is essentially studied by changes in growth conditions. Although many efforts have been made to systematize the annotation of experimental conditions in microbiology, the available annotations are not based on a consistent and controlled vocabulary, which makes it difficult to identify biologically meaningful comparisons of knowledge derived from different experiments or laboratories. RESULTS: We curated terms related to experimental conditions that affect gene expression in Escherichia coli K-12. Since this is the best-studied microorganism, the collected terms are the seed for the Microbial Conditions Ontology (MCO), a controlled and structured vocabulary that can be expanded to annotate microbial conditions in general. Moreover, we developed an annotation framework to describe experimental conditions, providing the foundation to identify regulatory networks that operate under particular conditions. AVAILABILITY AND IMPLEMENTATION: As far as we know, MCO is the first ontology for growth conditions of any bacterial organism, and it is available at http://regulondb.ccg.unam.mx and https://github.com/microbial-conditions-ontology. Furthermore, we will disseminate MCO throughout the Open Biological and Biomedical Ontology (OBO) Foundry in order to set a standard for the annotation of gene expression data. This will enable comparison of data from diverse data sources. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- J M Camacho-Zaragoza
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- H Salgado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- K Alquicira
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- C Ishida
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
22
Santos-Zavaleta A, Sánchez-Pérez M, Salgado H, Velázquez-Ramírez DA, Gama-Castro S, Tierrafría VH, Busby SJW, Aquino P, Fang X, Palsson BO, Galagan JE, Collado-Vides J. A unified resource for transcriptional regulation in Escherichia coli K-12 incorporating high-throughput-generated binding data into RegulonDB version 10.0. BMC Biol 2018; 16:91. [PMID: 30115066 PMCID: PMC6094552 DOI: 10.1186/s12915-018-0555-y] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 03/02/2018] [Accepted: 07/25/2018] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND: Our understanding of the regulation of gene expression has benefited from the availability of high-throughput technologies that interrogate the whole genome for the binding of specific transcription factors and gene expression profiles. In the case of widely used model organisms, such as Escherichia coli K-12, the new knowledge gained from these approaches needs to be integrated with the legacy of accumulated knowledge from genetic and molecular biology experiments conducted in the pre-genomic era in order to attain the deepest level of understanding possible based on the available data. RESULTS: In this paper, we describe an expansion of RegulonDB, the database containing the rich legacy of decades of classic molecular biology experiments supporting what we know about gene regulation and operon organization in E. coli K-12, to include the genome-wide dataset collections from 32 ChIP and 19 gSELEX publications, in addition to around 60 genome-wide expression profiles relevant to the functional significance of these datasets and used in their curation. Three essential features for the integration of this information coming from different methodological approaches are: first, a controlled vocabulary within an ontology for precisely defining growth conditions; second, the criteria to separate elements with enough evidence to consider them involved in gene regulation from isolated transcription factor binding sites without such support; and third, an expanded computational model supporting this knowledge. Altogether, this constitutes the basis for adequately gathering and enabling the comparisons and integration needed to manage and access such wealth of knowledge. CONCLUSIONS: This version 10.0 of RegulonDB is a first step toward what should become the unifying access point for current and future knowledge on gene regulation in E. coli K-12. Furthermore, this model platform and associated methodologies and criteria can be emulated for gathering knowledge on other microbial organisms.
Affiliation(s)
- Alberto Santos-Zavaleta
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Mishael Sánchez-Pérez
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Heladia Salgado
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Socorro Gama-Castro
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Víctor H. Tierrafría
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Patricia Aquino
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Xin Fang
  - Department of Bioengineering, University of California San Diego, La Jolla, California, USA
- Bernhard O. Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, California, USA
  - Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
- James E. Galagan
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
- Julio Collado-Vides
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
  - Department of Biomedical Engineering, Boston University, Boston, Massachusetts, USA
23
Méndez-Cruz CF, Gama-Castro S, Mejía-Almonte C, Castillo-Villalba MP, Muñiz-Rascado LJ, Collado-Vides J. First steps in automatic summarization of transcription factor properties for RegulonDB: classification of sentences about structural domains and regulated processes. Database (Oxford) 2018; 2017:4237584. [PMID: 29220462 PMCID: PMC5737074 DOI: 10.1093/database/bax070] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 11/25/2016] [Accepted: 08/15/2017] [Indexed: 11/17/2022]
Abstract
The RegulonDB (http://regulondb.ccg.unam.mx) team generates manually elaborated summaries about transcription factors (TFs) of Escherichia coli K-12. These texts involve considerable effort, since they summarize a diverse collection of structural, mechanistic and physiological properties of TFs and, due to constant new research, ideally they require frequent updating. In natural language processing, several techniques for automatic summarization have been developed. Therefore, our proposal is to extract, by using those techniques, relevant information about TFs for assisting the curation and elaboration of the manual summaries. Here, we present the results of the automatic classification of sentences about the biological processes regulated by a TF and the information about the structural domains constituting the TF. We tested two classical classifiers, Naïve Bayes and Support Vector Machines (SVMs), with the sentences of the manual summaries as training data. The best classifier was an SVM employing lexical, grammatical, and terminological features (F-score, 0.8689). The sentences of articles analyzed by this classifier were frequently true, but many sentences were set aside (high precision with low recall); consequently, some improvement is required. Nevertheless, automatic summaries of complete articles about five TFs, generated with this classifier, included much of the relevant information of the summaries written by curators (high ROUGE-1 recall). In fact, a manual comparison confirmed that the best summary encompassed 100% of the relevant information. Hence, our empirical results suggest that our proposal is promising for covering more properties of TFs to generate suggested sentences with relevant information to help the curation work without losing quality.
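The abstract reports an F-score together with a "high precision with low recall" pattern. As a minimal sketch, assuming the standard F-beta definition (the abstract does not state which variant was used) and purely hypothetical counts that are not taken from the article, the relation between the three quantities can be written as:

```python
def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Return the F-beta score computed from raw classification counts.

    tp, fp, fn are true positives, false positives and false negatives.
    beta = 1.0 gives the usual F1 (harmonic mean of precision and recall).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 90 of 100 selected sentences are correct
# (precision 0.9), but 210 relevant sentences were set aside (recall 0.3).
example = f_score(tp=90, fp=10, fn=210)
```

Because the harmonic mean is pulled toward the smaller of the two values, a classifier that sets aside many relevant sentences scores well below its precision.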
Affiliation(s)
- Carlos-Francisco Méndez-Cruz
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
- Socorro Gama-Castro
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
- Citlalli Mejía-Almonte
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
- Marco-Polo Castillo-Villalba
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
- Luis-José Muñiz-Rascado
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
- Julio Collado-Vides
  - Computational Genomics Program, Center for Genomic Sciences, National Autonomous University of Mexico, Av. Universidad, s/n, Colonia Chamilpa, Cuernavaca, Morelos 62100, Mexico
24
Rinaldi F, Lithgow O, Gama-Castro S, Solano H, Lopez A, Muñiz Rascado LJ, Ishida-Gutiérrez C, Méndez-Cruz CF, Collado-Vides J. Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford) 2017; 2017:3074784. [PMID: 28365731 PMCID: PMC5467564 DOI: 10.1093/database/bax012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 11/07/2016] [Accepted: 01/30/2017] [Indexed: 02/03/2023]
Abstract
Experimentally generated biological information needs to be organized and structured in order to become meaningful knowledge. However, the rate at which new information is being published makes manual curation increasingly unable to cope. Devising new curation strategies that leverage data mining and text analysis is, therefore, a promising avenue to help life science databases cope with the deluge of novel information. In this article, we describe the integration of text mining technologies in the curation pipeline of the RegulonDB database, and discuss how the process can enhance the productivity of the curators.
Specifically, a named entity recognition approach is used to pre-annotate terms referring to a set of domain entities which are potentially relevant for the curation process. The annotated documents are presented to the curator, who, thanks to a custom-designed interface, can select sentences containing specific types of entities, thus restricting the amount of text that needs to be inspected. Additionally, a module capable of computing semantic similarity between sentences across the entire collection of articles to be curated is being integrated in the system. We tested the module using three sets of scientific articles and six domain experts. All these improvements are gradually enabling us to obtain a high throughput curation process with the same quality as manual curation.
Affiliation(s)
- Fabio Rinaldi
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich; Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, Zurich 8050, Switzerland
- Oscar Lithgow
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Socorro Gama-Castro
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Hilda Solano
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Alejandra Lopez
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Luis José Muñiz Rascado
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Cecilia Ishida-Gutiérrez
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Carlos-Francisco Méndez-Cruz
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
- Julio Collado-Vides
  - Swiss Institute of Bioinformatics, and Institute of Computational Linguistics, University of Zurich, 8050 Andreasstrasse 14, Zürich
25
Balderas-Martínez YI, Rinaldi F, Contreras G, Solano-Lira H, Sánchez-Pérez M, Collado-Vides J, Selman M, Pardo A. Improving biocuration of microRNAs in diseases: a case study in idiopathic pulmonary fibrosis. Database (Oxford) 2017; 2017:3748307. [PMID: 28605770 PMCID: PMC5467562 DOI: 10.1093/database/bax030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/30/2016] [Accepted: 03/25/2017] [Indexed: 12/24/2022]
Abstract
MicroRNAs (miRNAs) are small and non-coding RNA molecules that inhibit gene expression posttranscriptionally. They play important roles in several biological processes, and in recent years there has been an interest in studying how they are related to the pathogenesis of diseases. Although there are already some databases that contain information for miRNAs and their relation with illnesses, their curation represents a significant challenge due to the amount of information that is being generated every day. In particular, respiratory diseases are poorly documented in databases, despite the fact that they are of increasing concern regarding morbidity, mortality and economic impacts. In this work, we present the results that we obtained in the BioCreative Interactive Track (IAT), using a semiautomatic approach for improving biocuration of miRNAs related to diseases. Our procedures will be useful to complement databases that contain this type of information. We adapted the OntoGene text mining pipeline and the ODIN curation system in a full-text corpus of scientific publications concerning one specific respiratory disease: idiopathic pulmonary fibrosis, the most common and aggressive of the idiopathic interstitial cases of pneumonia. We curated 823 miRNA text snippets and found a total of 246 miRNAs related to this disease based on our semiautomatic approach with the system OntoGene/ODIN. The biocuration throughput improved by a factor of 12 compared with traditional manual biocuration. A significant advantage of our semiautomatic pipeline is that it can be applied to obtain the miRNAs of all the respiratory diseases and offers the possibility to be used for other illnesses. Database URL: http://odin.ccg.unam.mx/ODIN/bc2015-miRNA/.
Affiliation(s)
- Yalbi Itzel Balderas-Martínez
  - Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México; CONACYT-INER Ismael Cosío Villegas, Departamento Investigación, Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
- Fabio Rinaldi
  - Swiss Institute of Bioinformatics and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, CH-8050 Zurich, Switzerland; Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
- Gabriela Contreras
  - Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
- Hilda Solano-Lira
  - Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
- Mishael Sánchez-Pérez
  - Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
- Julio Collado-Vides
  - Center for Genomics Sciences, Computational Genomics Program, Universidad Nacional Autónoma de México, Av. Universidad s/n, Chamilpa, CP 62210, Cuernavaca, Morelos, México
- Moisés Selman
  - Instituto Nacional de Enfermedades Respiratorias Ismael Cosío Villegas, Dirección de Investigación Calzada de Tlalpan 4502 Sección XVI, Tlalpan, CP Ciudad de México, CDMX, México
- Annie Pardo
  - Facultad de Ciencias, Departamento Biología Celular, Universidad Nacional Autónoma de México, Ciudad Universitaria, Circuito Exterior s/n, Coyoacán, CP 04510, Ciudad de México, CDMX, México
26
Ledezma-Tejeida D, Ishida C, Collado-Vides J. Genome-Wide Mapping of Transcriptional Regulation and Metabolism Describes Information-Processing Units in Escherichia coli. Front Microbiol 2017; 8:1466. [PMID: 28824593 PMCID: PMC5540944 DOI: 10.3389/fmicb.2017.01466] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 03/03/2017] [Accepted: 07/20/2017] [Indexed: 11/13/2022] Open
Abstract
In the face of changes in their environment, bacteria adjust gene expression levels and produce appropriate responses. The individual layers of this process have been widely studied: the transcriptional regulatory network describes the regulatory interactions that produce changes in the metabolic network, both of which are coordinated by the signaling network, but the interplay between them has never been described in a systematic fashion. Here, we formalize the process of detection and processing of environmental information mediated by individual transcription factors (TFs), utilizing a concept termed genetic sensory response units (GENSOR units), which are composed of four components: (1) a signal, (2) signal transduction, (3) genetic switch, and (4) a response. We used experimentally validated data sets from two databases to assemble a GENSOR unit for each of the 189 local TFs of Escherichia coli K-12 contained in the RegulonDB database. Further analysis suggested that feedback is a common occurrence in signal processing, and there is a gradient of functional complexity in the response mediated by each TF, as opposed to a one regulator/one pathway rule. Finally, we provide examples of other GENSOR unit applications, such as hypothesis generation, detailed description of cellular decision making, and elucidation of indirect regulatory mechanisms.
Affiliation(s)
- Daniela Ledezma-Tejeida
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
- Cecilia Ishida
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico
27
Pannier L, Merino E, Marchal K, Collado-Vides J. Effect of genomic distance on coexpression of coregulated genes in E. coli. PLoS One 2017; 12:e0174887. [PMID: 28419102 PMCID: PMC5395161 DOI: 10.1371/journal.pone.0174887] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 12/13/2016] [Accepted: 03/16/2017] [Indexed: 12/26/2022] Open
Abstract
In prokaryotes, genomic distance is a feature that in addition to coregulation affects coexpression. Several observations, such as genomic clustering of highly coexpressed small regulons, support the idea that coexpression behavior of coregulated genes is affected by the distance between the coregulated genes. However, the specific contribution of distance in addition to coregulation in determining the degree of coexpression has not yet been studied systematically. In this work, we exploit the rich information in RegulonDB to study how the genomic distance between coregulated genes affects their degree of coexpression, measured by pairwise similarity of expression profiles obtained under a large number of conditions. We observed that, in general, coregulated genes display higher degrees of coexpression as they are more closely located on the genome. This contribution of genomic distance in determining the degree of coexpression was relatively small compared to the degree of coexpression that was determined by the tightness of the coregulation (degree of overlap of regulatory programs) but was shown to be evolutionarily constrained. In addition, the distance effect was sufficient to guarantee coexpression of coregulated genes that are located at very short distances, irrespective of their tightness of coregulation. This is partly but definitely not always because the close distance is also the cause of the coregulation. In cases where it is not, we hypothesize that the effect of the distance on coexpression could be caused by the fact that coregulated genes closely located to each other are also relatively more equidistantly located from their common TF and therefore subject to more similar levels of TF molecules. The absolute genomic distance of the coregulated genes to their common TF-coding gene tends to be less important in determining the degree of coexpression.
Our results pinpoint the importance of taking into account the combined effect of distance and coregulation when studying prokaryotic coexpression and transcriptional regulation.
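The two quantities compared in this abstract, coexpression of a gene pair and the genomic distance between the genes, can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and the shorter arc on a circular chromosome as the distance; the abstract specifies neither, and the function names, profiles, and coordinates below are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two expression profiles of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def circular_distance(pos_a: int, pos_b: int, genome_length: int) -> int:
    """Shorter of the two arcs between two positions on a circular genome."""
    d = abs(pos_a - pos_b) % genome_length
    return min(d, genome_length - d)


# Hypothetical pair of coregulated genes on a toy 1000-bp circular genome.
coexpression = pearson([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.0])
distance = circular_distance(10, 990, genome_length=1000)
```

Collecting such (distance, coexpression) pairs over all coregulated gene pairs is one way to examine the trend the abstract describes.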
Affiliation(s)
- Lucia Pannier
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Enrique Merino
  - Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Kathleen Marchal
  - Department of Microbial and Molecular Systems, KU Leuven, Centre of Microbial and Plant Genetics, Leuven, Belgium
  - Department of Plant Biotechnology and Bioinformatics, Ghent University, Technologiepark, Ghent, Belgium
  - Department of Information Technology, Ghent University, IMinds, Ghent, Belgium
  - Department of Genetics, University of Pretoria, Hatfield Campus, Pretoria, South Africa
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
|
28
|
Rinaldi F, Lithgow O, Gama-Castro S, Solano H, López-Fuentes A, Muñiz Rascado LJ, Ishida-Gutiérrez C, Méndez-Cruz CF, Collado-Vides J. Strategies towards digital and semi-automated curation in RegulonDB. Database (Oxford) 2017; 2017:3737829. [PMID: 28605767 PMCID: PMC5467572 DOI: 10.1093/database/bax029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Affiliation(s)
- Fabio Rinaldi
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, and Swiss Institute of Bioinformatics and Institute of Computational Linguistics, University of Zurich, Andreasstrasse 15, Zurich 8050, Switzerland
| | - Oscar Lithgow
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Hilda Solano
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Alejandra López-Fuentes
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Luis José Muñiz Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Cecilia Ishida-Gutiérrez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Carlos-Francisco Méndez-Cruz
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
|
29
|
Keseler IM, Mackie A, Santos-Zavaleta A, Billington R, Bonavides-Martínez C, Caspi R, Fulcher C, Gama-Castro S, Kothari A, Krummenacker M, Latendresse M, Muñiz-Rascado L, Ong Q, Paley S, Peralta-Gil M, Subhraveti P, Velázquez-Ramírez DA, Weaver D, Collado-Vides J, Paulsen I, Karp PD. The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res 2016; 45:D543-D550. [PMID: 27899573 PMCID: PMC5210515 DOI: 10.1093/nar/gkw1003] [Citation(s) in RCA: 377] [Impact Index Per Article: 47.1] [Received: 09/21/2016] [Accepted: 11/07/2016] [Indexed: 12/16/2022] Open
Abstract
EcoCyc (EcoCyc.org) is a freely accessible, comprehensive database that collects and summarizes experimental data for Escherichia coli K-12, the best-studied bacterial model organism. New experimental discoveries about gene products, their function and regulation, new metabolic pathways, enzymes and cofactors are regularly added to EcoCyc. New SmartTable tools allow users to browse collections of related EcoCyc content. SmartTables can also serve as repositories for user- or curator-generated lists. EcoCyc now supports running and modifying E. coli metabolic models directly on the EcoCyc website.
Affiliation(s)
- Ingrid M Keseler
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Amanda Mackie
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia
| | - Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | | | - César Bonavides-Martínez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Ron Caspi
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Carol Fulcher
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Anamika Kothari
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | | | | | - Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Quang Ong
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Suzanne Paley
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Martin Peralta-Gil
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | | | - David A Velázquez-Ramírez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Daniel Weaver
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Ian Paulsen
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW, Australia
| | - Peter D Karp
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA
|
30
|
Martínez-Flores I, Pérez-Morales D, Sánchez-Pérez M, Paredes CC, Collado-Vides J, Salgado H, Bustamante VH. In silico clustering of Salmonella global gene expression data reveals novel genes co-regulated with the SPI-1 virulence genes through HilD. Sci Rep 2016; 6:37858. [PMID: 27886269 PMCID: PMC5122947 DOI: 10.1038/srep37858] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 08/17/2016] [Accepted: 11/02/2016] [Indexed: 01/04/2023] Open
Abstract
A wide variety of Salmonella enterica serovars cause intestinal and systemic infections in humans and animals. Salmonella Pathogenicity Island 1 (SPI-1) is a chromosomal region containing 39 genes that have crucial virulence roles. The AraC-like transcriptional regulator HilD, encoded in SPI-1, positively controls the expression of the SPI-1 genes, as well as of several other virulence genes located outside SPI-1. In this study, we applied a clustering method to the global gene expression data of S. enterica serovar Typhimurium from the COLOMBOS database; thus genes that show an expression pattern similar to that of SPI-1 genes were selected. This analysis revealed nine novel genes that are co-expressed with SPI-1, which are located in different chromosomal regions. Expression analyses and protein-DNA interaction assays showed regulation by HilD for six of these genes: gtgE, phoH, sinR, SL1263 (lpxR) and SL4247 were regulated directly, whereas SL1896 was regulated indirectly. Interestingly, phoH is an ancestral gene conserved in most bacteria, whereas the other genes show characteristics of genes acquired by Salmonella. A role in virulence has been previously demonstrated for gtgE, lpxR and sinR. Our results further expand the regulon of HilD and thus identify novel possible Salmonella virulence genes.
Affiliation(s)
- Irma Martínez-Flores
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Deyanira Pérez-Morales
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Mishael Sánchez-Pérez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Claudia C Paredes
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Heladia Salgado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
| | - Víctor H Bustamante
- Departamento de Microbiología Molecular, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, México
|
31
|
Moretto M, Sonego P, Dierckxsens N, Brilli M, Bianco L, Ledezma-Tejeida D, Gama-Castro S, Galardini M, Romualdi C, Laukens K, Collado-Vides J, Meysman P, Engelen K. COLOMBOS v3.0: leveraging gene expression compendia for cross-species analyses. Nucleic Acids Res 2015; 44:D620-3. [PMID: 26586805 PMCID: PMC4702885 DOI: 10.1093/nar/gkv1251] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Received: 09/15/2015] [Accepted: 11/01/2015] [Indexed: 01/29/2023] Open
Abstract
COLOMBOS is a database that integrates publicly available transcriptomics data for several prokaryotic model organisms. Compared to the previous version it has more than doubled in size, both in terms of species and data available. The manually curated condition annotation has been overhauled as well, giving more complete information about samples’ experimental conditions and their differences. Functionality-wise, cross-species analyses now enable users to analyse expression data for all species simultaneously, and identify candidate genes with evolutionarily conserved expression behaviour. All the expression-based query tools have undergone a substantial improvement, overcoming the limit of enforced co-expression data retrieval and instead enabling the return of more complex patterns of expression behaviour. COLOMBOS is freely available through a web application at http://colombos.net/. The complete database is also accessible via REST API or downloadable as tab-delimited text files.
Affiliation(s)
- Marco Moretto
- Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy; Department of Biology, University of Padova, Padova (PD) 35121, Italy
| | - Paolo Sonego
- Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy
| | - Nicolas Dierckxsens
- Interuniversity Institute of Bioinformatics Brussels (IB), ULB-VUB, Triomflaan CP 263, B-1050 Brussels, Belgium
| | - Matteo Brilli
- Department of Genomics and Biology of Fruit Crops, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all' Adige, Trento (TN) 38010, Italy
| | - Luca Bianco
- Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy
| | - Daniela Ledezma-Tejeida
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
| | - Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
| | - Marco Galardini
- EMBL-EBI, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK
| | - Chiara Romualdi
- Department of Biology, University of Padova, Padova (PD) 35121, Italy
| | - Kris Laukens
- Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp, Belgium; Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, B-2650 Edegem, Belgium
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico
| | - Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp, Belgium; Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, B-2650 Edegem, Belgium
| | - Kristof Engelen
- Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy
|
32
|
Gama-Castro S, Salgado H, Santos-Zavaleta A, Ledezma-Tejeida D, Muñiz-Rascado L, García-Sotelo JS, Alquicira-Hernández K, Martínez-Flores I, Pannier L, Castro-Mondragón JA, Medina-Rivera A, Solano-Lira H, Bonavides-Martínez C, Pérez-Rueda E, Alquicira-Hernández S, Porrón-Sotelo L, López-Fuentes A, Hernández-Koutoucheva A, Del Moral-Chávez V, Rinaldi F, Collado-Vides J. RegulonDB version 9.0: high-level integration of gene regulation, coexpression, motif clustering and beyond. Nucleic Acids Res 2015; 44:D133-43. [PMID: 26527724 PMCID: PMC4702833 DOI: 10.1093/nar/gkv1156] [Citation(s) in RCA: 324] [Impact Index Per Article: 36.0] [Received: 09/15/2015] [Accepted: 10/19/2015] [Indexed: 01/28/2023] Open
Abstract
RegulonDB (http://regulondb.ccg.unam.mx) is one of the most useful and important resources on bacterial gene regulation, as it integrates the scattered scientific knowledge of the best-characterized organism, Escherichia coli K-12, in a database that organizes large amounts of data. Its electronic format enables researchers to compare their results with the legacy of previous knowledge and supports bioinformatics tools and model building. Here, we summarize our progress with RegulonDB since our last Nucleic Acids Research publication describing RegulonDB, in 2013. In addition to keeping curation up to date, we report a collection of 232 interactions with small RNAs affecting 192 genes, and the complete repertoire of 189 Elementary Genetic Sensory-Response units (GENSOR units), integrating the signal, regulatory interactions, and metabolic pathways they govern. These additions represent major progress to a higher level of understanding of regulated processes. We have updated the computationally predicted transcription factors, which total 304 (184 with experimental evidence and 120 from computational predictions); we updated our position-weight matrices and have included tools for clustering them in evolutionary families. We describe our semiautomatic strategy to accelerate curation, including datasets from high-throughput experiments, a novel coexpression distance to search for ‘neighborhood’ genes to known operons and regulons, and computational developments.
Affiliation(s)
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Heladia Salgado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Daniela Ledezma-Tejeida
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Jair Santiago García-Sotelo
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Kevin Alquicira-Hernández
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Irma Martínez-Flores
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Lucia Pannier
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | | | - Alejandra Medina-Rivera
- Laboratorio Internacional de Investigación sobre el Genoma Humano, Universidad Nacional Autónoma de México, Campus Juriquilla, Boulevard Juriquilla 3001, Juriquilla 76230, Santiago de Querétaro, QRO, Mexico
| | - Hilda Solano-Lira
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - César Bonavides-Martínez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Ernesto Pérez-Rueda
- Departamento de Microbiologia Molecular, IBT, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62100, Mexico
| | - Shirley Alquicira-Hernández
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Liliana Porrón-Sotelo
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Alejandra López-Fuentes
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Anastasia Hernández-Koutoucheva
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Víctor Del Moral-Chávez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
| | - Fabio Rinaldi
- Institute of Computational Linguistics, University of Zurich, Binzmühlestrasse 14, CH-8050 Zurich, Switzerland
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, Mexico
|
33
|
Alvarez-Vasquez FJ, Freyre-González JA, Balderas-Martínez YI, Delgado-Carrillo MI, Collado-Vides J. Mathematical modeling of the apo and holo transcriptional regulation in Escherichia coli. Mol BioSyst 2015; 11:994-1003. [DOI: 10.1039/c4mb00561a] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/30/2022]
Abstract
Transcription factors can bind to DNA either with their effector bound (holo conformation), or as free proteins (apo conformation).
Affiliation(s)
| | - Julio A. Freyre-González
- Evolutionary Genomics Program
- Center for Genomic Sciences
- Universidad Nacional Autónoma de México
- Cuernavaca
- Mexico
| | - Yalbi I. Balderas-Martínez
- Computational Genomics Program
- Center for Genomic Sciences
- Universidad Nacional Autónoma de México
- Cuernavaca
- Mexico
| | | | - Julio Collado-Vides
- Computational Genomics Program
- Center for Genomic Sciences
- Universidad Nacional Autónoma de México
- Cuernavaca
- Mexico
|
34
|
Gama-Castro S, Rinaldi F, López-Fuentes A, Balderas-Martínez YI, Clematide S, Ellendorff TR, Santos-Zavaleta A, Marques-Madeira H, Collado-Vides J. Assisted curation of regulatory interactions and growth conditions of OxyR in E. coli K-12. Database (Oxford) 2014; 2014:bau049. [PMID: 24903516 PMCID: PMC4207228 DOI: 10.1093/database/bau049] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Indexed: 11/18/2022]
Abstract
Given the current explosion of data within original publications generated in the field of genomics, a recognized bottleneck is the transfer of such knowledge into comprehensive databases. We have for years organized knowledge on transcriptional regulation reported in the original literature of Escherichia coli K-12 into RegulonDB (http://regulondb.ccg.unam.mx), our database that is currently supported by >5000 papers. Here, we report a first step towards the automatic biocuration of growth conditions in this corpus. Using the OntoGene text-mining system (http://www.ontogene.org), we extracted and manually validated regulatory interactions and growth conditions in a new approach based on filters that enable the curator to select informative sentences from preprocessed full papers. Based on a set of 48 papers dealing with oxidative stress by OxyR, we were able to retrieve 100% of the OxyR regulatory interactions present in RegulonDB, including the transcription factors and their effect on target genes. Our strategy was designed to extract, as we did, their growth conditions. This result provides a proof of concept for a more direct and efficient curation process, and enables us to define the strategy of the subsequent steps to be implemented for a semi-automatic curation of original literature dealing with regulation of gene expression in bacteria. This project will enhance the efficiency and quality of the curation of knowledge present in the literature of gene regulation, and contribute to a significant increase in the encoding of the regulatory network of E. coli. RegulonDB Database URL: http://regulondb.ccg.unam.mx; OntoGene URL: http://www.ontogene.org
Affiliation(s)
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Fabio Rinaldi
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Alejandra López-Fuentes
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Yalbi Itzel Balderas-Martínez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Simon Clematide
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Tilia Renate Ellendorff
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Hernani Marques-Madeira
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100 and Institute of Computational Linguistics, University of Zurich, Binzmuhlestrasse 14, Zurich 8050, Switzerland
|
35
|
Karp PD, Weaver D, Paley S, Fulcher C, Kubo A, Kothari A, Krummenacker M, Subhraveti P, Weerasinghe D, Gama-Castro S, Huerta AM, Muñiz-Rascado L, Bonavides-Martinez C, Weiss V, Peralta-Gil M, Santos-Zavaleta A, Schröder I, Mackie A, Gunsalus R, Collado-Vides J, Keseler IM, Paulsen I. The EcoCyc Database. EcoSal Plus 2014; 6:10.1128/ecosalplus.ESP-0009-2013. [PMID: 26442933 PMCID: PMC4243172 DOI: 10.1128/ecosalplus.esp-0009-2013] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 01/16/2014] [Indexed: 11/20/2022]
Abstract
EcoCyc is a bioinformatics database available at EcoCyc.org that describes the genome and the biochemical machinery of Escherichia coli K-12 MG1655. The long-term goal of the project is to describe the complete molecular catalog of the E. coli cell, as well as the functions of each of its molecular parts, to facilitate a system-level understanding of E. coli. EcoCyc is an electronic reference source for E. coli biologists and for biologists who work with related microorganisms. The database includes information pages on each E. coli gene, metabolite, reaction, operon, and metabolic pathway. The database also includes information on E. coli gene essentiality and on nutrient conditions that do or do not support the growth of E. coli. The website and downloadable software contain tools for analysis of high-throughput data sets. In addition, a steady-state metabolic flux model is generated from each new version of EcoCyc. The model can predict metabolic flux rates, nutrient uptake rates, and growth rates for different gene knockouts and nutrient conditions. This review provides a detailed description of the data content of EcoCyc and of the procedures by which this content is generated.
Affiliation(s)
- Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Daniel Weaver
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Suzanne Paley
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Carol Fulcher
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Aya Kubo
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Anamika Kothari
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | | | | | | | - Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Araceli M Huerta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Luis Muñiz-Rascado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - César Bonavides-Martinez
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Verena Weiss
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Martin Peralta-Gil
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Alberto Santos-Zavaleta
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Imke Schröder
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
- UCLA Institute of Genomics and Proteomics, University of California, Los Angeles, CA 90095
| | - Amanda Mackie
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
| | - Robert Gunsalus
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, CA 90095
| | - Julio Collado-Vides
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100, México
| | - Ingrid M Keseler
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025
| | - Ian Paulsen
- Department of Chemistry and Biomolecular Sciences, Macquarie University, Sydney, NSW 2109, Australia
|
36
|
Meysman P, Collado-Vides J, Morett E, Viola R, Engelen K, Laukens K. Structural properties of prokaryotic promoter regions correlate with functional features. PLoS One 2014; 9:e88717. [PMID: 24516674 PMCID: PMC3918002 DOI: 10.1371/journal.pone.0088717] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/11/2013] [Accepted: 01/10/2014] [Indexed: 12/31/2022] Open
Abstract
The structural properties of the DNA molecule are known to play a critical role in transcription. In this paper, the structural profiles of promoter regions were studied within the context of their diversity and their function for eleven prokaryotic species; Escherichia coli, Klebsiella pneumoniae, Salmonella Typhimurium, Pseudomonas auroginosa, Geobacter sulfurreducens Helicobacter pylori, Chlamydophila pneumoniae, Synechocystis sp., Synechoccocus elongates, Bacillus anthracis, and the archaea Sulfolobus solfataricus. The main anchor point for these promoter regions were transcription start sites identified through high-throughput experiments or collected within large curated databases. Prokaryotic promoter regions were found to be less stable and less flexible than the genomic mean across all studied species. However, direct comparison between species revealed differences in their structural profiles that can not solely be explained by the difference in genomic GC content. In addition, comparison with functional data revealed that there are patterns in the promoter structural profiles that can be linked to specific functional loci, such as sigma factor regulation or transcription factor binding. Interestingly, a novel structural element clearly visible near the transcription start site was found in genes associated with essential cellular functions and growth in several species. Our analyses reveals the great diversity in promoter structural profiles both between and within prokaryotic species. We observed relationships between structural diversity and functional features that are interesting prospects for further research to yet uncharacterized functional loci defined by DNA structural properties.
Affiliation(s)
- Pieter Meysman
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- Julio Collado-Vides
  - Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Enrique Morett
  - Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
  - Instituto Nacional de Medicina Genómica, Mexico City, Mexico
- Roberto Viola
  - Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- Kristof Engelen
  - Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
  - * E-mail: (KE); (KL)
- Kris Laukens
  - Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
  - Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
  - * E-mail: (KE); (KL)
37
Meysman P, Sonego P, Bianco L, Fu Q, Ledezma-Tejeida D, Gama-Castro S, Liebens V, Michiels J, Laukens K, Marchal K, Collado-Vides J, Engelen K. COLOMBOS v2.0: an ever expanding collection of bacterial expression compendia. Nucleic Acids Res 2013; 42:D649-53. [PMID: 24214998 PMCID: PMC3965013 DOI: 10.1093/nar/gkt1086] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 11/25/2022]
Abstract
The COLOMBOS database (http://www.colombos.net) features comprehensive organism-specific cross-platform gene expression compendia of several bacterial model organisms and is supported by a fully interactive web portal and an extensive web API. COLOMBOS was originally published in PLoS One, and COLOMBOS v2.0 includes both an update of the expression data, by expanding the previously available compendia and by adding compendia for several new species, and an update of the surrounding functionality, with improved search and visualization options and novel tools for programmatic access to the database. The scope of the database has also been extended to incorporate RNA-seq data in our compendia by a dedicated analysis pipeline. We demonstrate the validity and robustness of this approach by comparing the same RNA samples measured in parallel using both microarrays and RNA-seq. As far as we know, COLOMBOS currently hosts the largest homogenized gene expression compendia available for seven bacterial model organisms.
Affiliation(s)
- Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp, Belgium, Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, B-2650 Edegem, Belgium, Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy, Department of Microbial and Molecular Sciences, KU Leuven, Leuven B-3001, Belgium, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent 9052, Belgium and Department of Information Technology, IMinds, Ghent University, Gent 9052, Belgium
38
Balderas-Martínez YI, Savageau M, Salgado H, Pérez-Rueda E, Morett E, Collado-Vides J. Transcription factors in Escherichia coli prefer the holo conformation. PLoS One 2013; 8:e65723. [PMID: 23776535 PMCID: PMC3680503 DOI: 10.1371/journal.pone.0065723] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Received: 01/14/2013] [Accepted: 04/26/2013] [Indexed: 11/18/2022]
Abstract
The transcriptional regulatory network of Escherichia coli K-12 is among the best studied gene networks of any living cell. Transcription factors bind to DNA either with their effector bound (holo conformation), or as a free protein (apo conformation) regulating transcription initiation. By using RegulonDB, the functional conformations (holo or apo) of transcription factors, and their mode of regulation (activator, repressor, or dual) were exhaustively analyzed. We report a striking discovery in the architecture of the regulatory network, finding a strong under-representation of the apo conformation (without allosteric metabolite) of transcription factors when binding to their DNA sites to activate transcription. This observation is supported at the level of individual regulatory interactions on promoters, even if we exclude the promoters regulated by global transcription factors, where three-quarters of the known promoters are regulated by a transcription factor in holo conformation. This genome-scale analysis enables us to ask what are the implications of these observations for the physiology and for our understanding of the ecology of E. coli. We discuss these ideas within the framework of the demand theory of gene regulation.
Affiliation(s)
- Yalbi Itzel Balderas-Martínez
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
  - * E-mail: (YIB-M); (JC-V)
- Michael Savageau
  - Department of Biomedical Engineering, University of California Davis, Davis, California, United States of America
- Heladia Salgado
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Ernesto Pérez-Rueda
  - Departamento de Ingeniería Celular y Biocatálisis, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Enrique Morett
  - Departamento de Ingeniería Celular y Biocatálisis, Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
- Julio Collado-Vides
  - Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, México
  - * E-mail: (YIB-M); (JC-V)
39
Weiss V, Medina-Rivera A, Huerta AM, Santos-Zavaleta A, Salgado H, Morett E, Collado-Vides J. Evidence classification of high-throughput protocols and confidence integration in RegulonDB. Database (Oxford) 2013; 2013:bas059. [PMID: 23327937 PMCID: PMC3548332 DOI: 10.1093/database/bas059] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Indexed: 12/31/2022]
Abstract
RegulonDB provides curated information on the transcriptional regulatory network of Escherichia coli and contains both experimental data and computationally predicted objects. To account for the heterogeneity of these data, we introduced, in version 6.0, a two-tier rating system for the strength of evidence, classifying evidence as either ‘weak’ or ‘strong’ (Gama-Castro,S., Jimenez-Jacinto,V., Peralta-Gil,M. et al. RegulonDB (Version 6.0): gene regulation model of Escherichia Coli K-12 beyond transcription, active (experimental) annotated promoters and textpresso navigation. Nucleic Acids Res., 2008;36:D120–D124.). We now add to our classification scheme the classification of high-throughput evidence, including chromatin immunoprecipitation (ChIP) and RNA-seq technologies. To integrate these data into RegulonDB, we present two strategies for the evaluation of confidence: statistical validation and independent cross-validation. Statistical validation involves verification of ChIP data for transcription factor-binding sites, using tools for motif discovery and quality assessment of the discovered matrices. Independent cross-validation combines independent evidence with the intention of mutually excluding false positives. Both statistical validation and cross-validation make it possible to upgrade subsets of data that are supported by weak evidence to a higher confidence level. Likewise, cross-validation of strong confidence data extends our two-tier rating system to a three-tier system by introducing a third confidence score, ‘confirmed’. Database URL: http://regulondb.ccg.unam.mx/
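The three-tier rating described in this abstract can be illustrated with a minimal sketch. The upgrade rule used here (two mutually independent evidence groups at one level promote an object by one tier) is a simplifying assumption made for illustration and is not taken from RegulonDB's actual rule set; the `Evidence` class, the method labels and the group labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    method: str  # hypothetical method label, e.g. "footprinting"
    group: str   # evidences sharing a group are treated as NOT independent
    level: str   # individual strength: "weak" or "strong"


def confidence(evidences):
    """Return 'weak', 'strong' or 'confirmed' for one annotated object.

    Assumed rule (illustrative only): independent cross-validation
    upgrades the confidence by one tier.
    """
    if not evidences:
        raise ValueError("at least one evidence is required")
    strong_groups = {e.group for e in evidences if e.level == "strong"}
    weak_groups = {e.group for e in evidences if e.level == "weak"}
    # Two independent strong evidences cross-validate each other.
    if len(strong_groups) >= 2:
        return "confirmed"
    # One strong evidence, or two independent weak evidences.
    if strong_groups or len(weak_groups) >= 2:
        return "strong"
    return "weak"
```

Grouping evidence before counting is the point of the sketch: two methods from the same group do not cross-validate each other, so they do not trigger an upgrade.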
Affiliation(s)
- Verena Weiss
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, AP 565-A, Cuernavaca, Morelos 62100, Mexico.
40
Salgado H, Peralta-Gil M, Gama-Castro S, Santos-Zavaleta A, Muñiz-Rascado L, García-Sotelo JS, Weiss V, Solano-Lira H, Martínez-Flores I, Medina-Rivera A, Salgado-Osorio G, Alquicira-Hernández S, Alquicira-Hernández K, López-Fuentes A, Porrón-Sotelo L, Huerta AM, Bonavides-Martínez C, Balderas-Martínez YI, Pannier L, Olvera M, Labastida A, Jiménez-Jacinto V, Vega-Alvarado L, Del Moral-Chávez V, Hernández-Alvarez A, Morett E, Collado-Vides J. RegulonDB v8.0: omics data sets, evolutionary conservation, regulatory phrases, cross-validated gold standards and more. Nucleic Acids Res 2012. [PMID: 23203884 PMCID: PMC3531196 DOI: 10.1093/nar/gks1201] [Citation(s) in RCA: 351] [Impact Index Per Article: 29.3] [Indexed: 01/19/2023]
Abstract
This article summarizes our progress with RegulonDB (http://regulondb.ccg.unam.mx/) during the past 2 years. We have kept up-to-date the knowledge from the published literature regarding transcriptional regulation in Escherichia coli K-12. We have maintained and expanded our curation efforts to improve the breadth and quality of the encoded experimental knowledge, and we have implemented criteria for the quality of our computational predictions. Regulatory phrases now provide high-level descriptions of regulatory regions. We expanded the assignment of quality to various sources of evidence, particularly for knowledge generated through high-throughput (HT) technology. Based on our analysis of most relevant methods, we defined rules for determining the quality of evidence when multiple independent sources support an entry. With this latest release of RegulonDB, we present a new highly reliable larger collection of transcription start sites, a result of our experimental HT genome-wide efforts. These improvements, together with several novel enhancements (the tracks display, uploading format and curational guidelines), address the challenges of incorporating HT-generated knowledge into RegulonDB. Information on the evolutionary conservation of regulatory elements is also available now. Altogether, RegulonDB version 8.0 is a much better home for integrating knowledge on gene regulation from the sources of information currently available.
Affiliation(s)
- Heladia Salgado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, A.P. 565-A, Cuernavaca, Morelos 62100
41
Keseler IM, Mackie A, Peralta-Gil M, Santos-Zavaleta A, Gama-Castro S, Bonavides-Martínez C, Fulcher C, Huerta AM, Kothari A, Krummenacker M, Latendresse M, Muñiz-Rascado L, Ong Q, Paley S, Schröder I, Shearer AG, Subhraveti P, Travers M, Weerasinghe D, Weiss V, Collado-Vides J, Gunsalus RP, Paulsen I, Karp PD. EcoCyc: fusing model organism databases with systems biology. Nucleic Acids Res 2012; 41:D605-12. [PMID: 23143106 PMCID: PMC3531154 DOI: 10.1093/nar/gks1027] [Citation(s) in RCA: 420] [Impact Index Per Article: 35.0] [Indexed: 11/30/2022]
Abstract
EcoCyc (http://EcoCyc.org) is a model organism database built on the genome sequence of Escherichia coli K-12 MG1655. Expert manual curation of the functions of individual E. coli gene products in EcoCyc has been based on information found in the experimental literature for E. coli K-12-derived strains. Updates to EcoCyc content continue to improve the comprehensive picture of E. coli biology. The utility of EcoCyc is enhanced by new tools available on the EcoCyc web site, and the development of EcoCyc as a teaching tool is increasing the impact of the knowledge collected in EcoCyc.
Affiliation(s)
- Ingrid M Keseler
- SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025, USA.
42
Pauling J, Röttger R, Neuner A, Salgado H, Collado-Vides J, Kalaghatgi P, Azevedo V, Tauch A, Pühler A, Baumbach J. On the trail of EHEC/EAEC--unraveling the gene regulatory networks of human pathogenic Escherichia coli bacteria. Integr Biol (Camb) 2012; 4:728-33. [PMID: 22318347 DOI: 10.1039/c2ib00132b] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/21/2022]
Abstract
Pathogenic Escherichia coli, such as Enterohemorrhagic E. coli (EHEC) and Enteroaggregative E. coli (EAEC), are globally widespread bacteria. Some may cause the hemolytic uremic syndrome (HUS). Varying strains cause epidemics all over the world. Recently, we observed an epidemic outbreak of a multi-resistant EHEC strain in Western Europe, mainly in Germany. The Robert Koch Institute reports >4300 infections and >50 deaths (July, 2011). Farmers lost several million EUR since the origin of infection was unclear. Here, we contribute to the currently ongoing research with a computer-aided study of EHEC transcriptional regulatory interactions, a network of genetic switches that control, for instance, pathogenicity, survival and reproduction of bacterial cells. Our strategy is to utilize knowledge of gene regulatory networks from the evolutionary relative E. coli K-12, a harmless strain mainly used for wet lab studies. In order to provide high-potential candidates for human pathogenic E. coli bacteria, such as EHEC, we developed the integrated online database and an analysis platform EhecRegNet. We utilize 3489 known regulations from E. coli K-12 for predictions of yet unknown gene regulatory interactions in 16 human pathogens. For these strains we predict 40,913 regulatory interactions. EhecRegNet is based on the identification of evolutionarily conserved regulatory sites within the DNA of the harmless E. coli K-12 and the pathogens. Identifying and characterizing EHEC's genetic control mechanism network on a large scale will allow for a better understanding of its survival and infection strategies. This will support the development of urgently needed new treatments. EhecRegNet is online via http://www.ehecregnet.de.
Affiliation(s)
- Josch Pauling
- Computational Systems Biology, Max Planck Institute for Informatics, Germany
43
Salgado H, Martínez-Flores I, López-Fuentes A, García-Sotelo JS, Porrón-Sotelo L, Solano H, Muñiz-Rascado L, Collado-Vides J. Extracting regulatory networks of Escherichia coli from RegulonDB. Methods Mol Biol 2012; 804:179-195. [PMID: 22144154 DOI: 10.1007/978-1-61779-361-5_10] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 05/31/2023]
Abstract
RegulonDB contains the largest and currently best-known data set on transcriptional regulation in a single free-living organism, that of Escherichia coli K-12 (Gama-Castro et al. Nucleic Acids Res 36:D120-D124, 2008). This organized knowledge has been the gold standard for the implementation of bioinformatic predictive methods on gene regulation in bacteria (Collado-Vides et al. J Bacteriol 191:23-31, 2009). Given the complexity of the different types of interactions, the difficulty of visualizing the whole network in a single figure, and the different uses of this knowledge, we are making available different views of the genetic network. This chapter describes case studies on how to access these views via precomputed files, web services and SQL, including sigma-gene relationships corresponding to transcription of alternative RNA polymerase holoenzyme promoters, as well as transcription factor (TF)-gene, TF-operon, TF-TF, and TF-regulon interactions.
Affiliation(s)
- Heladia Salgado
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
44
Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D'Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinath G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur Ö, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Reubenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Whirl-Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Novère NL, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD. Erratum: Corrigendum: The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010. [DOI: 10.1038/nbt1210-1308c] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Indexed: 11/10/2022]
45
Keseler IM, Collado-Vides J, Santos-Zavaleta A, Peralta-Gil M, Gama-Castro S, Muñiz-Rascado L, Bonavides-Martinez C, Paley S, Krummenacker M, Altman T, Kaipa P, Spaulding A, Pacheco J, Latendresse M, Fulcher C, Sarker M, Shearer AG, Mackie A, Paulsen I, Gunsalus RP, Karp PD. EcoCyc: a comprehensive database of Escherichia coli biology. Nucleic Acids Res 2010; 39:D583-90. [PMID: 21097882 PMCID: PMC3013716 DOI: 10.1093/nar/gkq1143] [Citation(s) in RCA: 337] [Impact Index Per Article: 24.1] [Indexed: 11/17/2022]
Abstract
EcoCyc (http://EcoCyc.org) is a comprehensive model organism database for Escherichia coli K-12 MG1655. From the scientific literature, EcoCyc captures the functions of individual E. coli gene products; their regulation at the transcriptional, post-transcriptional and protein level; and their organization into operons, complexes and pathways. EcoCyc users can search and browse the information in multiple ways. Recent improvements to the EcoCyc Web interface include combined gene/protein pages and a Regulation Summary Diagram displaying a graphical overview of all known regulatory inputs to gene expression and protein activity. The graphical representation of signal transduction pathways has been updated, and the cellular and regulatory overviews were enhanced with new functionality. A specialized undergraduate teaching resource using EcoCyc is being developed.
Affiliation(s)
- Ingrid M Keseler
- SRI International, 333 Ravenswood Ave., Menlo Park, CA 94025, USA.
46
Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muñiz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, García-Sotelo JS, López-Fuentes A, Porrón-Sotelo L, Alquicira-Hernández S, Medina-Rivera A, Martínez-Flores I, Alquicira-Hernández K, Martínez-Adame R, Bonavides-Martínez C, Miranda-Ríos J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Res 2010; 39:D98-105. [PMID: 21051347 PMCID: PMC3013702 DOI: 10.1093/nar/gkq1110] [Citation(s) in RCA: 246] [Impact Index Per Article: 17.6] [Indexed: 12/26/2022]
Abstract
RegulonDB (http://regulondb.ccg.unam.mx/) is the primary reference database of the best-known regulatory network of any free-living organism, that of Escherichia coli K-12. The major conceptual change since 3 years ago is an expanded biological context so that transcriptional regulation is now part of a unit that initiates with the signal and continues with the signal transduction to the core of regulation, modifying expression of the affected target genes responsible for the response. We call these genetic sensory response units, or Gensor Units. We have initiated their high-level curation, with graphic maps and superreactions with links to other databases. Additional connectivity uses expandable submaps. RegulonDB has summaries for every transcription factor (TF) and TF-binding sites with internal symmetry. Several DNA-binding motifs and their sizes have been redefined and relocated. In addition to data from the literature, we have incorporated our own information on transcription start sites (TSSs) and transcriptional units (TUs), obtained by using high-throughput whole-genome sequencing technologies. A new portable drawing tool for genomic features is also now available, as well as new ways to download the data, including web services, files for several relational database manager systems and text files including BioPAX format.
Affiliation(s)
- Socorro Gama-Castro
- Programa de Genómica Computacional, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, AP 565-A, Cuernavaca, Morelos 62100, México
47
Medina-Rivera A, Abreu-Goodger C, Thomas-Chollier M, Salgado H, Collado-Vides J, van Helden J. Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Res 2010; 39:808-24. [PMID: 20923783 PMCID: PMC3035439 DOI: 10.1093/nar/gkq710] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Indexed: 11/24/2022]
Abstract
Position-specific scoring matrices (PSSMs) are routinely used to predict transcription factor (TF)-binding sites in genome sequences. However, their reliability to predict novel binding sites can be far from optimum, due to the use of a small number of training sites or the inappropriate choice of parameters when building the matrix or when scanning sequences with it. Measures of matrix quality such as E-value and information content rely on theoretical models, and may fail in the context of full genome sequences. We propose a method, implemented in the program ‘matrix-quality’, that combines theoretical and empirical score distributions to assess reliability of PSSMs for predicting TF-binding sites. We applied ‘matrix-quality’ to estimate the predictive capacity of matrices for bacterial, yeast and mouse TFs. The evaluation of matrices from RegulonDB revealed some poorly predictive motifs, and allowed us to quantify the improvements obtained by applying multi-genome motif discovery. Interestingly, the method reveals differences between global and specific regulators. It also highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip (bacterial and yeast TFs) and ChIP–seq experiments (mouse TFs). The method presented here has many applications, including: selecting reliable motifs before scanning sequences; improving motif collections in TF databases; evaluating motifs discovered using high-throughput data sets.
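For readers unfamiliar with PSSMs, the following minimal sketch shows how a log-odds matrix is built from aligned sites and used to scan a sequence. It is generic textbook PSSM scoring, not the ‘matrix-quality’ program; the function names, the uniform background, the pseudocount scheme and the example sites are assumptions chosen for illustration.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a log-odds PSSM (list of per-column dicts) from aligned sites."""
    background = background or {base: 0.25 for base in ALPHABET}
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(width):
        column = [site[i] for site in sites]
        scores = {}
        for base in ALPHABET:
            # Pseudocount distributed according to the background model.
            freq = (column.count(base) + pseudocount * background[base]) / (
                len(sites) + pseudocount
            )
            scores[base] = math.log(freq / background[base])
        pssm.append(scores)
    return pssm


def scan(pssm, sequence):
    """Score every window of the sequence; return (position, score) pairs."""
    width = len(pssm)
    return [
        (pos, sum(col[base] for col, base in zip(pssm, sequence[pos:pos + width])))
        for pos in range(len(sequence) - width + 1)
    ]


# Hypothetical training sites and target sequence.
pssm = build_pssm(["TTGACA", "TTGACA", "TTGATA", "TTTACA"])
hits = scan(pssm, "GGTTGACAGG")
best_position, best_score = max(hits, key=lambda hit: hit[1])
```

With only four training sites the scores depend heavily on the pseudocount, which is the kind of small-sample sensitivity the abstract points to as a source of unreliable predictions.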
Affiliation(s)
- Alejandra Medina-Rivera
- Centro de Ciencias Genomicas, Universidad Nacional Autónoma de México. Av. Universidad s/n. Cuernavaca, Col. Chamilpa, Morelos 62210, Mexico.
48
Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, Wu G, D'Eustachio P, Schaefer C, Luciano J, Schacherer F, Martinez-Flores I, Hu Z, Jimenez-Jacinto V, Joshi-Tope G, Kandasamy K, Lopez-Fuentes AC, Mi H, Pichler E, Rodchenkov I, Splendiani A, Tkachev S, Zucker J, Gopinath G, Rajasimha H, Ramakrishnan R, Shah I, Syed M, Anwar N, Babur O, Blinov M, Brauner E, Corwin D, Donaldson S, Gibbons F, Goldberg R, Hornbeck P, Luna A, Murray-Rust P, Neumann E, Ruebenacker O, Reubenacker O, Samwald M, van Iersel M, Wimalaratne S, Allen K, Braun B, Whirl-Carrillo M, Cheung KH, Dahlquist K, Finney A, Gillespie M, Glass E, Gong L, Haw R, Honig M, Hubaut O, Kane D, Krupa S, Kutmon M, Leonard J, Marks D, Merberg D, Petri V, Pico A, Ravenscroft D, Ren L, Shah N, Sunshine M, Tang R, Whaley R, Letovksy S, Buetow KH, Rzhetsky A, Schachter V, Sobral BS, Dogrusoz U, McWeeney S, Aladjem M, Birney E, Collado-Vides J, Goto S, Hucka M, Le Novère N, Maltsev N, Pandey A, Thomas P, Wingender E, Karp PD, Sander C, Bader GD. The BioPAX community standard for pathway data sharing. Nat Biotechnol 2010; 28:935-42. [PMID: 20829833 PMCID: PMC3001121 DOI: 10.1038/nbt.1666] [Citation(s) in RCA: 432] [Impact Index Per Article: 30.9] [Indexed: 12/20/2022]
Abstract
BioPAX (Biological Pathway Exchange) is a standard language to represent biological pathways at the molecular and cellular level. Its major use is to facilitate the exchange of pathway data (http://www.biopax.org). Pathway data captures our understanding of biological processes, but its rapid growth necessitates development of databases and computational tools to aid interpretation. However, the current fragmentation of pathway information across many databases with incompatible formats presents barriers to its effective use. BioPAX solves this problem by making pathway data substantially easier to collect, index, interpret and share. BioPAX can represent metabolic and signaling pathways, molecular and genetic interactions and gene regulation networks. BioPAX was created through a community process. Through BioPAX, millions of interactions organized into thousands of pathways across many organisms, from a growing number of sources, are available. Thus, large amounts of pathway data are available in a computable form to support visualization, analysis and biological discovery.
Affiliation(s)
- Emek Demir
- Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, New York, USA
49
Abstract
Mapping global protein binding in the E. coli genome reveals extended domains of high protein occupancy. Genome-wide mapping of transcription factor-DNA interactions in bacterial chromosomes in vivo has begun to reveal global zones occupied by these factors that serve two purposes: compacting the bacterial DNA and influencing global programs of gene transcription.
Affiliation(s)
- Agustino Martínez-Antonio
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Irapuato, 36500, México.
50
Pérez AG, Angarica VE, Collado-Vides J, Vasconcelos ATR. From sequence to dynamics: the effects of transcription factor and polymerase concentration changes on activated and repressed promoters. BMC Mol Biol 2009; 10:92. [PMID: 19772633 PMCID: PMC2761915 DOI: 10.1186/1471-2199-10-92] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 05/07/2009] [Accepted: 09/22/2009] [Indexed: 11/25/2022]
Abstract
Background: The fine tuning of two features of the bacterial regulatory machinery has been known to contribute to the diversity of gene expression within the same regulon: the sequence of Transcription Factor (TF) binding sites, and their location with respect to promoters. While variations of binding sequences modulate the strength of the interaction between the TF and its binding sites, the distance between binding sites and promoters alters the interaction between the TF and the RNA polymerase (RNAP). Results: In this paper we estimated the dissociation constants (Kd) of several E. coli TFs in their interaction with variants of their binding sequences from the scores resulting from aligning them to Positional Weight Matrices. A correlation coefficient of 0.78 was obtained when pooling together sites for different TFs. The theoretically estimated Kd values were then used, together with the dissociation constants of the RNAP-promoter interaction, to analyze activated and repressed promoters. The strength of repressor sites -- i.e., the strength of the interaction between TFs and their binding sites -- is slightly higher than that of activated sites. We explored how different factors such as the variation of binding sequences, the occurrence of more than one binding site, or different RNAP concentrations may influence the promoters' response to the variations of TF concentrations. We found that the occurrence of several regulatory sites bound by the same TF close to a promoter -- if they are bound by the TF in an independent manner -- changes the effect of TF concentrations on promoter occupancy, with respect to individual sites. We also found that the occupancy of a promoter will never be more than half if the RNAP concentration-to-Kp ratio is 1 and the promoter is subject to repression, or less than half if the promoter is subject to activation. If the ratio falls to 0.1, the upper limit of occupancy probability for repressed promoters drops below 10%; a descent of the limits also occurs for activated promoters. Conclusion: The number of regulatory sites may thus act as a versatility-producing device, in addition to serving as a source of robustness of the transcription machinery. Furthermore, our results show that the effects of TF concentration fluctuations on promoter occupancy are constrained by RNAP concentrations.
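The occupancy limits quoted in this abstract can be reproduced with a minimal equilibrium-binding sketch. The model below (a repressor that competes with RNAP for the promoter, and an activator that stabilizes RNAP binding through a cooperativity factor `omega`) is a standard simplification assumed here for illustration; it is not claimed to be the exact model used in the paper, and all function and parameter names are hypothetical.

```python
def unregulated_occupancy(r):
    """Promoter occupancy by RNAP alone; r = [RNAP] / Kp."""
    return r / (1.0 + r)


def repressed_occupancy(r, t):
    """Occupancy when a repressor competes with RNAP; t = [TF] / Kd."""
    return r / (1.0 + r + t)


def activated_occupancy(r, t, omega):
    """Occupancy when a bound activator favours RNAP binding (omega > 1)."""
    return r * (1.0 + omega * t) / (1.0 + r + t + omega * r * t)


# With [RNAP]/Kp = 1 the unregulated occupancy is exactly one half:
# it is the ceiling under repression and the floor under activation.
half = unregulated_occupancy(1.0)
repressed = repressed_occupancy(1.0, 5.0)
activated = activated_occupancy(1.0, 5.0, 10.0)

# With [RNAP]/Kp = 0.1 the ceiling for a repressed promoter is 0.1 / 1.1.
low_ceiling = unregulated_occupancy(0.1)
```

In this sketch the TF-free occupancy r/(1+r) is the bound in both regimes: a competing repressor can only lower it and a cooperative activator can only raise it, which is consistent with the statement that RNAP concentration constrains the effect of TF fluctuations.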
Affiliation(s)
- Abel González Pérez
- Centro Nacional de Bioinformática, Industria y San José, Capitolio Nacional, CP 10200, Habana Vieja, Ciudad de la Habana, Cuba.