1
Westmann CA, Alves LDF, Silva-Rocha R, Guazzaroni ME. Mining Novel Constitutive Promoter Elements in Soil Metagenomic Libraries in Escherichia coli. Front Microbiol 2018; 9:1344. [PMID: 29973927 PMCID: PMC6019500 DOI: 10.3389/fmicb.2018.01344] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/18/2018] [Accepted: 05/31/2018] [Indexed: 11/13/2022]
Abstract
Although functional metagenomics has been widely employed for the discovery of genes relevant to biotechnology and biomedicine, its potential for assessing the diversity of transcriptional regulatory elements of microbial communities has remained poorly explored. Here, we experimentally mined novel constitutive promoter sequences in metagenomic libraries by combining a bi-directional reporter vector, high-throughput fluorescence assays and predictive computational methods. Through the expression profiling of fluorescent clones from two independent soil sample libraries, we have analyzed the regulatory dynamics of 260 clones with candidate promoters as a set of active metagenomic promoters in the host Escherichia coli. Through an in-depth analysis of selected clones, we were able to further explore the architecture of metagenomic fragments and to report the presence of multiple promoters per fragment with a dominant promoter driving the expression profile. These approaches resulted in the identification of 33 novel active promoters from metagenomic DNA originating from very diverse phylogenetic groups. The in silico and in vivo analysis of these individual promoters allowed the generation of a constitutive promoter consensus for exogenous sequences recognizable by E. coli in metagenomic studies. The results presented here demonstrate the potential of functional metagenomics for exploring environmental bacterial communities as a source of novel regulatory genetic parts to expand the toolbox for microbial engineering.
Affiliation(s)
- Cauã A Westmann
- Department of Cellular and Molecular Biology, FMRP, University of São Paulo, Ribeirão Preto, Brazil
- Luana de Fátima Alves
- Department of Biology, FFCLRP, University of São Paulo, Ribeirão Preto, Brazil; Department of Biochemistry, FMRP, University of São Paulo, Ribeirão Preto, Brazil
- Rafael Silva-Rocha
- Department of Cellular and Molecular Biology, FMRP, University of São Paulo, Ribeirão Preto, Brazil
2
Kılıç S, Sagitova DM, Wolfish S, Bely B, Courtot M, Ciufo S, Tatusova T, O'Donovan C, Chibucos MC, Martin MJ, Erill I. From data repositories to submission portals: rethinking the role of domain-specific databases in CollecTF. Database: The Journal of Biological Databases and Curation 2016; 2016:baw055. [PMID: 27114493 PMCID: PMC4843526 DOI: 10.1093/database/baw055] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 11/06/2015] [Accepted: 03/20/2016] [Indexed: 11/12/2022]
Abstract
Domain-specific databases are essential resources for the biomedical community, leveraging expert knowledge to curate published literature and provide access to referenced data and knowledge. The limited scope of these databases, however, poses important challenges for their infrastructure, visibility, funding and usefulness to the broader scientific community. CollecTF is a community-oriented database documenting experimentally validated transcription factor (TF)-binding sites in the Bacteria domain. In its quest to become a community resource for the annotation of transcriptional regulatory elements in bacterial genomes, CollecTF aims to move away from the conventional data-repository paradigm of domain-specific databases. Through the adoption of well-established ontologies, identifiers and collaborations, CollecTF has also progressively become a portal for the annotation and submission of information on transcriptional regulatory elements to major biological sequence resources (RefSeq, UniProtKB and the Gene Ontology Consortium). This fundamental change in database conception capitalizes on the domain-specific knowledge of contributing communities to provide high-quality annotations, while leveraging the availability of stable information hubs to promote long-term access and give the data high visibility. As a submission portal, CollecTF generates TF-binding site information through direct annotation of RefSeq genome records, definition of TF-based regulatory networks in UniProtKB entries and submission of functional annotations to the Gene Ontology. As a database, CollecTF provides enhanced search and browsing, targeted data exports, binding motif analysis tools and integration with motif discovery and search platforms. This innovative approach will allow CollecTF to focus its limited resources on the generation of high-quality information and the provision of specialized access to the data. Database URL: http://www.collectf.org/.
Affiliation(s)
- Sefa Kılıç
- Department of Biological Sciences, University of Maryland Baltimore County (UMBC), 1000 Hilltop Circle, Baltimore, MD, 21250, USA
- Dinara M Sagitova
- Department of Biological Sciences, University of Maryland Baltimore County (UMBC), 1000 Hilltop Circle, Baltimore, MD, 21250, USA
- Shoshannah Wolfish
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Benoit Bely
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Mélanie Courtot
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Stacy Ciufo
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Rockville Pike, Bethesda, MD, 20894, USA
- Tatiana Tatusova
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, Rockville Pike, Bethesda, MD, 20894, USA
- Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Marcus C Chibucos
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA; Department of Microbiology and Immunology, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
- Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County (UMBC), 1000 Hilltop Circle, Baltimore, MD, 21250, USA
3
Abbas MM, Mohie-Eldin MM, EL-Manzalawy Y. Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors. PLoS One 2015; 10:e0119721. [PMID: 25803493 PMCID: PMC4372424 DOI: 10.1371/journal.pone.0119721] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 06/12/2014] [Accepted: 01/26/2015] [Indexed: 11/27/2022]
Abstract
As the number of sequenced bacterial genomes increases, the need for rapid and reliable tools for the annotation of functional elements (e.g., transcriptional regulatory elements) becomes more pressing. Promoters are the key regulatory elements, which recruit the transcriptional machinery through binding to a variety of regulatory proteins (known as sigma factors). The identification of the promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach for computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and utilize them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three criteria, a prediction model might perform very well on cross-validation experiments while its performance on independent test data is very poor. This emphasizes the importance of evaluating promoter region predictors using independent test data, which corrects for the over-optimistic performance that might be estimated using the cross-validation procedure. Our analysis of the tested models shows that good prediction models often perform well regardless of how the non-promoter data was obtained. On the other hand, poor prediction models seem to be more sensitive to the choice of non-promoter sequences. Interestingly, the best performing sequence-based classifiers outperform the best performing structure-based classifiers on both cross-validation and independent test performance evaluation experiments. Finally, we propose a meta-predictor method combining two top performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
Affiliation(s)
- Mostafa M. Abbas
- KINDI Center for Computing Research, College of Engineering, Qatar University, Doha, Qatar
- Yasser EL-Manzalawy
- Systems and Computer Engineering, Al-Azhar University, Cairo, Egypt
- College of Information Sciences, Penn State University, University Park, United States of America
4
Azmi AM, Al-Ssulami A. Encoded expansion: an efficient algorithm to discover identical string motifs. PLoS One 2014; 9:e95148. [PMID: 24871320 PMCID: PMC4037181 DOI: 10.1371/journal.pone.0095148] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/14/2013] [Accepted: 03/24/2014] [Indexed: 11/19/2022]
Abstract
A major task in computational biology is the discovery of short recurring string patterns known as motifs. Most of the schemes to discover motifs are either stochastic or combinatorial in nature. Stochastic approaches do not guarantee finding the correct motifs, while the combinatorial schemes tend to have an exponential time complexity with respect to motif length. To alleviate the cost, the combinatorial approach exploits dynamic data structures such as trees or graphs. Recently, Karci (2009; Efficient automatic exact motif discovery algorithms for biological sequences, Expert Systems with Applications 36:7952-7963) devised a deterministic algorithm that finds all the identical copies of string motifs of all sizes [Formula: see text] in theoretical time complexity of [Formula: see text] and a space complexity of [Formula: see text] where [Formula: see text] is the length of the input sequence and [Formula: see text] is the length of the longest possible string motif. In this paper, we present a significant improvement on Karci's original algorithm. The algorithm that we propose reports all identical string motifs of sizes [Formula: see text] that occur at least [Formula: see text] times. Our algorithm starts with string motifs of size 2, and at each iteration it expands the candidate string motifs by one symbol, discarding those that occur fewer than [Formula: see text] times in the entire input sequence. We use a simple array and data encoding to achieve theoretical worst-case time complexity of [Formula: see text] and a space complexity of [Formula: see text]. Encoding of the substrings can speed up the process of comparison between string motifs. Experimental results on random and real biological sequences confirm that our algorithm has indeed a linear time complexity and it is more scalable in terms of sequence length than the existing algorithms.
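The expand-and-filter idea described in this abstract (start from motifs of size 2, grow each candidate by one symbol, drop candidates below an occurrence threshold) can be sketched as follows. This is only an illustration under stated assumptions: the function name `repeated_motifs` and its parameters are hypothetical, plain dictionaries stand in for the paper's array and substring encoding, and the sketch makes no claim to the complexity bounds reported by the authors.

```python
from collections import defaultdict


def repeated_motifs(seq, min_count, max_len=None):
    """Return {motif: [start positions]} for every exact substring of
    length >= 2 that occurs at least min_count times in seq.

    Hypothetical helper: dictionary-based, not the paper's encoding."""
    if max_len is None:
        max_len = len(seq)

    # Seed with all substrings of length 2 and their start positions.
    positions = defaultdict(list)
    for i in range(len(seq) - 1):
        positions[seq[i:i + 2]].append(i)
    current = {m: p for m, p in positions.items() if len(p) >= min_count}

    found = dict(current)
    length = 2
    while current and length < max_len:
        # Extend every surviving occurrence by the next symbol in seq.
        extended = defaultdict(list)
        for motif, starts in current.items():
            for s in starts:
                end = s + length
                if end < len(seq):
                    extended[motif + seq[end]].append(s)
        # Keep only extensions that still meet the occurrence threshold.
        current = {m: p for m, p in extended.items() if len(p) >= min_count}
        found.update(current)
        length += 1
    return found
```

Calling `repeated_motifs("ACGTACGTAC", 2)` reports, among others, `"AC"` at positions 0, 4 and 8; overlapping occurrences are counted in this sketch.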
Affiliation(s)
- Aqil M. Azmi
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Abdulrakeeb Al-Ssulami
- Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh, Saudi Arabia
5
Kılıç S, White ER, Sagitova DM, Cornish JP, Erill I. CollecTF: a database of experimentally validated transcription factor-binding sites in Bacteria. Nucleic Acids Res 2013; 42:D156-60. [PMID: 24234444 PMCID: PMC3965012 DOI: 10.1093/nar/gkt1123] [Citation(s) in RCA: 70] [Impact Index Per Article: 6.4] [Indexed: 01/21/2023]
Abstract
The influx of high-throughput data and the need for complex models to describe the interaction of prokaryotic transcription factors (TF) with their target sites pose new challenges for TF-binding site databases. CollecTF (http://collectf.umbc.edu) compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain, placing a strong emphasis on the transparency of the curation process, the quality and availability of the stored data and fully customizable access to its records. CollecTF integrates multiple sources of data automatically and openly, allowing users to dynamically redefine binding motifs and their experimental support base. Data quality and currency are fostered in CollecTF by adopting a sustainable model that encourages direct author submissions in combination with in-house validation and curation of published literature. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features, maximizing the visibility of the data and enriching the annotation of RefSeq files with regulatory information. Seeking to facilitate comparative genomics and machine-learning analyses of regulatory interactions, in its initial release CollecTF provides domain-wide coverage of two TF families (LexA and Fur), as well as extensive representation for a clinically important bacterial family, the Vibrionaceae.
Affiliation(s)
- Ivan Erill
- *To whom correspondence should be addressed. Tel: +1 410 455 2470; Fax: +1 410 455 3875.
6
Kamzolova S, Beskaravainy P, Osypov A, Dzhelyadin T, Temlyakova E, Sorokin A. Electrostatic map of T7 DNA: comparative analysis of functional and electrostatic properties of T7 RNA polymerase-specific promoters. J Biomol Struct Dyn 2013; 32:1184-92. [DOI: 10.1080/07391102.2013.819298] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/26/2022]
7
Mishra H, Singh N, Misra K, Lahiri T. An ANN-GA model based promoter prediction in Arabidopsis thaliana using tilling microarray data. Bioinformation 2011; 6:240-3. [PMID: 21887014 PMCID: PMC3159145 DOI: 10.6026/97320630006240] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 02/10/2011] [Accepted: 05/09/2011] [Indexed: 11/23/2022]
Abstract
Identification of promoter regions is an important part of gene annotation. Identification of promoters in eukaryotes is important as promoters modulate various metabolic functions and cellular stress responses. In this work, a novel approach utilizing intensity values of tilling microarray data for a model eukaryotic plant, Arabidopsis thaliana, was used to distinguish promoter regions from non-promoter regions. A feed-forward back propagation neural network model supported by a genetic algorithm was employed to predict the class of data with a window size of 41. A dataset comprising 2992 data vectors representing both promoter and non-promoter regions, chosen randomly from probe intensity vectors for the whole genome of Arabidopsis thaliana generated through the tilling microarray technique, was used. The classifier model shows prediction accuracy of 69.73% and 65.36% on training and validation sets, respectively. Further, a concept of distance-based class membership was used to validate the reliability of the classifier, which showed promising results. The study shows the usability of microarray probe intensities to predict the promoter regions in eukaryotic genomes.
Affiliation(s)
- Hrishikesh Mishra
- Division of Applied Sciences and Indo-Russian Centre for Biotechnology, Indian Institute of Information Technology, Allahabad, India
8
Vingron M, Brazma A, Coulson R, van Helden J, Manke T, Palin K, Sand O, Ukkonen E. Integrating sequence, evolution and functional genomics in regulatory genomics. Genome Biol 2009; 10:202. [PMID: 19226437 PMCID: PMC2687781 DOI: 10.1186/gb-2009-10-1-202] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Indexed: 11/10/2022]
Abstract
With genome analysis expanding from the study of genes to the study of gene regulation, 'regulatory genomics' utilizes sequence information, evolution and functional genomics measurements to unravel how regulatory information is encoded in the genome.
Affiliation(s)
- Martin Vingron
- Computational Molecular Biology, Max-Planck-Institut für molekulare Genetik, Berlin, Germany.
9
Rangannan V, Bansal M. Relative stability of DNA as a generic criterion for promoter prediction: whole genome annotation of microbial genomes with varying nucleotide base composition. Mol Biosyst 2009; 5:1758-69. [DOI: 10.1039/b906535k] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 11/21/2022]
10
Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences. Nat Protoc 2008; 3:1589-603. [PMID: 18802440 DOI: 10.1038/nprot.2008.98] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Indexed: 01/29/2023]
Abstract
This protocol explains how to discover functional signals in genomic sequences by detecting over- or under-represented oligonucleotides (words) or spaced pairs thereof (dyads) with the Regulatory Sequence Analysis Tools (http://rsat.ulb.ac.be/rsat/). Two typical applications are presented: (i) predicting transcription factor-binding motifs in promoters of coregulated genes and (ii) discovering phylogenetic footprints in promoters of orthologous genes. The steps of this protocol include purging genomic sequences to discard redundant fragments, discovering over-represented patterns and assembling them to obtain degenerate motifs, scanning sequences and drawing feature maps. The main strength of the method is its statistical ground: the binomial significance provides an efficient control on the rate of false positives. In contrast with optimization-based pattern discovery algorithms, the method supports the detection of under- as well as over-represented motifs. Computation times vary from seconds (gene clusters) to minutes (whole genomes). The execution of the whole protocol should take approximately 1 h.
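The binomial significance mentioned in this abstract can be illustrated with a small word-counting sketch. It is not the RSAT implementation: the function names, the independent-base background model passed in as `background`, and the `alpha` cutoff are assumptions made here for illustration, and the sketch omits refinements such as multiple-testing correction or special handling of self-overlapping words.

```python
from collections import Counter
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def overrepresented_words(sequences, k, background, alpha=0.01):
    """Return (word, occurrences, p_value) tuples, most significant first.

    background maps each base to its assumed prior probability."""
    counts = Counter()
    total = 0  # number of word positions scanned across all sequences
    for s in sequences:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            total += 1

    hits = []
    for word, occ in counts.items():
        # Expected per-position probability under independent bases.
        p = 1.0
        for base in word:
            p *= background[base]
        pval = binomial_tail(total, occ, p)
        if pval <= alpha:
            hits.append((word, occ, pval))
    return sorted(hits, key=lambda h: h[2])
```

With a uniform background, a hexamer repeated several times in a handful of short sequences ranks first because its observed count is far above the roughly `total / 4096` occurrences expected by chance.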
11
Rivière R, Barth D, Cohen J, Denise A. Shuffling biological sequences with motif constraints. J Discrete Algorithms 2008. [DOI: 10.1016/j.jda.2007.06.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 10/23/2022]
12
Dekhtyar M, Morin A, Sakanyan V. Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes. BMC Bioinformatics 2008; 9:233. [PMID: 18471287 PMCID: PMC2412878 DOI: 10.1186/1471-2105-9-233] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 11/19/2007] [Accepted: 05/09/2008] [Indexed: 11/17/2022]
Abstract
BACKGROUND Bacterial promoters, which increase the efficiency of gene expression, differ from other promoters by several characteristics. This difference, not yet widely exploited in bioinformatics, looks promising for the development of relevant computational tools to search for strong promoters in bacterial genomes. RESULTS We describe a new triad pattern algorithm that predicts strong promoter candidates in annotated bacterial genomes by matching specific patterns for the group I sigma70 factors of Escherichia coli RNA polymerase. It detects promoter-specific motifs by consecutively matching three patterns, consisting of an UP-element, required for interaction with the alpha subunit, and then optimally-separated patterns of -35 and -10 boxes, required for interaction with the sigma70 subunit of RNA polymerase. Analysis of 43 bacterial genomes revealed that the frequency of candidate sequences depends on the A+T content of the DNA under examination. The accuracy of in silico prediction was experimentally validated for the genome of a hyperthermophilic bacterium, Thermotoga maritima, by applying a cell-free expression assay using the predicted strong promoters. In this organism, the strong promoters govern genes for translation, energy metabolism, transport, cell movement, and other as-yet unidentified functions. CONCLUSION The triad pattern algorithm developed for predicting strong bacterial promoters is well suited for analyzing bacterial genomes with an A+T content of less than 62%. This computational tool opens new prospects for investigating global gene expression, and individual strong promoters in bacteria of medical and/or economic significance.
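The consecutive matching of three patterns described in this abstract (an UP-element, then a -35 box and a -10 box separated by a near-optimal spacer) can be sketched with a single regular expression. The patterns below are assumptions chosen for illustration: an 8-base A/T-only stretch stands in for the UP-element, and the boxes are the exact E. coli sigma70 consensus hexamers TTGACA and TATAAT with a 16 to 18 bp spacer. The paper's actual patterns and tolerances are not given in the abstract, and a practical tool would allow mismatches.

```python
import re

# Assumed, illustrative patterns; not the ones used in the paper.
TRIAD = re.compile(
    r"(?P<up>[AT]{8})"            # stand-in for an A/T-rich UP-element
    r"[ACGT]{0,4}"                # short gap before the -35 box
    r"(?P<box35>TTGACA)"          # sigma70 -35 consensus
    r"(?P<spacer>[ACGT]{16,18})"  # spacer around the 17 bp optimum
    r"(?P<box10>TATAAT)"          # sigma70 -10 consensus
)


def find_triads(seq):
    """Return (start, spacer_length) for each non-overlapping triad match."""
    return [
        (m.start(), len(m.group("spacer")))
        for m in TRIAD.finditer(seq.upper())
    ]
```

Requiring all three elements in order makes the search far more selective than scanning for either hexamer alone, which is the intuition behind looking for strong promoter candidates this way.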
Affiliation(s)
- Amelie Morin
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- Vehary Sakanyan
- Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- ProtNeteomix, 2 rue de la Houssinière, 44322 Nantes, France
13
Vinga S, Almeida JS. Local Renyi entropic profiles of DNA sequences. BMC Bioinformatics 2007; 8:393. [PMID: 17939871 PMCID: PMC2238722 DOI: 10.1186/1471-2105-8-393] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Received: 05/10/2007] [Accepted: 10/16/2007] [Indexed: 11/18/2022]
Abstract
Background In a recent report the authors presented a new measure of continuous entropy for DNA sequences, which allows the estimation of their randomness level. The definition therein explored was based on the Rényi entropy of probability density estimation (pdf) using the Parzen's window method and applied to Chaos Game Representation/Universal Sequence Maps (CGR/USM). Subsequent work proposed a fractal pdf kernel as a more exact solution for the iterated map representation. This report extends the concepts of continuous entropy by defining DNA sequence entropic profiles using the new pdf estimations to refine the density estimation of motifs. Results The new methodology enables two results. On the one hand it shows that the entropic profiles are directly related with the statistical significance of motifs, allowing the study of under and over-representation of segments. On the other hand, by spanning the parameters of the kernel function it is possible to extract important information about the scale of each conserved DNA region. The computational applications, developed in Matlab m-code, the corresponding binary executables and additional material and examples are made publicly available at . Conclusion The ability to detect local conservation from a scale-independent representation of symbolic sequences is particularly relevant for biological applications where conserved motifs occur in multiple, overlapping scales, with significant future applications in the recognition of foreign genomic material and inference of motif structures.
Affiliation(s)
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal.
14
Nakata K, Tanaka Y, Nakano T, Adachi T, Tanaka H, Kaminuma T, Ishikawa T. Nuclear receptor-mediated transcriptional regulation in Phase I, II, and III xenobiotic metabolizing systems. Drug Metab Pharmacokinet 2007; 21:437-57. [PMID: 17220560 DOI: 10.2133/dmpk.21.437] [Citation(s) in RCA: 146] [Impact Index Per Article: 8.6] [Indexed: 12/19/2022]
Abstract
Studies of the genetic regulation involved in drug metabolizing enzymes and drug transporters are of great interest to understand the molecular mechanisms of drug response and toxic events. Recent reports have revealed that hydrophobic ligands and several nuclear receptors are involved in the induction or down-regulation of various enzymes and transporters involved in Phase I, II, and III xenobiotic metabolizing systems. Nuclear receptors (NRs) form a family of ligand-activated transcription factors (TFs). These proteins modulate the regulation of target genes by contacting their promoter or enhancer sequences at specific recognition sites. These target genes include metabolizing enzymes such as cytochrome P450s (CYPs), transporters, and NRs. Thus, it is now recognized that these NRs play an essential role in sensing and processing xenobiotic substances including drugs, environmental chemical pollutants and nutritional ingredients. From the literature, we selected the target genes of each NR in xenobiotic response systems. Possible cross-talk, by which xenobiotics may exert undesirable effects, was listed. For example, the role of NRs in cholesterol and bile acid homeostasis in human hepatocytes was comprehensively mapped. Summarizing the current state of related research, especially for in silico response element search, we tried to elucidate nuclear receptor-mediated xenobiotic processing loops and direct future research.
15
Almeida JS, Vinga S. Computing distribution of scale independent motifs in biological sequences. Algorithms Mol Biol 2006; 1:18. [PMID: 17049089 PMCID: PMC1630425 DOI: 10.1186/1748-7188-1-18] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Received: 05/03/2006] [Accepted: 10/18/2006] [Indexed: 11/10/2022]
Abstract
The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representation of sequences where its statistical properties can be conveniently investigated. In this study a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary lengths in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. Therefore, the fractal kernel described is in fact a generalization that provides a common framework for a diverse suite of sequence analysis techniques.
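For readers unfamiliar with the Chaos Game Representation that this abstract builds on, the iterated map itself can be sketched in a few lines. The corner assignment and the starting point below are one common convention and are assumptions here; the fractal kernel density proposed in the paper is not reproduced.

```python
# Assumed corner convention for the unit square; other assignments are in use.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map a DNA string to its sequence of CGR coordinates.

    Each step moves halfway from the current point to the corner
    assigned to the next base."""
    x, y = start
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

Two sequences that end in the same k bases land within 2^-k of each other in each coordinate, which is why motifs of any length can be read off the map. The converse does not hold exactly, since two nearby points need not share a long suffix; this is the inexact association between distance and sequence dissimilarity that the abstract refers to.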
Affiliation(s)
- Jonas S Almeida
- Dept Biostatistics and Applied Mathematics, Univ. Texas MDAnderson Cancer Center, 1515 Holcombe Blvd, Houston TX 77030-4009, USA
- Susana Vinga
- Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), R. Alves Redol 9, 1000-029 Lisboa, Portugal
- Departamento de Bioestatística e Informática, Faculdade de Ciências Médicas – Universidade Nova de Lisboa (FCM/UNL), Campo dos Mártires da Pátria 130, 1169-056 Lisboa, Portugal
16
Carvalho AM, Freitas AT, Oliveira AL, Sagot MF. An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:126-40. [PMID: 17048399 DOI: 10.1109/tcbb.2006.16] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Indexed: 05/12/2023]
Abstract
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. This type of conserved regions, called structured motifs, is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological data sets shows its ability to extract relevant consensi.
17
Abstract
The strongest signal of plant promoter is searched with the model of single motif with two types. It turns out that the dominant type is the TATA-box. The other type may be called TATA-less signal, and may be used in gene finders for promoter recognition. While the TATA signals are very close for the monocot and the dicot, their TATA-less signals are significantly different. A general and flexible multi-motif model is also proposed for promoter analysis based on dynamic programming. By extending the Gibbs sampler to the dynamic programming and introducing temperature, an efficient algorithm is developed for searching signals in plant promoters.
18
Pisanti N, Crochemore M, Grossi R, Sagot MF. Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM Trans Comput Biol Bioinform 2005; 2:40-50. [PMID: 17044163 DOI: 10.1109/tcbb.2005.5] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Indexed: 05/12/2023]
Abstract
Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus, smaller) than previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to efficiently compute such bases unless the quorum is fixed.
Affiliation(s)
- Nadia Pisanti
- Dipartimento di Informatica, Università di Pisa, Italy.
|
19
|
Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol 2004; 231:377-88. [PMID: 15501469 DOI: 10.1016/j.jtbi.2004.06.030] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Received: 05/21/2004] [Accepted: 06/30/2004] [Indexed: 11/20/2022]
Abstract
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique to evaluate global sequence randomness without some of the drawbacks of the former method. The asymptotic behaviour of this new measure was analytically deduced and the calculation of entropies for several synthetic and experimental biological sequences was performed. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences have shown a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provide additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
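The measure described in this abstract can be illustrated with a minimal Python sketch (the authors' own code is in MATLAB and is not reproduced here). It assumes a standard chaos game representation (CGR) in the unit square and a spherical Gaussian Parzen kernel, for which the quadratic entropy estimate is $H_2 = -\ln\big(\tfrac{1}{N^2}\sum_i\sum_j G(x_i - x_j;\, 2\sigma^2 I)\big)$. The corner assignment, the starting point, and the names `cgr_points` and `renyi_quadratic_entropy` are assumptions made for illustration, and the exact normalization used in the paper may differ.

```python
import math

# Corner assignment for the CGR unit square (one common convention; an assumption here).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Map a DNA sequence to points in the unit square by iterated midpoints."""
    x, y = start
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, sigma2):
    """Parzen-window estimate of Renyi's quadratic entropy for 2-D points.

    sigma2 is the variance of the Gaussian kernel; the pairwise sum uses
    variance 2 * sigma2 because two kernels are convolved.
    """
    n = len(points)
    var = 2.0 * sigma2
    norm = 1.0 / (2.0 * math.pi * var)  # 2-D Gaussian normalizing constant
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * var))
    return -math.log(total / (n * n))
```

Under these assumptions a highly repetitive sequence such as a poly-A run clusters near one corner and yields a lower entropy than a compositionally mixed sequence of the same length. The double loop is quadratic in sequence length, so this form is only practical for short sequences.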
Affiliation(s)
- Susana Vinga
- Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, R. Qta. Grande 6, 2780-156 Oeiras, Portugal.
|
20
|
Wang H, Noordewier M, Benham CJ. Stress-induced DNA duplex destabilization (SIDD) in the E. coli genome: SIDD sites are closely associated with promoters. Genome Res 2004; 14:1575-84. [PMID: 15289476 PMCID: PMC509266 DOI: 10.1101/gr.2080004] [Citation(s) in RCA: 62] [Impact Index Per Article: 3.1] [Indexed: 12/28/2022]
Abstract
We present the first analysis of stress-induced DNA duplex destabilization (SIDD) in a complete chromosome, the Escherichia coli K12 genome. We used a newly developed method to calculate the locations and extents of stress-induced destabilization to single-base resolution at superhelix density sigma = -0.06. We find that SIDD sites in this genome show a statistically highly significant tendency to avoid coding regions. And among intergenic regions, those that either contain documented promoters or occur between divergently transcribing coding regions, and hence may be inferred to contain promoters, are associated with strong SIDD sites in a statistically highly significant manner. Intergenic regions located between convergently transcribing genes, which are inferred not to contain promoters, are not significantly enriched for destabilized sites. Statistical analysis shows that a strongly destabilized intergenic region has an 80% chance of containing a promoter, whereas an intergenic region that does not contain a strong SIDD site has only a 24% chance. We describe how these observations may illuminate specific mechanisms of regulation, and assist in the computational identification of promoter locations in prokaryotes.
Affiliation(s)
- Huiquan Wang
- UC Davis Genome Center, University of California, Davis, California 95616, USA
|
21
|
Rombauts S, Florquin K, Lescot M, Marchal K, Rouzé P, van de Peer Y. Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol 2003; 132:1162-76. [PMID: 12857799 PMCID: PMC167057 DOI: 10.1104/pp.102.017715] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.7] [Received: 11/14/2002] [Revised: 01/10/2003] [Accepted: 03/17/2003] [Indexed: 05/19/2023]
Abstract
The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, in general, unambiguous identification of regulatory elements is not straightforward. The delineation of promoters is even harder, due to their complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
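The "search by content" idea in this abstract can be sketched as a sliding-window scan that computes GC content and the observed/expected CpG ratio, $(\#\mathrm{CG} \times L) / (\#\mathrm{C} \times \#\mathrm{G})$, for each window of length $L$. The Python below is an illustration only, not the authors' procedure. The function names are hypothetical, and the default thresholds are placeholders in the style of vertebrate CpG-island criteria, which the abstract states do not transfer to plants. CpNpG islands are not covered.

```python
def cpg_obs_exp(window):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    window = window.upper()
    c, g = window.count("C"), window.count("G")
    if c == 0 or g == 0:
        return 0.0
    return window.count("CG") * len(window) / (c * g)


def gc_fraction(window):
    """Fraction of C and G bases in the window."""
    window = window.upper()
    if not window:
        return 0.0
    return (window.count("C") + window.count("G")) / len(window)


def scan_cpg(seq, size=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Yield (start, gc, ratio) for every window that passes both thresholds.

    The thresholds are illustrative placeholders and would need to be
    re-estimated for a plant genome.
    """
    for start in range(0, max(len(seq) - size, 0) + 1, step):
        window = seq[start:start + size]
        gc, ratio = gc_fraction(window), cpg_obs_exp(window)
        if gc >= min_gc and ratio >= min_ratio:
            yield start, gc, ratio
```

Running `scan_cpg` over sequence upstream and downstream of an annotated transcription start site, with different threshold values, is one simple way to explore the compositional landscape the abstract refers to.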
Affiliation(s)
- Stephane Rombauts
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology, Ghent University, B-9000 Gent, Belgium
|
22
|
Cowell LG, Davila M, Yang K, Kepler TB, Kelsoe G. Prospective estimation of recombination signal efficiency and identification of functional cryptic signals in the genome by statistical modeling. J Exp Med 2003; 197:207-20. [PMID: 12538660 PMCID: PMC2193808 DOI: 10.1084/jem.20020250] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.5] [Received: 02/14/2002] [Accepted: 12/05/2002] [Indexed: 12/03/2022] Open
Abstract
The recombination signals (RS) that guide V(D)J recombination are phylogenetically conserved but retain a surprising degree of sequence variability, especially in the nonamer and spacer. To characterize RS variability, we computed the position-wise information, a measure correlated with sequence conservation, for each nucleotide position in an RS alignment and demonstrate that most position-wise information is present in the RS heptamers and nonamers. We have previously demonstrated significant correlations between RS positions and here show that statistical models of the correlation structure that underlies RS variability efficiently identify physiologic and cryptic RS and accurately predict the recombination efficiencies of natural and synthetic RS. In scans of mouse and human genomes, these models identify a highly conserved family of repetitive DNA as an unexpected source of frequent, cryptic RS that rearrange both in extrachromosomal substrates and in their genomic context.
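Position-wise information for an alignment is commonly computed as the maximum entropy of the alphabet minus the Shannon entropy of each column, $I_k = \log_2 4 - H_k$ bits for DNA. The sketch below assumes that standard definition, without the small-sample correction that is often applied, so it may differ from the authors' exact calculation. The function name and the toy alignment in the usage note are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def positionwise_information(alignment, alphabet="ACGT"):
    """Return I_k = log2(|alphabet|) - H_k (in bits) for each alignment column k."""
    length = len(alignment[0])
    if any(len(s) != length for s in alignment):
        raise ValueError("all aligned sequences must have the same length")
    max_bits = math.log2(len(alphabet))
    info = []
    for k in range(length):
        counts = Counter(s[k].upper() for s in alignment)
        n = sum(counts.values())
        # Shannon entropy of the observed base frequencies in column k.
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(max_bits - entropy)
    return info
```

For the toy alignment `["ACGT", "ACGA", "ACCC", "ACTG"]`, the two fully conserved columns carry 2 bits each, the third carries 0.5 bits, and the last, where all four bases occur once, carries 0 bits.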
Affiliation(s)
- Lindsay G Cowell
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
|
23
|
Cowell LG, Davila M, Kepler TB, Kelsoe G. Identification and utilization of arbitrary correlations in models of recombination signal sequences. Genome Biol 2002; 3:RESEARCH0072. [PMID: 12537561 PMCID: PMC151174 DOI: 10.1186/gb-2002-3-12-research0072] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.1] [Received: 06/24/2002] [Revised: 09/04/2002] [Accepted: 10/10/2002] [Indexed: 01/26/2023] Open
Abstract
BACKGROUND: A significant challenge in bioinformatics is to develop methods for detecting and modeling patterns in variable DNA sequence sites, such as protein-binding sites in regulatory DNA. Current approaches sometimes perform poorly when positions in the site do not independently affect protein binding. We developed a statistical technique for modeling the correlation structure in variable DNA sequence sites. The method places no restrictions on the number of correlated positions or on their spatial relationship within the site. No prior empirical evidence for the correlation structure is necessary. RESULTS: We applied our method to the recombination signal sequences (RSS) that direct assembly of B-cell and T-cell antigen-receptor genes via V(D)J recombination. The technique is based on model selection by cross-validation and produces models that allow computation of an information score for any signal-length sequence. We also modeled RSS using order zero and order one Markov chains. The scores from all models are highly correlated with measured recombination efficiencies, but the models arising from our technique are better than the Markov models at discriminating RSS from non-RSS. CONCLUSIONS: Our model-development procedure produces models that estimate well the recombinogenic potential of RSS and are better at RSS recognition than the order zero and order one Markov models. Our models are, therefore, valuable for studying the regulation of both physiologic and aberrant V(D)J recombination. The approach could be equally powerful for the study of promoter and enhancer elements, splice sites, and other DNA regulatory sites that are highly variable at the level of individual nucleotide positions.
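The two baseline models named in this abstract can be sketched as follows. The code assumes position-specific chains fitted to aligned, equal-length sites: order zero treats positions as independent, and order one conditions each position on the base immediately before it. The function names, the pseudocount, and the toy sites are assumptions for illustration. The authors' own correlation models, selected by cross-validation, are not shown here.

```python
import math

ALPHABET = "ACGT"


def fit_order0(sites, pseudo=1.0):
    """Per-position base probabilities (positions treated as independent)."""
    model = []
    for k in range(len(sites[0])):
        counts = {b: pseudo for b in ALPHABET}
        for s in sites:
            counts[s[k]] += 1
        total = sum(counts.values())
        model.append({b: counts[b] / total for b in ALPHABET})
    return model


def fit_order1(sites, pseudo=1.0):
    """Per-position probabilities conditioned on the base at the previous position."""
    model = []
    for k in range(1, len(sites[0])):
        counts = {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
        for s in sites:
            counts[s[k - 1]][s[k]] += 1
        model.append({a: {b: counts[a][b] / sum(counts[a].values())
                          for b in ALPHABET} for a in ALPHABET})
    return model


def score_order0(seq, m0):
    """Log2 probability of seq under the independent-position model."""
    return sum(math.log2(m0[k][b]) for k, b in enumerate(seq))


def score_order1(seq, m0, m1):
    """Log2 probability of seq under the first-order model."""
    score = math.log2(m0[0][seq[0]])
    for k in range(1, len(seq)):
        score += math.log2(m1[k - 1][seq[k - 1]][seq[k]])
    return score
```

With the toy training set `["AC", "AC", "GT", "GT"]`, the order-zero model scores `"AC"` and `"AT"` identically, while the order-one model prefers `"AC"`. That is the kind of between-position dependence an independent-position model cannot capture.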
Affiliation(s)
- Lindsay G Cowell
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
- Marco Davila
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
- Thomas B Kepler
- Center for Bioinformatics and Computational Biology, Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC 27710, USA
- Garnett Kelsoe
- Department of Immunology, Duke University Medical Center, Durham, NC 27710, USA
|
24
|
Martin RG, Rosner JL. Genomics of the marA/soxS/rob regulon of Escherichia coli: identification of directly activated promoters by application of molecular genetics and informatics to microarray data. Mol Microbiol 2002; 44:1611-24. [PMID: 12067348 DOI: 10.1046/j.1365-2958.2002.02985.x] [Citation(s) in RCA: 119] [Impact Index Per Article: 5.4] [Indexed: 11/20/2022]
Abstract
Microarray analyses are providing a plethora of data concerning transcriptional responses to specific gene regulators and their inducers but do not distinguish between direct and indirect responses. Here, we identify directly activated promoters of the overlapping marA, soxS and rob regulon(s) of Escherichia coli by applying informatics, genomics and molecular genetics to microarray data obtained by others. Those studies found that overexpression of marA, or the treatment of cells with salicylate to derepress marA, or treatment with paraquat to induce soxS, resulted in elevated transcription of 153 genes. However, only 27 out of the promoters showed increased transcription under at least two of the aforementioned conditions and eight of those were previously known to be directly activated. A computer algorithm was used to identify potential activator binding sites located upstream of the remaining 19 promoters of this subset, and conventional genetic and biochemical approaches were applied to test whether these sites are critical for activation by the homologous MarA, SoxS and Rob transcriptional activators. Only seven out of the 19 promoters were found to be activated when fused to lacZ and tested as single lysogens. All seven contained an essential activator binding site. The remaining promoters were insensitive to stimulation by the inducers suggesting that the great majority of elevated microarray transcripts either were misidentified or resulted from indirect effects requiring sequences outside of the promoter region. We estimate that the total number of directly activated promoters in the regulon is less than 40.
Affiliation(s)
- Robert G Martin
- Laboratory of Molecular Biology, National Institute of Diabetes and Digestive and Kidney Diseases/NIH, Bethesda, MD 20892-0560, USA.
|
25
|
Chen S, Lesnik EA, Hall TA, Sampath R, Griffey RH, Ecker DJ, Blyn LB. A bioinformatics based approach to discover small RNA genes in the Escherichia coli genome. Biosystems 2002; 65:157-77. [PMID: 12069726 DOI: 10.1016/s0303-2647(02)00013-8] [Citation(s) in RCA: 182] [Impact Index Per Article: 8.3] [Indexed: 10/27/2022]
Abstract
The recent explosion in available bacterial genome sequences has created a need to annotate important sequence and structural elements in a fast, efficient and accurate manner. In particular, small non-coding RNAs (sRNAs) have been difficult to predict. The sRNAs play a number of important structural, catalytic and regulatory roles in the cell. Although a few groups have recently published prediction methods for annotating sRNAs in bacterial genomes, much remains to be done in this field. Toward the goal of developing an efficient method for predicting unknown sRNA genes in the completed Escherichia coli genome, we adopted a bioinformatics approach to search for DNA regions that contain a sigma70 promoter within a short distance of a rho-independent terminator. Among a total of 227 candidate sRNA genes initially identified, 32 were previously described sRNAs, orphan tRNAs, and partial tRNA and rRNA operons. Fifty-one are mRNA genes encoding annotated, extremely small open reading frames (ORFs) following an acceptable ribosome binding site. One hundred forty-four are potentially novel non-translatable sRNA genes. Using total RNA isolated from E. coli MG1655 cells grown under four different conditions, we verified transcripts of some of the genes by Northern hybridization. Here we summarize our data and discuss the rules and advantages/disadvantages of using this approach in annotating sRNA genes on bacterial genomes.
Affiliation(s)
- Shuo Chen
- Ibis Therapeutics, Isis Pharmaceuticals, Inc, 2292 Faraday Ave, Carlsbad, CA 92008, USA.
|
26
|
Abstract
Microarray technologies for measuring mRNA abundances in cells allow monitoring of gene expression levels for tens of thousands of genes in parallel. By measuring expression responses across hundreds of different conditions or timepoints, a relatively detailed gene expression map starts to emerge. Using cluster analysis techniques, it is possible to identify genes that are consistently coexpressed under several different conditions or treatments. These sets of coexpressed genes can then be compared to existing knowledge about biochemical or signalling pathways, and the function of unknown genes can be hypothesised by comparing them to other genes with characterised function, or from general trends in expression profiles, that is, why the cell needs to transcribe or silence the genes during a particular treatment. The regulation of genes at the DNA level is largely guided by particular sequence features, the transcription factor binding sites, and other signals encoded in DNA. By analyzing the regulatory regions of the DNA of consistently coexpressed genes, we can discover the potential signals hidden in DNA by computational analysis methods. The prerequisite for this kind of analysis is the existence of a genomic DNA sequence, knowledge about gene locations, and experimental gene expression measurements for a variety of conditions. This article surveys some of the analysis methods and studies for such a computational discovery approach for the yeast Saccharomyces cerevisiae.
Affiliation(s)
- J Vilo
- European Bioinformatics Institute EBI, EMBL Outstation - Hinxton, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
|
27
|
Marsan L, Sagot MF. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2001; 7:345-62. [PMID: 11108467 DOI: 10.1089/106652700750050826] [Citation(s) in RCA: 171] [Impact Index Per Article: 7.4] [Indexed: 11/12/2022] Open
Abstract
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes, that is, the motifs themselves, are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N²n, where n is the average length of the sequences and N their number. An application to the identification of promoter and regulatory consensus sequences in bacterial genomes is shown.
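To make the structured-motif model concrete, the sketch below checks where a given two-box model (p = 2) occurs in a sequence, allowing a bounded number of substitutions per box and an interval of distances between the boxes. It is a brute-force illustration of the model only. It is not the suffix-tree extraction algorithm of the paper, which infers the unknown box contents rather than testing a supplied candidate. The function names and the quorum helper are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurrences(seq, boxes, max_subs, gaps):
    """Start positions where a two-box structured model occurs in seq.

    boxes: (box1, box2); max_subs: (e1, e2) substitutions allowed per box;
    gaps: (dmin, dmax), the allowed number of bases between the end of box1
    and the start of box2.
    """
    (b1, b2), (e1, e2), (dmin, dmax) = boxes, max_subs, gaps
    hits = []
    for i in range(len(seq) - len(b1) + 1):
        if hamming(seq[i:i + len(b1)], b1) > e1:
            continue
        for d in range(dmin, dmax + 1):
            j = i + len(b1) + d
            if j + len(b2) > len(seq):
                break
            if hamming(seq[j:j + len(b2)], b2) <= e2:
                hits.append(i)
                break
    return hits


def satisfies_quorum(seqs, boxes, max_subs, gaps, quorum):
    """True if the model occurs in at least `quorum` of the input sequences."""
    found = sum(bool(occurrences(s, boxes, max_subs, gaps)) for s in seqs)
    return found >= quorum
```

As a usage example, the textbook E. coli sigma70 consensus boxes `("TTGACA", "TATAAT")` with `gaps=(16, 18)` can be tested against a set of upstream regions. The choice of these boxes and this spacing is an illustration and is not a result reported in the abstract.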
Affiliation(s)
- L Marsan
- Institut Gaspard Monge, Université de Marne la Vallée 5
|