51
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Indexed: 11/20/2022]
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
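The notion of motif stringency above can be illustrated with a position weight matrix. The following is a minimal sketch, not taken from the review: the count matrix, the uniform background, the pseudocount, and the function names are all hypothetical, and stringency is assumed here to mean the log-odds score of a site rescaled between the worst and best possible site.

```python
import math

# Hypothetical base counts for a 4-bp binding site, one dict per position.
COUNTS = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 1, "G": 8, "T": 0},
    {"A": 0, "C": 0, "G": 1, "T": 9},
]
# Assumed uniform background and pseudocount (illustrative choices only).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
PSEUDOCOUNT = 0.5


def log_odds_matrix(counts, background, pseudocount):
    """Convert base counts into a log2-odds position weight matrix."""
    matrix = []
    for column in counts:
        total = sum(column.values()) + pseudocount * len(column)
        matrix.append({
            base: math.log2(((column[base] + pseudocount) / total) / background[base])
            for base in column
        })
    return matrix


def site_score(site, matrix):
    """Sum the per-position log-odds of a candidate site."""
    if len(site) != len(matrix):
        raise ValueError("site length must match the matrix width")
    return sum(column[base] for base, column in zip(site, matrix))


def stringency(site, matrix):
    """Rescale the site score to [0, 1]: 1 for the optimal site, 0 for the worst."""
    best = sum(max(column.values()) for column in matrix)
    worst = sum(min(column.values()) for column in matrix)
    return (site_score(site, matrix) - worst) / (best - worst)


PWM = log_odds_matrix(COUNTS, BACKGROUND, PSEUDOCOUNT)
```

With this toy matrix, `stringency("ACGT", PWM)` scores the consensus site, and a site with one mismatch such as `"ACGA"` scores lower.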
Affiliation(s)
- Sacha A F T van Hijum
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
|
52
|
Fan D, Bitterman PB, Larsson O. Regulatory element identification in subsets of transcripts: comparison and integration of current computational methods. RNA (New York, N.Y.) 2009; 15:1469-82. [PMID: 19553345 PMCID: PMC2714745 DOI: 10.1261/rna.1617009] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 02/25/2009] [Accepted: 05/20/2009] [Indexed: 05/20/2023]
Abstract
Regulatory elements in mRNA play an often pivotal role in post-transcriptional regulation of gene expression. However, a systematic approach to efficiently identify putative regulatory elements from sets of post-transcriptionally coregulated genes is lacking, hampering studies of coregulation mechanisms. Although there are several analytical methods that can be used to detect conserved mRNA regulatory elements in a set of transcripts, there has been no systematic study of how well any of these methods perform individually or as a group. We therefore compared how well three algorithms, each based on a different principle (enumeration, optimization, or structure/sequence profiles), can identify elements in unaligned untranslated sequence regions. Two algorithms were originally designed to detect transcription factor binding sites, Weeder and BioProspector; and one was designed to detect RNA elements conserved in structure, RNAProfile. Three types of elements were examined: (1) elements conserved in both primary sequence and secondary structure; (2) elements conserved only in primary sequence; and (3) microRNA targets. Our results indicate that all methods can uniquely identify certain known RNA elements, and therefore, integrating the output from all algorithms leads to the most complete identification of elements. We therefore developed an approach to integrate results and guide selection of candidate elements from several algorithms presented as a web service (https://dbw.msi.umn.edu:8443/recit). These findings together with the approach for integration can be used to identify candidate elements from genome-wide post-transcriptional profiling data sets.
Affiliation(s)
- Danhua Fan
- Department of Medicine, University of Minnesota, Minneapolis, Minnesota 55455, USA
|
53
|
Abstract
Motivation: The motif discovery problem consists of finding over-represented patterns in a collection of biosequences. It is one of the classical sequence analysis problems, but still has not been satisfactorily solved in an exact and efficient manner. This is partly due to the large number of possibilities of defining the motif search space and the notion of over-representation. Even for well-defined formalizations, the problem is frequently solved in an ad hoc manner with heuristics that do not guarantee to find the best motif. Results: We show how to solve the motif discovery problem (almost) exactly on a practically relevant space of IUPAC generalized string patterns, using the p-value with respect to an i.i.d. model or a Markov model as the measure of over-representation. In particular, (i) we use a highly accurate compound Poisson approximation for the null distribution of the number of motif occurrences. We show how to compute the exact clump size distribution using a recently introduced device called probabilistic arithmetic automaton (PAA). (ii) We define two p-value scores for over-representation, the first one based on the total number of motif occurrences, the second one based on the number of sequences in a collection with at least one occurrence. (iii) We describe an algorithm to discover the optimal pattern with respect to either of the scores. The method exploits monotonicity properties of the compound Poisson approximation and is by orders of magnitude faster than exhaustive enumeration of IUPAC strings (11.8 h compared with an extrapolated runtime of 4.8 years). (iv) We justify the use of the proposed scores for motif discovery by showing our method to outperform other motif discovery algorithms (e.g. MEME, Weeder) on benchmark datasets. We also propose new motifs on Mycobacterium tuberculosis. Availability and Implementation: The method has been implemented in Java. 
It can be obtained from http://ls11-www.cs.tu-dortmund.de/people/marschal/paa_md/ Contact: tobias.marschall@tu-dortmund.de; sven.rahmann@tu-dortmund.de
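The second score described above, based on the number of sequences with at least one occurrence, can be illustrated with a simplified calculation. This is a sketch under stated assumptions, not the authors' Java implementation: it assumes all sequences have the same length, that the number of clumps per sequence is Poisson with a known mean, and that sequences are independent, so the count of sequences with a hit is binomial. All function names are hypothetical.

```python
import math


def prob_at_least_one(expected_clumps):
    """P(sequence contains the motif) if the clump count is Poisson (assumption)."""
    return 1.0 - math.exp(-expected_clumps)


def binomial_tail(n, k, q):
    """P(X >= k) for X ~ Binomial(n, q)."""
    return sum(
        math.comb(n, i) * q**i * (1.0 - q) ** (n - i)
        for i in range(k, n + 1)
    )


def sequence_count_pvalue(n_sequences, n_hits, expected_clumps):
    """p-value for seeing the motif in at least n_hits of n_sequences sequences."""
    q = prob_at_least_one(expected_clumps)
    return binomial_tail(n_sequences, n_hits, q)
```

For example, `sequence_count_pvalue(20, 15, 0.1)` asks how surprising it is for a motif expected in roughly one sequence in ten to appear in 15 of 20 sequences.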
Affiliation(s)
- Tobias Marschall
- Computer Science Department, Bioinformatics for High-Throughput Technologies at the Chair of Algorithm Engineering, TU Dortmund, Dortmund, Germany.
|
54
|
Chang DTH, Chien TY, Chen CY. seeMotif: exploring and visualizing sequence motifs in 3D structures. Nucleic Acids Res 2009; 37:W552-8. [PMID: 19477961 PMCID: PMC2703912 DOI: 10.1093/nar/gkp439] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 03/04/2009] [Revised: 04/23/2009] [Accepted: 05/11/2009] [Indexed: 12/17/2022]
Abstract
Sequence motifs are important in the study of molecular biology. Motif discovery tools efficiently deliver many function related signatures of proteins and largely facilitate sequence annotation. As increasing numbers of motifs are detected experimentally or predicted computationally, characterizing the functional roles of motifs and identifying the potential synergetic relationships between them are important next steps. A good way to investigate novel motifs is to utilize the abundant 3D structures that have also been accumulated at an astounding rate in recent years. This article reports the development of the web service seeMotif, which provides users with an interactive interface for visualizing sequence motifs on protein structures from the Protein Data Bank (PDB). Researchers can quickly see the locations and conformation of multiple motifs among a number of related structures simultaneously. Considering the fact that PDB sequences are usually shorter than those in sequence databases and/or may have missing residues, seeMotif has two complementary approaches for selecting structures and mapping motifs to protein chains in structures. As more and more structures belonging to previously uncharacterized protein families become available, combining sequence and structure information gives good opportunities to facilitate understanding of protein functions in large-scale genome projects. Available at: http://seemotif.csie.ntu.edu.tw, http://seemotif.ee.ncku.edu.tw or http://seemotif.csbb.ntu.edu.tw.
Affiliation(s)
- Darby Tien-Hao Chang
- Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Department of Computer Science and Information Engineering and Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan, R.O.C.
- Ting-Ying Chien
- Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Department of Computer Science and Information Engineering and Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan, R.O.C.
- Chien-Yu Chen
- Department of Electrical Engineering, National Cheng Kung University, Tainan 70101, Department of Computer Science and Information Engineering and Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 10617, Taiwan, R.O.C.
|
55
|
Abstract
MOTIVATION Identifying transcription factor binding sites (TFBSs) encoding complex regulatory signals in metazoan genomes remains a challenging problem in computational genomics. Due to degeneracy of nucleotide content among binding site instances or motifs, and intricate 'grammatical organization' of motifs within cis-regulatory modules (CRMs), extant pattern matching-based in silico motif search methods often suffer from impractically high false positive rates, especially in the context of analyzing large genomic datasets, and noisy position weight matrices which characterize binding sites. Here, we try to address this problem by using a framework to maximally utilize the information content of the genomic DNA in the region of query, taking cues from values of various biologically meaningful genetic and epigenetic factors in the query region such as clade-specific evolutionary parameters, presence/absence of nearby coding regions, etc. We present a new method for TFBS prediction in metazoan genomes that utilizes both the CRM architecture of sequences and a variety of features of individual motifs. Our proposed approach is based on a discriminative probabilistic model known as conditional random fields that explicitly optimizes the predictive probability of motif presence in large sequences, based on the joint effect of all such features. RESULTS This model overcomes weaknesses in earlier methods based on less effective statistical formalisms that are sensitive to spurious signals in the data. We evaluate our method on both simulated CRMs and real Drosophila sequences in comparison with a wide spectrum of existing models, and outperform the state of the art by 22% in F1 score. AVAILABILITY AND IMPLEMENTATION The code is publicly available at http://www.sailing.cs.cmu.edu/discover.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenjie Fu
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
56
|
Tang MHE, Krogh A, Winther O. BayesMD: flexible biological modeling for motif discovery. J Comput Biol 2009; 15:1347-63. [PMID: 19040368 DOI: 10.1089/cmb.2007.0176] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
We present BayesMD, a Bayesian Motif Discovery model with several new features. Three different types of biological a priori knowledge are built into the framework in a modular fashion. A mixture of Dirichlets is used as prior over nucleotide probabilities in binding sites. It is trained on transcription factor (TF) databases in order to extract the typical properties of TF binding sites. In a similar fashion we train organism-specific priors for the background sequences. Lastly, we use a prior over the position of binding sites. This prior represents information complementary to the motif and background priors coming from conservation, local sequence complexity, nucleosome occupancy, etc. and assumptions about the number of occurrences. The Bayesian inference is carried out using a combination of exact marginalization (multinomial parameters) and sampling (over the position of sites). Robust sampling results are achieved using the advanced sampling method parallel tempering. In a post-analysis step candidate motifs with high marginal probability are found by searching among those motifs that contain sites that occur frequently. Thereby, maximum a posteriori inference for the motifs is avoided and the marginal probabilities can be used directly to assess the significance of the findings. The framework is benchmarked against other methods on a number of real and artificial data sets. The accompanying prediction server, documentation, software, models and data are available from http://bayesmd.binf.ku.dk/.
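The exact marginalization over multinomial parameters mentioned above has a closed form under a Dirichlet prior. The sketch below shows that standard Dirichlet-multinomial identity for a single motif column and its extension to a mixture of Dirichlets; it is not the BayesMD code, and the function names, mixture weights, and hyper-parameters are hypothetical.

```python
import math


def log_dirichlet_marginal(counts, alpha):
    """Log probability of the bases observed in one motif column, with the
    multinomial parameters integrated out under a Dirichlet(alpha) prior."""
    total_alpha = sum(alpha)
    total_count = sum(counts)
    result = math.lgamma(total_alpha) - math.lgamma(total_alpha + total_count)
    for n, a in zip(counts, alpha):
        result += math.lgamma(a + n) - math.lgamma(a)
    return result


def log_mixture_marginal(counts, weights, alphas):
    """Same quantity under a mixture of Dirichlet components,
    combined with log-sum-exp for numerical stability."""
    terms = [
        math.log(w) + log_dirichlet_marginal(counts, a)
        for w, a in zip(weights, alphas)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical two-component prior: one peaked ("conserved position"), one flat.
WEIGHTS = (0.5, 0.5)
ALPHAS = ((0.1, 0.1, 0.1, 0.1), (2.0, 2.0, 2.0, 2.0))
```

Counts are ordered (A, C, G, T). A fully conserved column such as `(10, 0, 0, 0)` is better explained by the peaked component than by the flat one, which is the kind of preference a trained mixture prior would encode.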
Affiliation(s)
- Man-Hung Eric Tang
- Bioinformatics Centre, Department of Molecular Biology, University of Copenhagen, Copenhagen, Denmark
|
57
|
Zare-Mirakabad F, Ahrabian H, Sadeghi M, Hashemifar S, Nowzari-Dalini A, Goliaei B. Genetic algorithm for dyad pattern finding in DNA sequences. Genes Genet Syst 2009; 84:81-93. [DOI: 10.1266/ggs.84.81] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Indexed: 11/23/2022]
Affiliation(s)
- Fatemeh Zare-Mirakabad
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
- Hayedeh Ahrabian
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
- Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran
- Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology
- School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics (IPM)
- Somaieh Hashemifar
- Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran
- Abbas Nowzari-Dalini
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
- Center of Excellence in Biomathematics, School of Mathematics, Statistics, and Computer Science, University of Tehran
- Bahram Goliaei
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran
|
58
|
Gupta M. Model selection and sensitivity analysis for sequence pattern models. Institute of Mathematical Statistics Collections 2009; 1:390-407. [PMID: 20563269 PMCID: PMC2887058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
In this article we propose a maximal a posteriori (MAP) criterion for model selection in the motif discovery problem and investigate conditions under which the MAP asymptotically gives a correct prediction of model size. We also investigate robustness of the MAP to prior specification and provide guidelines for choosing prior hyper-parameters for motif models based on sensitivity considerations.
|
59
|
Sandve GK, Abul O, Drabløs F. Compo: composite motif discovery using discrete models. BMC Bioinformatics 2008; 9:527. [PMID: 19063744 PMCID: PMC2614996 DOI: 10.1186/1471-2105-9-527] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 05/21/2008] [Accepted: 12/08/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND Computational discovery of motifs in biomolecular sequences is an established field, with applications both in the discovery of functional sites in proteins and regulatory sites in DNA. In recent years there has been increased attention towards the discovery of composite motifs, typically occurring in cis-regulatory regions of genes. RESULTS This paper describes Compo: a discrete approach to composite motif discovery that supports richer modeling of composite motifs and a more realistic background model compared to previous methods. Furthermore, multiple parameter and threshold settings are tested automatically, and the most interesting motifs across settings are selected. This avoids reliance on single hard thresholds, which has been a weakness of previous discrete methods. Comparison of motifs across parameter settings is made possible by the use of p-values as a general significance measure. Compo can either return an ordered list of motifs, ranked according to the general significance measure, or a Pareto front corresponding to a multi-objective evaluation on sensitivity, specificity and spatial clustering. CONCLUSION Compo performs very competitively compared to several existing methods on a collection of benchmark data sets. These benchmarks include a recently published, large benchmark suite where the use of support across sequences allows Compo to correctly identify binding sites even when the relevant PWMs are mixed with a large number of noise PWMs. Furthermore, the possibility of parameter-free running offers high usability, the support for multi-objective evaluation allows a rich view of potential regulators, and the discrete model allows flexibility in modeling and interpretation of motifs.
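The Pareto front mentioned above can be illustrated generically. The following sketch is not Compo's implementation: it assumes each candidate motif has already been scored on the three objectives named in the abstract (sensitivity, specificity, spatial clustering), that higher is better on each, and the motif names and scores are invented for illustration.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates.

    candidates maps a motif name to a tuple of objective values."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    }


# Hypothetical scores: (sensitivity, specificity, spatial clustering).
MOTIFS = {
    "motif_1": (0.9, 0.5, 0.4),
    "motif_2": (0.6, 0.8, 0.7),
    "motif_3": (0.5, 0.5, 0.3),
}
FRONT = pareto_front(MOTIFS)
```

Here `motif_3` is dominated by `motif_1` and drops out, while `motif_1` and `motif_2` each win on a different objective and both stay on the front. A ranked list, by contrast, would need a single combined score such as the p-value the abstract describes.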
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway.
|
60
|
Lintner RE, Mishra PK, Srivastava P, Martinez-Vaz BM, Khodursky AB, Blumenthal RM. Limited functional conservation of a global regulator among related bacterial genera: Lrp in Escherichia, Proteus and Vibrio. BMC Microbiol 2008; 8:60. [PMID: 18405378 PMCID: PMC2374795 DOI: 10.1186/1471-2180-8-60] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.8] [Received: 11/16/2007] [Accepted: 04/11/2008] [Indexed: 02/03/2023]
Abstract
Background Bacterial genome sequences are being determined rapidly, but few species are physiologically well characterized. Predicting regulation from genome sequences usually involves extrapolation from better-studied bacteria, using the hypothesis that a conserved regulator, conserved target gene, and predicted regulator-binding site in the target promoter imply conserved regulation between the two species. However many compared organisms are ecologically and physiologically diverse, and the limits of extrapolation have not been well tested. In E. coli K-12 the leucine-responsive regulatory protein (Lrp) affects expression of ~400 genes. Proteus mirabilis and Vibrio cholerae have highly-conserved lrp orthologs (98% and 92% identity to E. coli lrp). The functional equivalence of Lrp from these related species was assessed. Results Heterologous Lrp regulated gltB, livK and lrp transcriptional fusions in an E. coli background in the same general way as the native Lrp, though with significant differences in extent. Microarray analysis of these strains revealed that the heterologous Lrp proteins significantly influence only about half of the genes affected by native Lrp. In P. mirabilis, heterologous Lrp restored swarming, though with some pattern differences. P. mirabilis produced substantially more Lrp than E. coli or V. cholerae under some conditions. Lrp regulation of target gene orthologs differed among the three native hosts. Strikingly, while Lrp negatively regulates its own gene in E. coli, and was shown to do so even more strongly in P. mirabilis, Lrp appears to activate its own gene in V. cholerae. Conclusion The overall similarity of regulatory effects of the Lrp orthologs supports the use of extrapolation between related strains for general purposes. 
However this study also revealed intrinsic differences even between orthologous regulators sharing >90% overall identity, and 100% identity for the DNA-binding helix-turn-helix motif, as well as differences in the amounts of those regulators. These results suggest that predicting regulation of specific target genes based on genome sequence comparisons alone should be done on a conservative basis.
Affiliation(s)
- Robert E Lintner
- Department of Medical Microbiology and Immunology, University of Toledo Health Sciences Center, Toledo, OH 43614-2598, USA.
|
61
|
Ausiello G, Gherardini PF, Marcatili P, Tramontano A, Via A, Helmer-Citterich M. FunClust: a web server for the identification of structural motifs in a set of non-homologous protein structures. BMC Bioinformatics 2008; 9 Suppl 2:S2. [PMID: 18387204 PMCID: PMC2323665 DOI: 10.1186/1471-2105-9-s2-s2] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]
Abstract
BACKGROUND The occurrence of very similar structural motifs brought about by different parts of non homologous proteins is often indicative of a common function. Indeed, relatively small local structures can mediate binding to a common partner, be it a protein, a nucleic acid, a cofactor or a substrate. While it is relatively easy to identify short amino acid or nucleotide sequence motifs in a given set of proteins or genes, and many methods do exist for this purpose, much more challenging is the identification of common local substructures, especially if they are formed by non consecutive residues in the sequence. RESULTS Here we describe a publicly available tool, able to identify common structural motifs shared by different non homologous proteins in an unsupervised mode. The motifs can be as short as three residues and need not to be contiguous or even present in the same order in the sequence. Users can submit a set of protein structures deemed or not to share a common function (e.g. they bind similar ligands, or share a common epitope). The server finds and lists structural motifs composed of three or more spatially well conserved residues shared by at least three of the submitted structures. The method uses a local structural comparison algorithm to identify subsets of similar amino acids between each pair of input protein chains and a clustering procedure to group similarities shared among different structure pairs. CONCLUSIONS FunClust is fast, completely sequence independent, and does not need an a priori knowledge of the motif to be found. The output consists of a list of aligned structural matches displayed in both tabular and graphical form. We show here examples of its usefulness by searching for the largest common structural motifs in test sets of non homologous proteins and showing that the identified motifs correspond to a known common functional feature.
Affiliation(s)
- Gabriele Ausiello
- Centre for Molecular Bioinformatics, Department of Biology, University of Rome "Tor Vergata", Rome, Italy.
|
62
|
Klepper K, Sandve GK, Abul O, Johansen J, Drabløs F. Assessment of composite motif discovery methods. BMC Bioinformatics 2008; 9:123. [PMID: 18302777 PMCID: PMC2311304 DOI: 10.1186/1471-2105-9-123] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 07/26/2007] [Accepted: 02/26/2008] [Indexed: 12/26/2022]
Abstract
Background Computational discovery of regulatory elements is an important area of bioinformatics research and more than a hundred motif discovery methods have been published. Traditionally, most of these methods have addressed the problem of single motif discovery – discovering binding motifs for individual transcription factors. In higher organisms, however, transcription factors usually act in combination with nearby bound factors to induce specific regulatory behaviours. Hence, recent focus has shifted from single motifs to the discovery of sets of motifs bound by multiple cooperating transcription factors, so called composite motifs or cis-regulatory modules. Given the large number and diversity of methods available, independent assessment of methods becomes important. Although there have been several benchmark studies of single motif discovery, no similar studies have previously been conducted concerning composite motif discovery. Results We have developed a benchmarking framework for composite motif discovery and used it to evaluate the performance of eight published module discovery tools. Benchmark datasets were constructed based on real genomic sequences containing experimentally verified regulatory modules, and the module discovery programs were asked to predict both the locations of these modules and to specify the single motifs involved. To aid the programs in their search, we provided position weight matrices corresponding to the binding motifs of the transcription factors involved. In addition, selections of decoy matrices were mixed with the genuine matrices on one dataset to test the response of programs to varying levels of noise. Conclusion Although some of the methods tested tended to score somewhat better than others overall, there were still large variations between individual datasets and no single method performed consistently better than the rest in all situations. 
The variation in performance on individual datasets also shows that the new benchmark datasets represent a suitable variety of challenges to most methods for module discovery.
Affiliation(s)
- Kjetil Klepper
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.
|
63
|
|
64
|
Ferreira PG, Azevedo PJ. Evaluating deterministic motif significance measures in protein databases. Algorithms Mol Biol 2007; 2:16. [PMID: 18157916 PMCID: PMC2254621 DOI: 10.1186/1748-7188-2-16] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 05/15/2007] [Accepted: 12/24/2007] [Indexed: 11/13/2022]
Abstract
BACKGROUND Assessing the outcome of motif mining algorithms is an essential task, as the number of reported motifs can be very large. Significance measures play a central role in automatically ranking those motifs, and therefore alleviating the analysis work. Spotting the most interesting and relevant motifs is then dependent on the choice of the right measures. The combined use of several measures may provide more robust results. However caution has to be taken in order to avoid spurious evaluations. RESULTS From the set of conducted experiments, it was verified that several of the selected significance measures show a very similar behavior in a wide range of situations therefore providing redundant information. Some measures have proved to be more appropriate to rank highly conserved motifs, while others are more appropriate for weakly conserved ones. Support appears as a very important feature to be considered for correct motif ranking. We observed that not all the measures are suitable for situations with poorly balanced class information, like for instance, when positive data is significantly less than negative data. Finally, a visualization scheme was proposed that, when several measures are applied, enables an easy identification of high scoring motifs. CONCLUSION In this work we have surveyed and categorized 14 significance measures for pattern evaluation. Their ability to rank three types of deterministic motifs was evaluated. Measures were applied in different testing conditions, where relations were identified. This study provides some pertinent insights on the choice of the right set of significance measures for the evaluation of deterministic motifs extracted from protein databases.
Affiliation(s)
- Pedro Gabriel Ferreira
- Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
- Paulo J Azevedo
- Department of Informatics, University of Minho, Campus de Gualtar, 4710-057 Braga, Portugal
|
65
|
Danko CG, McIlvain VA, Qin M, Knox BE, Pertsov AM. Bioinformatic identification of novel putative photoreceptor specific cis-elements. BMC Bioinformatics 2007; 8:407. [PMID: 17953763 PMCID: PMC2225425 DOI: 10.1186/1471-2105-8-407] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 03/08/2007] [Accepted: 10/22/2007] [Indexed: 11/10/2022]
Abstract
Background Cell specific gene expression is largely regulated by different combinations of transcription factors that bind cis-elements in the upstream promoter sequence. However, experimental detection of cis-elements is difficult, expensive, and time-consuming. This provides a motivation for developing bioinformatic methods to identify cis-elements that could prioritize future experimental studies. Here, we use motif discovery algorithms to predict transcription factor binding sites involved in regulating the differences between murine rod and cone photoreceptor populations. Results To identify highly conserved motifs enriched in promoters that drive expression in either rod or cone photoreceptors, we assembled a set of murine rod-specific, cone-specific, and non-photoreceptor background promoter sequences. These sets were used as input to a newly devised motif discovery algorithm called Iterative Alignment/Modular Motif Selection (IAMMS). Using IAMMS, we predicted 34 motifs that may contribute to rod-specific (19 motifs) or cone-specific (15 motifs) expression patterns. Of these, 16 rod- and 12 cone-specific motifs were found in clusters near the transcription start site. New findings include the observation that cone promoters tend to contain TATA boxes, while rod promoters tend to be TATA-less (exempting Rho and Cnga1). Additionally, we identify putative sites for IL-6 effectors (in rods) and RXR family members (in cones) that can explain experimental data showing changes to cell-fate by activating these signaling pathways during rod/cone development. Two of the predicted motifs (NRE and ROP2) have been confirmed experimentally to be involved in cell-specific expression patterns. We provide a full database of predictions as additional data that may contain further valuable information. IAMMS predictions are compared with existing motif discovery algorithms, DME and BioProspector. 
We find that over 60% of IAMMS predictions are confirmed by at least one other motif discovery algorithm. Conclusion We predict novel, putative cis-elements enriched in the promoter of rod-specific or cone-specific genes. These are candidate binding sites for transcription factors involved in maintaining functional differences between rod and cone photoreceptor populations.
Affiliation(s)
- Charles G Danko
- Department of Pharmacology, SUNY Upstate Medical University, Syracuse, NY, USA.
|
66
|
Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules. Algorithms Mol Biol 2007; 2:13. [PMID: 17927813 PMCID: PMC2174486 DOI: 10.1186/1748-7188-2-13] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 07/13/2007] [Accepted: 10/10/2007] [Indexed: 11/15/2022] Open
Abstract
Background cis-Regulatory modules (CRMs) of eukaryotic genes often contain multiple binding sites for transcription factors. The phenomenon that binding sites form clusters in CRMs is exploited in many algorithms to locate CRMs in a genome. This gives rise to the problem of calculating the statistical significance of the event that multiple sites, recognized by different factors, would be found simultaneously in a text of a fixed length. The main difficulty comes from overlapping occurrences of motifs. So far, no tools have been developed allowing the computation of p-values for simultaneous occurrences of different motifs which can overlap. Results We developed and implemented an algorithm computing the p-value that $s$ different motifs occur respectively $k_1, \ldots, k_s$ or more times, possibly overlapping, in a random text. Motifs can be represented with a majority of popular motif models, but in all cases, without indels. Zero or first order Markov chains can be adopted as a model for the random text. The computational tool was tested on the set of cis-regulatory modules involved in D. melanogaster early development, for which there exists an annotation of binding sites for transcription factors. Our test allowed us to correctly identify transcription factors cooperatively/competitively binding to DNA. Method The algorithm that precisely computes the probability of simultaneous motif occurrences is inspired by the Aho-Corasick automaton and employs a prefix tree together with a transition function. The algorithm runs with $O\left(n|\Sigma|\,(m|\mathcal{H}| + K|\sigma|^{K}) \prod_i k_i\right)$ time complexity, where $n$ is the length of the text, $|\Sigma|$ is the alphabet size, $m$ is the maximal motif length, $|\mathcal{H}|$ is the total number of words in motifs, $K$ is the order of the Markov model, and $k_i$ is the number of occurrences of the $i$th motif. Conclusion The primary objective of the program is to assess the likelihood that a given DNA segment is a CRM regulated by a known set of regulatory factors. In addition, the program can also be used to select the appropriate threshold for PWM scanning. Another application is assessing similarity of different motifs. Availability Project web page, stand-alone version and documentation can be found at
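The automaton-based computation described in this abstract can be illustrated with a much reduced sketch. It is not the authors' implementation: it handles a single motif (a set of words) rather than $s$ motifs, assumes a zero-order (i.i.d.) text model, and the function names `build_automaton` and `p_at_least` are hypothetical. It builds an Aho-Corasick automaton over the motif words and propagates a probability distribution over (automaton state, occurrence count) pairs, with the count capped at $k$.

```python
from collections import deque


def build_automaton(words, alphabet):
    """Aho-Corasick automaton for a set of words.

    Returns (trans, hits): trans[state][letter] is the next state, and
    hits[state] is the number of words ending at the current text position
    when the automaton is in that state (suffix links included).
    """
    goto = [{}]   # trie edges
    hits = [0]
    for w in set(words):              # duplicates would be double counted
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                hits.append(0)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        hits[s] += 1

    fail = [0] * len(goto)
    trans = [dict() for _ in goto]
    queue = deque()
    for ch in alphabet:               # complete the root's transitions
        if ch in goto[0]:
            trans[0][ch] = goto[0][ch]
            queue.append(goto[0][ch])
        else:
            trans[0][ch] = 0
    while queue:                      # breadth-first: shallower states first
        s = queue.popleft()
        hits[s] += hits[fail[s]]      # words that end here as proper suffixes
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = trans[fail[s]][ch]
                trans[s][ch] = t
                queue.append(t)
            else:
                trans[s][ch] = trans[fail[s]][ch]
    return trans, hits


def p_at_least(words, k, n, probs):
    """P(motif words occur >= k times, overlaps allowed) in an i.i.d.
    random text of length n with letter probabilities `probs`."""
    alphabet = sorted(probs)
    trans, hits = build_automaton(words, alphabet)
    # dist[state][c]: probability of being in `state` having seen c
    # occurrences so far, where c is capped at k.
    dist = [[0.0] * (k + 1) for _ in trans]
    dist[0][0] = 1.0
    for _ in range(n):
        new = [[0.0] * (k + 1) for _ in trans]
        for s, row in enumerate(dist):
            for c, p in enumerate(row):
                if p == 0.0:
                    continue
                for ch in alphabet:
                    t = trans[s][ch]
                    new[t][min(k, c + hits[t])] += p * probs[ch]
        dist = new
    return sum(row[k] for row in dist)
```

With uniform letter probabilities, `p_at_least(["AA"], 1, 3, {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})` gives 7/64, since exactly seven of the 64 texts of length 3 contain `AA`. Extending the sketch to several motifs would require one capped counter per motif, which is where the $\prod_i k_i$ factor in the stated complexity comes from.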
|
67
|
Lones M, Tyrrell A. Regulatory motif discovery using a population clustering evolutionary algorithm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:403-414. [PMID: 17666760 DOI: 10.1109/tcbb.2007.1044] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 05/16/2023]
Abstract
This paper describes a novel evolutionary algorithm for regulatory motif discovery in DNA promoter sequences. The algorithm uses data clustering to logically distribute the evolving population across the search space. Mating then takes place within local regions of the population, promoting overall solution diversity and encouraging discovery of multiple solutions. Experiments using synthetic data sets have demonstrated the algorithm's capacity to find position frequency matrix models of known regulatory motifs in relatively long promoter sequences. These experiments have also shown the algorithm's ability to maintain diversity during search and discover multiple motifs within a single population. The utility of the algorithm for discovering motifs in real biological data is demonstrated by its ability to find meaningful motifs within muscle-specific regulatory sequences.
|
68
|
Thompson WA, Newberg LA, Conlan S, McCue LA, Lawrence CE. The Gibbs Centroid Sampler. Nucleic Acids Res 2007; 35:W232-7. [PMID: 17483517 PMCID: PMC1933196 DOI: 10.1093/nar/gkm265] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.8] [Received: 01/31/2007] [Revised: 03/27/2007] [Accepted: 04/08/2007] [Indexed: 11/25/2022] Open
Abstract
The Gibbs Centroid Sampler is a software package designed for locating conserved elements in biopolymer sequences. The Gibbs Centroid Sampler reports a centroid alignment, i.e. an alignment that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding-site alignments. In so doing, it garners information from the full ensemble of solutions, rather than only the single most probable point that is the target of many motif-finding algorithms, including its predecessor, the Gibbs Recursive Sampler. Centroid estimators have been shown to yield substantial improvements, in both sensitivity and positive predictive values, to the prediction of RNA secondary structure and motif finding. The Gibbs Centroid Sampler, along with interactive tutorials, an online user manual, and information on downloading the software, is available at: http://bayesweb.wadsworth.org/gibbs/gibbs.html.
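The idea of a centroid alignment, the alignment with minimum total distance to a set of sampled alignments, can be sketched as follows. This is an illustration only, not the Gibbs Centroid Sampler's code. It assumes each sample is represented as a set of `(sequence_id, start_position)` site calls, that distance is the size of the symmetric difference between two such sets, and that no further constraints (such as forbidding overlapping sites) apply. Under those assumptions, keeping every site that appears in more than half of the samples minimizes the total distance. The names `centroid_sites` and `total_distance` are hypothetical.

```python
from collections import Counter


def centroid_sites(samples):
    """Return the set of sites present in more than half of the samples.

    Each sample is an iterable of hashable site identifiers, for example
    (sequence_id, start_position) tuples.
    """
    counts = Counter(site for sample in samples for site in set(sample))
    half = len(samples) / 2
    return {site for site, c in counts.items() if c > half}


def total_distance(candidate, samples):
    """Sum of symmetric-difference distances from candidate to each sample."""
    candidate = set(candidate)
    return sum(len(candidate ^ set(sample)) for sample in samples)


# Three sampled alignments over two sequences; they disagree on sequence s2.
samples = [
    {("s1", 3), ("s2", 7)},
    {("s1", 3), ("s2", 8)},
    {("s1", 3), ("s2", 7)},
]
centroid = centroid_sites(samples)
```

Here the centroid keeps `("s1", 3)` and `("s2", 7)`, and its total distance to the three samples is 2, lower than that of the minority alignment. The point of the design is that the estimate draws on the whole ensemble of samples rather than on the single most probable one.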
Affiliation(s)
- William A Thompson
- Center for Computational Molecular Biology and the Division of Applied Mathematics, Brown University, Providence, RI 02912, USA.
|
69
|
Sandve GK, Abul O, Walseng V, Drabløs F. Improved benchmarks for computational motif discovery. BMC Bioinformatics 2007; 8:193. [PMID: 17559676 PMCID: PMC1903367 DOI: 10.1186/1471-2105-8-193] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 11/09/2006] [Accepted: 06/08/2007] [Indexed: 12/03/2022] Open
Abstract
Background An important step in annotation of sequenced genomes is the identification of transcription factor binding sites. More than a hundred different computational methods have been proposed, and it is difficult to make an informed choice. Therefore, robust assessment of motif discovery methods becomes important, both for validation of existing tools and for identification of promising directions for future research. Results We use a machine learning perspective to analyze collections of transcription factors with known binding sites. Algorithms are presented for finding position weight matrices (PWMs), IUPAC-type motifs and mismatch motifs with optimal discrimination of binding sites from remaining sequence. We show that for many data sets in a recently proposed benchmark suite for motif discovery, none of the common motif models can accurately discriminate the binding sites from remaining sequence. This may obscure the distinction between the potential performance of the motif discovery tool itself versus the intrinsic complexity of the problem we are trying to solve. Synthetic data sets may avoid this problem, but we show on some previously proposed benchmarks that there may be a strong bias towards a presupposed motif model. We also propose a new approach to benchmark data set construction. This approach is based on collections of binding site fragments that are ranked according to the optimal level of discrimination achieved with our algorithms. This allows us to select subsets with specific properties. We present one benchmark suite with data sets that allow good discrimination between positive and negative instances with the common motif models. These data sets are suitable for evaluating algorithms for motif discovery that rely on these models. We present another benchmark suite where PWM, IUPAC and mismatch motif models are not able to discriminate reliably between positive and negative instances. This suite could be used for evaluating more powerful motif models. Conclusion Our improved benchmark suites have been designed to differentiate between the performance of motif discovery algorithms and the power of motif models. We provide a web server where users can download our benchmark suites, submit predictions and visualize scores on the benchmarks.
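To make concrete what it means for a PWM to discriminate binding sites from the remaining sequence, the following minimal sketch builds a log-odds PWM from aligned sites and scans a sequence for its best-scoring window. It is a generic illustration, not the optimal-discrimination algorithm of this paper; the pseudocount, the uniform background, and the function names `build_pwm`, `score`, and `best_hit` are assumptions made for the example.

```python
import math


def build_pwm(sites, alphabet="ACGT", pseudocount=1.0, background=None):
    """Log-odds position weight matrix from equal-length aligned sites."""
    background = background or {a: 1.0 / len(alphabet) for a in alphabet}
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    pwm = []
    for j in range(width):
        column = [site[j] for site in sites]
        pwm.append({
            a: math.log2((column.count(a) + pseudocount) / total / background[a])
            for a in alphabet
        })
    return pwm


def score(pwm, word):
    """Sum of per-position log-odds for a word as long as the PWM."""
    return sum(col[ch] for col, ch in zip(pwm, word))


def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pwm)
    return max(
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


pwm = build_pwm(["ACGT", "ACGA", "ACGT"])
hit_score, hit_start = best_hit(pwm, "GGACGTGG")
```

In this toy case the best window starts at index 2, where `ACGT` sits. A site counts as discriminated when its score clears a threshold that background windows rarely reach; the benchmark question raised in the abstract is whether any such threshold exists for a given data set and motif model.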
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Osman Abul
- Department of Computer Engineering, TOBB University of Economics and Technology, Ankara, Turkey
- Vegard Walseng
- Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Trondheim, Norway
|
70
|
Carlson JM, Chakravarty A, DeZiel CE, Gross RH. SCOPE: a web server for practical de novo motif discovery. Nucleic Acids Res 2007; 35:W259-64. [PMID: 17485471 PMCID: PMC1933170 DOI: 10.1093/nar/gkm310] [Citation(s) in RCA: 85] [Impact Index Per Article: 5.0] [Indexed: 11/14/2022] Open
Abstract
SCOPE is a novel parameter-free method for the de novo identification of potential regulatory motifs in sets of coordinately regulated genes. The SCOPE algorithm combines the output of three component algorithms, each designed to identify a particular class of motifs. Using an ensemble learning approach, SCOPE identifies the best candidate motifs from its component algorithms. In tests on experimentally determined datasets, SCOPE identified motifs with a significantly higher level of accuracy than a number of other web-based motif finders run with their default parameters. Because SCOPE has no adjustable parameters, the web server has an intuitive interface, requiring only a set of gene names or FASTA sequences and a choice of species. The most significant motifs found by SCOPE are displayed graphically on the main results page with a table containing summary statistics for each motif. Detailed motif information, including the sequence logo, PWM, consensus sequence and specific matching sites can be viewed through a single click on a motif. SCOPE's efficient, parameter-free search strategy has enabled the development of a web server that is readily accessible to the practising biologist while providing results that compare favorably with those of other motif finders. The SCOPE web server is at <http://genie.dartmouth.edu/scope>.
Affiliation(s)
- Jonathan M. Carlson
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Arijit Chakravarty
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Charles E. DeZiel
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- Robert H. Gross
- Department of Computer Science and Engineering, University of Washington, Seattle, WA, Department of Cancer Pharmacology, Millennium Pharmaceuticals Inc., Cambridge, MA and Department of Biological Sciences, Dartmouth College, Hanover, NH, USA
- *To whom correspondence should be addressed. +603 646 2059; +603 646 1347
|
71
|
Abnizova I, Subhankulova T, Gilks WR. Recent computational approaches to understand gene regulation: mining gene regulation in silico. Curr Genomics 2007; 8:79-91. [PMID: 18660846 PMCID: PMC2435357 DOI: 10.2174/138920207780368150] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 11/30/2006] [Revised: 12/13/2006] [Accepted: 12/15/2006] [Indexed: 01/03/2023] Open
Abstract
This paper reviews recent computational approaches to the understanding of gene regulation in eukaryotes. Cis-regulation of gene expression by the binding of transcription factors is a critical component of cellular physiology. In eukaryotes, a number of transcription factors often work together in a combinatorial fashion to enable cells to respond to a wide spectrum of environmental and developmental signals. Integration of genome sequences and/or chromatin immunoprecipitation on chip (ChIP-chip) data with gene-expression data has facilitated in silico discovery of how the combinatorics and positioning of transcription factor binding sites underlie gene activation in a variety of cellular processes. The process of gene regulation is extremely complex and intriguing; therefore, all possible points of view and related links should be carefully considered. Here we attempt to collect an inventory, not claiming it to be comprehensive and complete, of related computational biological topics covering gene regulation, which may illuminate the process, and briefly review what is currently occurring in these areas. We will consider the following computational areas:
- gene regulatory network construction;
- evolution of regulatory DNA;
- studies of its structural and statistical informational properties;
- and finally, regulatory RNA.
Affiliation(s)
- T Subhankulova
- Wellcome Trust/Cancer Research UK Gurdon Institute of Cancer and Developmental Biology, Cambridge, UK
|