1
|
A search for the physical basis of the genetic code. Biosystems 2020; 195:104148. [DOI: 10.1016/j.biosystems.2020.104148] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2020] [Revised: 04/09/2020] [Accepted: 04/09/2020] [Indexed: 01/01/2023]
|
2
|
O'Neill PK, Erill I. Parametric bootstrapping for biological sequence motifs. BMC Bioinformatics 2016; 17:406. [PMID: 27716039 PMCID: PMC5052923 DOI: 10.1186/s12859-016-1246-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2016] [Accepted: 09/08/2016] [Indexed: 11/10/2022] Open
Abstract
Background Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively. Results We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher informational Gini coefficients (IGC) than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators. Conclusions Despite the historical centrality of biological sequence motif analysis, this study constitutes to our knowledge the first use of principled null hypotheses for sequence motifs given information content. Through their use, we are able to characterize for the first time differerences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1246-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Patrick K O'Neill
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, 21250, US
| | - Ivan Erill
- Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, 21250, US.
| |
Collapse
|
3
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
Background Motifs are regulatory elements that will activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important to understand the mechanisms of gene regulation. De novo discovery of regulatory elements, like transcription factor binding sites (TFBSs), has long been a major challenge to gain insight on mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it's difficult to evaluate the performance of these methods systematically. Result Here we present MOST+, a motif finder integrating genomic sequences and genome-wide signals such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve the prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mbs within a few minutes. Systematic comparison method has been established and MOST+ has been compared with existing methods. Conclusion MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.
Collapse
|
4
|
O'Neill PK, Forder R, Erill I. Informational requirements for transcriptional regulation. J Comput Biol 2014; 21:373-84. [PMID: 24689750 DOI: 10.1089/cmb.2014.0032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Transcription factors (TFs) regulate transcription by binding to specific sites in promoter regions. Information theory provides a useful mathematical framework to analyze the binding motifs associated with TFs but imposes several assumptions that limit their applicability to specific regulatory scenarios. Explicit simulations of the co-evolution of TFs and their binding motifs allow the study of the evolution of regulatory networks with a high degree of realism. In this work we analyze the impact of differential regulatory demands on the information content of TF-binding motifs by means of evolutionary simulations. We generalize a predictive index based on information theory, and we validate its applicability to regulatory scenarios in which the TF binds significantly to the genomic background. Our results show a logarithmic dependence of the evolved information content on the occupancy of target sites and indicate that TFs may actively exploit pseudo-sites to modulate their occupancy of target sites. In regulatory networks with differentially regulated targets, we observe that information content in TF-binding motifs is dictated primarily by the fraction of total probability mass that the TF assigns to its target sites, and we provide a predictive index to estimate the amount of information associated with arbitrarily complex regulatory systems. We observe that complex regulatory patterns can exert additional demands on evolved information content, but, given a total occupancy for target sites, we do not find conclusive evidence that this effect is because of the range of required binding affinities.
Collapse
Affiliation(s)
- Patrick K O'Neill
- 1 Department of Biological Sciences, University of Maryland Baltimore County , Baltimore, Maryland
| | | | | |
Collapse
|
5
|
Davies PC, Rieper E, Tuszynski JA. Self-organization and entropy reduction in a living cell. Biosystems 2013; 111:1-10. [PMID: 23159919 PMCID: PMC3712629 DOI: 10.1016/j.biosystems.2012.10.005] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2012] [Revised: 10/22/2012] [Accepted: 10/22/2012] [Indexed: 12/19/2022]
Abstract
In this paper we discuss the entropy and information aspects of a living cell. Particular attention is paid to the information gain on assembling and maintaining a living state. Numerical estimates of the information and entropy reduction are given and discussed in the context of the cell's metabolic activity. We discuss a solution to an apparent paradox that there is less information content in DNA than in the proteins that are assembled based on the genetic code encrypted in DNA. When energy input required for protein synthesis is accounted for, the paradox is clearly resolved. Finally, differences between biological information and instruction are discussed.
Collapse
Affiliation(s)
- Paul C.W. Davies
- Beyond Center for Fundamental Concepts in Science, Arizona State University, Tempe, AZ 85287-1504, USA
| | - Elisabeth Rieper
- Centre for Quantum Technologies, National University of Singapore, Block S15, 3 Science Drive 2, Singapore 117543, Singapore
| | - Jack A. Tuszynski
- Department of Physics, University of Alberta, Edmonton, Alberta T6G 2E1, Canada
| |
Collapse
|
6
|
Mahdevar G, Sadeghi M, Nowzari-Dalini A. Transcription factor binding sites detection by using alignment-based approach. J Theor Biol 2012; 304:96-102. [PMID: 22504445 DOI: 10.1016/j.jtbi.2012.03.039] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2011] [Revised: 03/27/2012] [Accepted: 03/29/2012] [Indexed: 11/25/2022]
Abstract
Gene expression is the main cause for the existence of various phenotypes. Through this procedure, the information stored in DNA rises to the phenotype. Essentially, gene expression is dependent upon the successful binding of transcription factors (TFs) - a specific type of proteins - to explicit positions in its upstream, TF binding sites (TFBSs). Unfortunately, finding these TFBSs is costly and laborious; therefore, discovering TFBSs computationally is a significant problem that many researches endeavor to solve. In this paper, a new TFBS discovery method is presented by considering known biological facts about TFBSs. The input to this method includes sequences with arbitrary lengths and the output comprises positions that tend to be TFBS. Through the application of previous methods along with a method that focuses on biological and simulated datasets, it is shown that this method achieves higher accuracy in discovering TFBSs.
Collapse
Affiliation(s)
- Ghasem Mahdevar
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
| | | | | |
Collapse
|
7
|
Batista MV, Ferreira TA, Freitas AC, Balbino VQ. An entropy-based approach for the identification of phylogenetically informative genomic regions of Papillomavirus. INFECTION GENETICS AND EVOLUTION 2011; 11:2026-33. [DOI: 10.1016/j.meegid.2011.09.013] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/14/2011] [Revised: 09/09/2011] [Accepted: 09/14/2011] [Indexed: 11/17/2022]
|
8
|
Kim JT, Gewehr JE, Martinetz T. BINDING MATRIX: A NOVEL APPROACH FOR BINDING SITE RECOGNITION. J Bioinform Comput Biol 2011; 2:289-307. [PMID: 15297983 DOI: 10.1142/s0219720004000569] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2003] [Revised: 12/19/2003] [Accepted: 12/19/2003] [Indexed: 11/18/2022]
Abstract
Recognition of protein-DNA binding sites in genomic sequences is a crucial step for discovering biological functions of genomic sequences. Explosive growth in availability of sequence information has resulted in a demand for binding site detection methods with high specificity. The motivation of the work presented here is to address this demand by a systematic approach based on Maximum Likelihood Estimation. A general framework is developed in which a large class of binding site detection methods can be described in a uniform and consistent way. Protein-DNA binding is determined by binding energy, which is an approximately linear function within the space of sequence words. All matrix based binding word detectors can be regarded as different linear classifiers which attempt to estimate the linear separation implied by the binding energy function. The standard approaches of consensus sequences and profile matrices are described using this framework. A maximum likelihood approach for determining this linear separation leads to a novel matrix type, called the binding matrix. The binding matrix is the most specific matrix based classifier which is consistent with the input set of known binding words. It achieves significant improvements in specificity compared to other matrices. This is demonstrated using 95 sets of experimentally determined binding words provided by the TRANSFAC database.
Collapse
|
9
|
Stochastic oscillations in genetic regulatory networks: application to microarray experiments. EURASIP JOURNAL ON BIOINFORMATICS & SYSTEMS BIOLOGY 2010:59526. [PMID: 18427584 DOI: 10.1155/bsb/2006/59526] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/19/2006] [Revised: 06/26/2006] [Accepted: 06/27/2006] [Indexed: 11/17/2022]
Abstract
We analyze the stochastic dynamics of genetic regulatory networks using a system of nonlinear differential equations. The system of S-functions is applied to capture the role of RNA polymerase in the transcription-translation mechanism. Using probabilistic properties of chemical rate equations, we derive a system of stochastic differential equations which are analytically tractable despite the high dimension of the regulatory network. Using stationary solutions of these equations, we explain the apparently paradoxical results of some recent time-course microarray experiments where mRNA transcription levels are found to only weakly correlate with the corresponding transcription rates. Combining analytical and simulation approaches, we determine the set of relationships between the size of the regulatory network, its structural complexity, chemical variability, and spectrum of oscillations. In particular, we show that temporal variability of chemical constituents may decrease while complexity of the network is increasing. This finding provides an insight into the nature of "functional determinism" of such an inherently stochastic system as genetic regulatory network.
Collapse
|
10
|
Biomolecular information gained through in vitro evolution. Biophys Rev 2010; 2:1-11. [DOI: 10.1007/s12551-009-0021-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2009] [Accepted: 10/22/2009] [Indexed: 11/30/2022] Open
|
11
|
Information content based model for the topological properties of the gene regulatory network of Escherichia coli. J Theor Biol 2009; 263:281-94. [PMID: 19962388 DOI: 10.1016/j.jtbi.2009.11.017] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2009] [Revised: 11/21/2009] [Accepted: 11/23/2009] [Indexed: 11/22/2022]
Abstract
Gene regulatory networks (GRN) are being studied with increasingly precise quantitative tools and can provide a testing ground for ideas regarding the emergence and evolution of complex biological networks. We analyze the global statistical properties of the transcriptional regulatory network of the prokaryote Escherichia coli, identifying each operon with a node of the network. We propose a null model for this network using the content-based approach applied earlier to the eukaryote Saccharomyces cerevisiae (Balcan et al., 2007). Random sequences that represent promoter regions and binding sequences are associated with the nodes. The length distributions of these sequences are extracted from the relevant databases. The network is constructed by testing for the occurrence of binding sequences within the promoter regions. The ensemble of emergent networks yields an exponentially decaying in-degree distribution and a putative power law dependence for the out-degree distribution with a flat tail, in agreement with the data. The clustering coefficient, degree-degree correlation, rich club coefficient and k-core visualization all agree qualitatively with the empirical network to an extent not yet achieved by any other computational model, to our knowledge. The significant statistical differences can point the way to further research into non-adaptive and adaptive processes in the evolution of the E. coli GRN.
Collapse
|
12
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Collapse
Affiliation(s)
- Sacha A F T van Hijum
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
| | | | | |
Collapse
|
13
|
Erill I, O'Neill MC. A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics 2009; 10:57. [PMID: 19210776 PMCID: PMC2680408 DOI: 10.1186/1471-2105-10-57] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2008] [Accepted: 02/11/2009] [Indexed: 11/10/2022] Open
Abstract
Background Searching for transcription factor binding sites in genome sequences is still an open problem in bioinformatics. Despite substantial progress, search methods based on information theory remain a standard in the field, even though the full validity of their underlying assumptions has only been tested in artificial settings. Here we use newly available data on transcription factors from different bacterial genomes to make a more thorough assessment of information theory-based search methods. Results Our results reveal that conventional benchmarking against artificial sequence data leads frequently to overestimation of search efficiency. In addition, we find that sequence information by itself is often inadequate and therefore must be complemented by other cues, such as curvature, in real genomes. Furthermore, results on skewed genomes show that methods integrating skew information, such as Relative Entropy, are not effective because their assumptions may not hold in real genomes. The evidence suggests that binding sites tend to evolve towards genomic skew, rather than against it, and to maintain their information content through increased conservation. Based on these results, we identify several misconceptions on information theory as applied to binding sites, such as negative entropy, and we propose a revised paradigm to explain the observed results. Conclusion We conclude that, among information theory-based methods, the most unassuming search methods perform, on average, better than any other alternatives, since heuristic corrections to these methods are prone to fail when working on real data. A reexamination of information content in binding sites reveals that information content is a compound measure of search and binding affinity requirements, a fact that has important repercussions for our understanding of binding site evolution.
Collapse
Affiliation(s)
- Ivan Erill
- Department of Biological Sciences, University of Maryland-Baltimore County, Baltimore, MD, USA.
| | | |
Collapse
|
14
|
Vyhlidal CA, Rogan PK, Leeder JS. Development and refinement of pregnane X receptor (PXR) DNA binding site model using information theory: insights into PXR-mediated gene regulation. J Biol Chem 2004; 279:46779-86. [PMID: 15316010 DOI: 10.1074/jbc.m408395200] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023] Open
Abstract
The pregnane X receptor (PXR) acts as a receptor to induce gene expression in response to structurally diverse xenobiotics through binding as a heterodimer with the 9-cis retinoic acid receptor (RXR) to enhancers in target gene promoters. We identified and estimated the affinities of novel PXR/RXR binding sites in regulated genes and additional genomic targets of PXR with an information theory-based model of the PXR/RXR binding site. Our initial PXR/RXR model, the result of the alignment of 15 previously characterized binding sites, was used to scan the promoters of known PXR target genes. Sites from these genes, with information contents of >8 bits bound by PXR/RXR in vitro, were used to revise the information weight matrix; this procedure was repeated by screening for progressively weaker binding sites. After three iterations of refinement, the model was based on 48 validated PXR/RXR binding sites and has an average information content (Rsequence) of 14.43 +/- 3.21 bits. A scan of the human genome predicted novel PXR/RXR binding sites in the promoters of UGT1A3 (19.78 bits at -8040 and 16.37 bits at -6930) and UGT1A6 (12.74 bits at -9216), both of which were identified previously as targets for PXR. These sites were subsequently demonstrated to specifically bind PXR/RXR in competition electrophoretic mobility shift assays. A strong PXR site was also predicted upstream of the CASP10 gene (18.69 bits at -7872) and was validated by binding studies and reporter assays as a PXR responsive element. This suggests that the PXR-mediated response extends beyond genes involved in drug biotransformation and transport.
Collapse
Affiliation(s)
- Carrie A Vyhlidal
- Section of Developmental Pharmacology and Experimental Therapeutics, Division of Pediatric Clinical Pharmacology and Medical Toxicology and Laboratory of Human Molecular Genetics, Children's Mercy Hospital and Clinics, Kansas City, Missouri 64108, USA
| | | | | |
Collapse
|