51
Gangal R, Sharma P. Human pol II promoter prediction: time series descriptors and machine learning. Nucleic Acids Res 2005; 33:1332-6. [PMID: 15741185 PMCID: PMC552959 DOI: 10.1093/nar/gki271] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Received: 09/09/2004] [Revised: 02/08/2005] [Accepted: 02/08/2005] [Indexed: 11/14/2022]
Abstract
Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations are due to the challenge of selecting appropriate features of promoters that distinguish them from non-promoters and the generalization or predictive ability of the machine-learning algorithms. In this paper we attempt to define a novel approach by using unique descriptors and machine-learning methods for the recognition of eukaryotic polymerase II promoters. In this study, non-linear time series descriptors along with non-linear machine-learning algorithms, such as support vector machine (SVM), are used to discriminate between promoter and non-promoter regions. The basic idea here is to use descriptors that do not depend on the primary DNA sequence and provide a clear distinction between promoter and non-promoter regions. The classification model built on a set of 1000 promoter and 1500 non-promoter sequences showed a 10-fold cross-validation accuracy of 87%, and an independent test set had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.
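Two of the descriptors named in this abstract, n-mer frequencies and Tsallis entropy, can be illustrated with a minimal sketch. The abstract does not say how the authors computed them, so the function names, the overlapping-window counting, and the entropic index `q = 2` below are assumptions made for illustration only.

```python
from collections import Counter
import math


def nmer_frequencies(seq, n):
    """Relative frequencies of overlapping n-mers in seq (assumed counting scheme)."""
    seq = seq.upper()
    if len(seq) < n:
        raise ValueError("sequence is shorter than the n-mer length")
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def tsallis_entropy(probs, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p^q)) / (q - 1).

    For q -> 1 this reduces to the Shannon entropy in nats.
    """
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)


# Hypothetical toy sequence, not taken from the paper's data set.
freqs = nmer_frequencies("ACGTACGTTTGACA", 2)
print(tsallis_entropy(freqs.values(), q=2.0))
```

In a classifier such as the SVM mentioned above, values like these would presumably be collected into a feature vector per sequence; that step is not shown here.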
Affiliation(s)
- Rajeev Gangal
- SciNova Technologies Pvt. Ltd, 528/43 Vishwashobha, Adjacent to Modi Ganpati, Narayan Peth, Pune 411030, Maharashtra, India
- Pankaj Sharma
- SciNova Technologies Pvt. Ltd, 528/43 Vishwashobha, Adjacent to Modi Ganpati, Narayan Peth, Pune 411030, Maharashtra, India
52
53
Vinga S, Almeida JS. Rényi continuous entropy of DNA sequences. J Theor Biol 2004; 231:377-88. [PMID: 15501469 DOI: 10.1016/j.jtbi.2004.06.030] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Received: 05/21/2004] [Accepted: 06/30/2004] [Indexed: 11/20/2022]
Abstract
Entropy measures of DNA sequences estimate their randomness or, inversely, their repeatability. L-block Shannon discrete entropy accounts for the empirical distribution of all length-L words and has convergence problems for finite sequences. A new entropy measure that extends Shannon's formalism is proposed. Rényi's quadratic entropy, calculated with the Parzen window density estimation method applied to CGR/USM continuous maps of DNA sequences, constitutes a novel technique to evaluate sequence global randomness without some of the drawbacks of the former method. The asymptotic behaviour of this new measure was analytically deduced and the calculation of entropies for several synthetic and experimental biological sequences was performed. The results obtained were compared with the distributions of the null model of randomness obtained by simulation. The biological sequences have shown a different p-value according to the kernel resolution of Parzen's method, which might indicate an unknown level of organization of their patterns. This new technique can be very useful in the study of DNA sequence complexity and provide additional tools for DNA entropy estimation. The main MATLAB applications developed and additional material are available at the webpage . Specialized functions can be obtained from the authors.
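The combination described in this abstract can be sketched in a few lines: map the sequence to chaos game representation (CGR) points, then estimate Rényi's quadratic entropy from a Gaussian Parzen window. This is a sketch under stated assumptions and not the authors' MATLAB code: the corner assignment, the isotropic Gaussian kernel with width `sigma`, and the integration over the whole plane (ignoring the unit-square boundary) are choices made here for illustration.

```python
import math

# Assumed corner assignment for the CGR unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """CGR coordinates: start at the centre, move halfway to each base's corner.

    Raises KeyError for symbols other than A, C, G, T.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def renyi_quadratic_entropy(points, sigma):
    """H2 = -ln( integral of f^2 ) for a Gaussian Parzen estimate f.

    The integral of the squared estimate equals the mean over all point
    pairs of a 2-D Gaussian with variance 2 * sigma**2 evaluated at the
    pair difference.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    var2 = 2.0 * sigma ** 2
    norm = 1.0 / (2.0 * math.pi * var2)
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * var2))
    return -math.log(total / n ** 2)


# Hypothetical toy sequence and kernel width.
print(renyi_quadratic_entropy(cgr_points("ACGTTGCAAC"), sigma=0.1))
```

The double loop is quadratic in sequence length, which is acceptable for a demonstration but would need vectorisation or subsampling for genome-scale input.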
Affiliation(s)
- Susana Vinga
- Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, R. Qta. Grande 6, 2780-156 Oeiras, Portugal.
54
55
Pöschel T, Ebeling W, Frömmel C, Ramírez R. Correction algorithm for finite sample statistics. Eur Phys J E Soft Matter 2003; 12:531-541. [PMID: 15007750 DOI: 10.1140/epje/e2004-00025-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 05/24/2023]
Abstract
Assume that in a sample of size M one finds M(i) representatives of species i, with i = 1..N*. The normalized frequency p*(i) ≡ M(i)/M, based on the finite sample, may deviate considerably from the true probabilities p(i). We propose a method to infer rank-ordered true probabilities r(i) from measured frequencies M(i). We show that the rank-ordered probabilities provide important information on the system, e.g., the true number of species and the Shannon and Rényi entropies.
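The finite-sample problem stated in this abstract can be made concrete with the naive plug-in estimators that it criticises. The sketch below computes the normalized frequencies M(i)/M and the resulting Shannon and Rényi entropies, and adds the standard Miller-Madow bias correction for comparison. The Miller-Madow term is a swapped-in textbook correction, not the rank-ordering method proposed in the paper, and all function names are hypothetical.

```python
import math
from collections import Counter


def plugin_frequencies(sample):
    """Normalized frequencies M(i)/M observed in a finite sample."""
    counts = Counter(sample)
    m = sum(counts.values())
    if m == 0:
        raise ValueError("sample is empty")
    return {species: c / m for species, c in counts.items()}


def shannon_plugin(sample):
    """Plug-in Shannon entropy in nats; biased low for small samples."""
    return -sum(p * math.log(p) for p in plugin_frequencies(sample).values())


def shannon_miller_madow(sample):
    """Plug-in estimate plus the Miller-Madow term (K - 1) / (2 M),
    where K is the number of observed species and M the sample size."""
    freqs = plugin_frequencies(sample)
    return shannon_plugin(sample) + (len(freqs) - 1) / (2 * len(sample))


def renyi_plugin(sample, alpha=2.0):
    """Plug-in Renyi entropy of order alpha (alpha != 1) in nats."""
    if abs(alpha - 1.0) < 1e-12:
        raise ValueError("use shannon_plugin for alpha = 1")
    s = sum(p ** alpha for p in plugin_frequencies(sample).values())
    return math.log(s) / (1.0 - alpha)


# Hypothetical small sample of species labels.
sample = list("AABBBCDDDDE")
print(shannon_plugin(sample), shannon_miller_madow(sample), renyi_plugin(sample))
```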
Affiliation(s)
- T Pöschel
- Humboldt Universität zu Berlin, Charité, Institut für Biochemie, Monbijoustrasse 2, 10117, Berlin, Germany.
56
Abstract
It is probable that, increasingly, genome investigations are going to be based on statistical formalization. This review summarizes the state of the art and potential of using statistics in microbial genome analysis. First, I focus on recent advances in functional genomics, such as finding genes and operons, identifying gene conversion events, detecting DNA replication origins and analysing regulatory sites. Then I describe how to use phylogenetic methods in genome analysis and methods for genome-wide scanning for positively selected amino acids. I conclude with speculations on the future course of genome statistical modeling.
Affiliation(s)
- Pietro Liò
- Department of Zoology, University of Cambridge, UK.
57
Kim JT, Martinetz T, Polani D. Bioinformatic principles underlying the information content of transcription factor binding sites. J Theor Biol 2003; 220:529-44. [PMID: 12623284 DOI: 10.1006/jtbi.2003.3153] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/22/2022]
Abstract
Empirically, it has been observed in several cases that the information content of transcription factor binding site sequences (R(sequence)) approximately equals the information content of binding site positions (R(frequency)). A general framework for formal models of transcription factors and binding sites is developed to address this issue. Measures for information content in transcription factor binding sites are revisited and theoretic analyses are compared on this basis. These analyses do not lead to consistent results. A comparative review reveals that these inconsistent approaches do not include a transcription factor state space. Therefore, a state space for mathematically representing transcription factors with respect to their binding site recognition properties is introduced into the modelling framework. Analysis of the resulting comprehensive model shows that the structure of genome state space favours equality of R(sequence) and R(frequency) indeed, but the relation between the two information quantities also depends on the structure of the transcription factor state space. This might lead to significant deviations between R(sequence) and R(frequency). However, further investigation and biological arguments show that the effects of the structure of the transcription factor state space on the relation of R(sequence) and R(frequency) are strongly limited for systems which are autonomous in the sense that all DNA-binding proteins operating on the genome are encoded in the genome itself. This provides a theoretical explanation for the empirically observed equality.
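The two quantities compared in this abstract can be computed directly in their commonly used form: R(sequence) as the summed per-position information of aligned sites, and R(frequency) as the information needed to pick the sites out of all genome positions. The sketch below assumes DNA (2 bits maximum per position), omits the small-sample correction, and uses hypothetical function names and toy inputs; it does not model the transcription factor state space that the paper introduces.

```python
import math
from collections import Counter


def r_sequence(sites):
    """Sum over aligned positions of (2 - H(position)) in bits.

    `sites` is a list of equal-length DNA strings; no small-sample
    correction is applied.
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    n = len(sites)
    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total


def r_frequency(genome_positions, n_sites):
    """Bits needed to locate n_sites among genome_positions candidates."""
    return math.log2(genome_positions / n_sites)


# Hypothetical aligned binding sites and genome size.
sites = ["TTGACA", "TTGACT", "TTTACA", "TAGACA"]
print(r_sequence(sites), r_frequency(4_000_000, len(sites)))
```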
Affiliation(s)
- Jan T Kim
- Institut für Neuro- und Bioinformatik, Seelandstrasse 1a, 23569 Lübeck, Germany.
58
Affiliation(s)
- Miquel Porta
- Institut Municipal d'Investigació Mèdica (IMIM-IMAS), Universitat Autònoma de Barcelona, Spain.
59
Abstract
Our thesis is that the DNA composition and structure of genomes are selected in part by mutation bias (GC pressure) and in part by ecology. To illustrate this point, we compare and contrast the oligonucleotide composition and the mosaic structure in 36 complete genomes and in 27 long genomic sequences from archaea and eubacteria. We report the following findings: (1) High-GC-content genomes show a large underrepresentation of short distances between G(n) and C(n) homopolymers with respect to distances between A(n) and T(n) homopolymers; we discuss selection versus mutation bias hypotheses. (2) The oligonucleotide compositions of the genomes of Neisseria (meningitidis and gonorrhoeae), Helicobacter pylori and Rhodobacter capsulatus are more biased than the other sequenced genomes. (3) The genomes of free-living species or nonchronic pathogens show more mosaic-like structure than genomes of chronic pathogens or intracellular symbionts. (4) Genome mosaicity of intracellular parasites has a maximum corresponding to the average gene length; in the genomes of free-living and nonchronic pathogens the maximum occurs at larger length scales. This suggests that free-living species can incorporate large pieces of DNA from the environment, whereas for intracellular parasites there are recombination events between homologous genes. We discuss the consequences in terms of evolution of genome size. (5) Intracellular symbionts and obligate pathogens show a small, but nonzero, amount of chromosome mosaicity, suggesting that recombination events occur in these species.
Affiliation(s)
- Pietro Liò
- Department of Zoology, University of Cambridge, United Kingdom.
60
Abstract
The complexity of large sets of non-redundant protein sequences is measured. This is done by estimating the Shannon entropy as well as applying compression algorithms to estimate the algorithmic complexity. The estimators are also applied to randomly generated surrogates of the protein data. Our results show that proteins are fairly close to random sequences. The entropy reduction due to correlations is only about 1%. However, precise estimations of the entropy of the source are not possible due to finite sample effects. Compression algorithms also indicate that the redundancy is in the order of 1%. These results confirm the idea that protein sequences can be regarded as slightly edited random strings. We discuss secondary structure and low-complexity regions as causes of the redundancy observed. The findings are related to numerical and biochemical experiments with random polypeptides.
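The comparison described in this abstract, entropy and compressibility of real sequences against randomized surrogates, can be sketched as follows. This is an illustration under assumptions and not the authors' estimators: `zlib` stands in for whatever compression algorithms they used, the surrogate is a simple composition-preserving shuffle, and the toy sequence is hypothetical. A general-purpose compressor carries overhead on short inputs, so only relative comparisons on long sequences are meaningful.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy_bits(seq):
    """Zeroth-order Shannon entropy of the symbol distribution, in bits."""
    n = len(seq)
    if n == 0:
        raise ValueError("sequence is empty")
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compressed_bits_per_symbol(seq, level=9):
    """Upper bound on complexity per symbol from zlib (assumed compressor)."""
    data = seq.encode("ascii")
    return 8.0 * len(zlib.compress(data, level)) / len(data)


def shuffled_surrogate(seq, seed=0):
    """Random surrogate with the same composition but no correlations."""
    rng = random.Random(seed)
    symbols = list(seq)
    rng.shuffle(symbols)
    return "".join(symbols)


# Hypothetical toy protein-like sequence with an obvious repeat.
protein = "MKTAYIAKQRQISFVKSHFSRQ" * 50
surrogate = shuffled_surrogate(protein)
print(shannon_entropy_bits(protein))
print(compressed_bits_per_symbol(protein), compressed_bits_per_symbol(surrogate))
```

The difference between the compressed size of the original and that of its surrogate gives a rough indication of redundancy due to correlations, which is the quantity the abstract reports as about 1% for real protein data.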
Affiliation(s)
- O Weiss
- Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstr. 43, Berlin, D-10115, Germany
61
Rojdestvenski I, Cottam MG. Diagrammatic approach to calculation of the fluctuation correlation matrix in a metabolic system. Biosystems 2000; 56:63-73. [PMID: 10880855 DOI: 10.1016/s0303-2647(00)00076-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/20/2022]
Abstract
We present here a simple diagrammatic approach for the time evolution of the fluctuations in metabolite concentrations around the steady state. A fluctuation correlation matrix is introduced to characterise the response in the concentrations of metabolites to a singular initial fluctuation in one of the metabolites. We show how the temporal evolution of the correlation matrix can be represented in the form of a series with individual terms corresponding to pathways on a metabolic graph. The basic properties of such graphs are studied and it is shown how each term in the series can be evaluated. A Monte-Carlo procedure is outlined to calculate the fluctuation correlation matrix. We discuss various properties of the graphical representation and discuss links to information theory that arise from it.
Affiliation(s)
- I Rojdestvenski
- Department of Plant Physiology, Umea University, 90187, Umea, Sweden.
62
Rojdestvenski I, Cottam MG. Mapping of statistical physics to information theory with application to biological systems. J Theor Biol 2000; 202:43-54. [PMID: 10623498 DOI: 10.1006/jtbi.1999.1042] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.1] [Indexed: 11/22/2022]
Abstract
The problem of achieving a mapping of formalisms in statistical physics and theoretical biology to information theory is discussed using an example for canonical ensembles. We extend the meaning of the Handscomb Monte-Carlo method to a general recipe for the transformation from a "configuration" space to a "sentence" space. The ensemble of "sentences" and its corresponding source uncertainty function are introduced. A possible mapping procedure based on a generalization of the Handscomb representation is described. For a biological illustration, we present a way to introduce a pathway representation to describe metabolic processes in living systems.
Affiliation(s)
- I Rojdestvenski
- Department of Plant Physiology, Umea University, Umea, 90187, Sweden.
63
Abstract
A linguistic complexity measure was applied to the complete genomes of HIV-1, Escherichia coli, Bacillus subtilis, Haemophilus influenzae, Mycoplasma genitalium, and to long human and yeast genomic fragments. Complexity values averaged over entire genomic sequences were compared, as were predicted average values of intrinsic DNA curvature. We found that both the most curved and the least complex fragments are located preferentially in non-coding parts of the genome. Analysis of the location of the most curved and the simplest regions in bacteria showed that the low-complexity segments are preferentially located in close proximity to the highly curved sequences, which are, in turn, placed from 100 to 200 bases upstream of the start of the nearest coding sequence. We conclude that the parallel analysis of sequence complexity and DNA curvature might provide important information about the sequence-structure-function relationship in genomes.
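The abstract does not define its linguistic complexity measure. One common formulation, assumed here, is the product over word lengths k of the vocabulary usage: the number of distinct k-mers observed divided by the maximum number possible for a sequence of that length. The function names and the cut-off `max_k = 3` are hypothetical, and the sketch does not cover the DNA curvature prediction.

```python
def vocabulary_usage(seq, k, alphabet_size=4):
    """Distinct k-mers observed, divided by the maximum possible count."""
    n_windows = len(seq) - k + 1
    if n_windows <= 0:
        raise ValueError("sequence is shorter than k")
    observed = len({seq[i:i + k] for i in range(n_windows)})
    max_possible = min(alphabet_size ** k, n_windows)
    return observed / max_possible


def linguistic_complexity(seq, max_k=3):
    """Product of vocabulary usage for k = 1..max_k (assumed definition)."""
    result = 1.0
    for k in range(1, max_k + 1):
        result *= vocabulary_usage(seq, k)
    return result


# Hypothetical fragments: a simple repeat scores lower than a mixed sequence.
print(linguistic_complexity("ATATATATATATATAT"))
print(linguistic_complexity("ACGTTGCAGGCTAACT"))
```

Applied in a sliding window along a genome, a score like this would flag low-complexity segments of the kind the abstract locates near highly curved regions.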
Affiliation(s)
- A Gabrielian
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA