1
|
Nuel G. Moments of the Count of a Regular Expression in a Heterogeneous Random Sequence. Methodol Comput Appl Probab 2019. [DOI: 10.1007/s11009-019-09700-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
|
2
|
On the First k Moments of the Random Count of a Pattern in a Multistate Sequence Generated by a Markov Source. J Appl Probab 2016. [DOI: 10.1017/s0021900200007403] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
In this paper we develop an explicit formula that allows us to compute the firstkmoments of the random count of a pattern in a multistate sequence generated by a Markov source. We derive efficient algorithms that allow us to deal with any pattern (low or high complexity) in any Markov model (homogeneous or not). We then apply these results to the distribution of DNA patterns in genomic sequences, and we show that moment-based developments (namely Edgeworth's expansion and Gram-Charlier type-B series) allow us to improve the reliability of common asymptotic approximations, such as Gaussian or Poisson approximations.
Collapse
|
3
|
Nuel G. On the First k Moments of the Random Count of a Pattern in a Multistate Sequence Generated by a Markov Source. J Appl Probab 2016. [DOI: 10.1239/jap/1294170523] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
In this paper we develop an explicit formula that allows us to compute the first k moments of the random count of a pattern in a multistate sequence generated by a Markov source. We derive efficient algorithms that allow us to deal with any pattern (low or high complexity) in any Markov model (homogeneous or not). We then apply these results to the distribution of DNA patterns in genomic sequences, and we show that moment-based developments (namely Edgeworth's expansion and Gram-Charlier type-B series) allow us to improve the reliability of common asymptotic approximations, such as Gaussian or Poisson approximations.
Collapse
|
4
|
Régnier M, Furletova E, Yakovlev V, Roytberg M. Analysis of pattern overlaps and exact computation of P-values of pattern occurrences numbers: case of Hidden Markov Models. Algorithms Mol Biol 2015; 9:25. [PMID: 25648087 PMCID: PMC4307674 DOI: 10.1186/s13015-014-0025-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2013] [Accepted: 11/09/2014] [Indexed: 12/02/2022] Open
Abstract
Background Finding new functional fragments in biological sequences is a challenging problem. Methods addressing this problem commonly search for clusters of pattern occurrences that are statistically significant. A measure of statistical significance is the P-value of a number of pattern occurrences, i.e. the probability to find at least S occurrences of words from a pattern in a random text of length N generated according to a given probability model. All words of the pattern are supposed to be of same length. Results We present a novel algorithm SufPref that computes an exact P-value for Hidden Markov models (HMM). The algorithm is based on recursive equations on text sets related to pattern occurrences; the equations can be used for any probability model. The algorithm inductively traverses a specific data structure, an overlap graph. The nodes of the graph are associated with the overlaps of words from . The edges are associated to the prefix and suffix relations between overlaps. An originality of our data structure is that pattern need not be explicitly represented in nodes or leaves. The algorithm relies on the Cartesian product of the overlap graph and the graph of HMM states; this approach is analogous to the automaton approach from JBCB 4: 553-569. The gain in size of SufPref data structure leads to significant improvements in space and time complexity compared to existent algorithms. The algorithm SufPref was implemented as a C++ program; the program can be used both as Web-server and a stand alone program for Linux and Windows. The program interface admits special formats to describe probability models of various types (HMM, Bernoulli, Markov); a pattern can be described with a list of words, a PSSM, a degenerate pattern or a word and a number of mismatches. It is available at http://server2.lpm.org.ru/bio/online/sf/. The program was applied to compare sensitivity and specificity of methods for TFBS prediction based on P-values computed for Bernoulli models, Markov models of orders one and two and HMMs. The experiments show that the methods have approximately the same qualities. Electronic supplementary material The online version of this article (doi:10.1186/s13015-014-0025-1) contains supplementary material, which is available to authorized users.
Collapse
|
5
|
Touzain F, Petit MA, Schbath S, El Karoui M. DNA motifs that sculpt the bacterial chromosome. Nat Rev Microbiol 2011; 9:15-26. [PMID: 21164534 DOI: 10.1038/nrmicro2477] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
During the bacterial cell cycle, the processes of chromosome replication, DNA segregation, DNA repair and cell division are coordinated by precisely defined events. Tremendous progress has been made in recent years in identifying the mechanisms that underlie these processes. A striking feature common to these processes is that non-coding DNA motifs play a central part, thus 'sculpting' the bacterial chromosome. Here, we review the roles of these motifs in the mechanisms that ensure faithful transmission of genetic information to daughter cells. We show how their chromosomal distribution is crucial for their function and how it can be analysed quantitatively. Finally, the potential roles of these motifs in bacterial chromosome evolution are discussed.
Collapse
Affiliation(s)
- Fabrice Touzain
- INRA, UMR 1319, Institut Micalis, FR-78352, Jouy-en-Josas, France
| | | | | | | |
Collapse
|
6
|
Zhai Z, Ku SY, Luan Y, Reinert G, Waterman MS, Sun F. The power of detecting enriched patterns: an HMM approach. J Comput Biol 2010; 17:581-92. [PMID: 20426691 PMCID: PMC3203519 DOI: 10.1089/cmb.2009.0218] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.
Collapse
Affiliation(s)
- Zhiyuan Zhai
- School of Mathematics, Shandong University, Jinan, Shandong, P.R. China
| | - Shih-Yen Ku
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
| | - Yihui Luan
- School of Mathematics, Shandong University, Jinan, Shandong, P.R. China
| | - Gesine Reinert
- Department of Statistics, Oxford University, Oxford, United Kingdom
| | - Michael S. Waterman
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
| | - Fengzhu Sun
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, California
- TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China
| |
Collapse
|
7
|
Nuel G, Regad L, Martin J, Camproux AC. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data. Algorithms Mol Biol 2010; 5:15. [PMID: 20205909 PMCID: PMC2828453 DOI: 10.1186/1748-7188-5-15] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2009] [Accepted: 01/26/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. RESULTS The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also permits to avoid any product of convolution of the pattern distribution in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in the complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy-example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation that consists in concatenating the sequences in the data set into a single sequence. CONCLUSIONS Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We end up with a discussion on our method and on its potential improvements.
Collapse
Affiliation(s)
- Gregory Nuel
- LSG, Laboratoire Statistique et Génome, CNRS UMR-8071, INRA UMR-1152, University of Evry, Evry, France
- CNRS, Paris, France
- MAP5, Department of Applied Mathematics, CNRS UMR-8145, University Paris Descartes, Paris, France
| | - Leslie Regad
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
| | - Juliette Martin
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MIG, Mathématique Informatique et Genome, INRA UR-1077, Jouy-en-Josas, France
- IBCP, Institut de Biologie et Chimie des Protéines, IFR 128, CNRS UMR 5086, University of Lyon 1, Lyon, France
| | - Anne-Claude Camproux
- EBGM, Equipe de Bioinformatique Génomique et Moleculaire, INSERM UMRS-726, University Paris Diderot, Paris, France
- MTi, Molecules Thérapeutique in silico, INSERM UMRS-973, University Paris Diderot, Paris, France
| |
Collapse
|
8
|
On the Normal Approximation for the Distribution of the Number of Simple or Compound Patterns in a Random Sequence of Multi-state Trials. Methodol Comput Appl Probab 2007. [DOI: 10.1007/s11009-007-9019-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
|
9
|
Nuel G. Pattern statistics on Markov chains and sensitivity to parameter estimation. Algorithms Mol Biol 2006; 1:17. [PMID: 17044916 PMCID: PMC1647278 DOI: 10.1186/1748-7188-1-17] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2006] [Accepted: 10/17/2006] [Indexed: 11/21/2022] Open
Abstract
Background: In order to compute pattern statistics in computational biology a Markov model is commonly used to take into account the sequence composition. Usually its parameter must be estimated. The aim of this paper is to determine how sensitive these statistics are to parameter estimation, and what are the consequences of this variability on pattern studies (finding the most over-represented words in a genome, the most significant common words to a set of sequences,...). Results: In the particular case where pattern statistics (overlap counting only) computed through binomial approximations we use the delta-method to give an explicit expression of σ, the standard deviation of a pattern statistic. This result is validated using simulations and a simple pattern study is also considered. Conclusion: We establish that the use of high order Markov model could easily lead to major mistakes due to the high sensitivity of pattern statistics to parameter estimation.
Collapse
Affiliation(s)
- Grégory Nuel
- Laboratoire Statistique et Génome, University of Evry, CNRS (8071), INRA(1152), 523, place des terrasses de I'Agora, 91034 Evry CEDEX, France.
| |
Collapse
|
10
|
Nuel G. Effective p-value computations using Finite Markov Chain Imbedding (FMCI): application to local score and to pattern statistics. Algorithms Mol Biol 2006; 1:5. [PMID: 16722531 PMCID: PMC1479348 DOI: 10.1186/1748-7188-1-5] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2006] [Accepted: 04/07/2006] [Indexed: 11/21/2022] Open
Abstract
The technique of Finite Markov Chain Imbedding (FMCI) is a classical approach to complex combinatorial problems related to sequences. In order to get efficient algorithms, it is known that such approaches need to be first rewritten using recursive relations. We propose here to give here a general recursive algorithms allowing to compute in a numerically stable manner exact Cumulative Distribution Function (CDF) or complementary CDF (CCDF). These algorithms are then applied in two particular cases: the local score of one sequence and pattern statistics. In both cases, asymptotic developments are derived. For the local score, our new approach allows for the very first time to compute exact p-values for a practical study (finding hydrophobic segments in a protein database) where only approximations were available before. In this study, the asymptotic approximations appear to be completely unreliable for 99.5% of the considered sequences. Concerning the pattern statistics, the new FMCI algorithms dramatically outperform the previous ones as they are more reliable, easier to implement, faster and with lower memory requirements.
Collapse
Affiliation(s)
- Grégory Nuel
- Laboratoire Statistique et Génome, UEVE, CNRS (8071), INRA (1152), Evry, France.
| |
Collapse
|
11
|
Abstract
SUMMARY S-SPatt allows the counting of patterns occurrences in text files and, assuming these texts are generated from a random Markovian source, the computation of the P-value of a given observation using a simple binomial approximation.
Collapse
Affiliation(s)
- Grégory Nuel
- Laboratoire Statistique et Génome 523 place des terrasses de l'Agora, 91000 Evry, France.
| |
Collapse
|