1
|
Pizzi C, Ornamenti M, Spangaro S, Rombo SE, Parida L. Efficient Algorithms for Sequence Analysis with Entropic Profiles. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 15:117-128. [PMID: 28113780 DOI: 10.1109/tcbb.2016.2620143] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, by also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, beside being faster, make it possible the analysis of longer sequences, even for high degrees of resolution, than state of the art algorithms.
Collapse
|
2
|
Comin M, Verzotto D. Beyond Fixed-Resolution Alignment-Free Measures for Mammalian Enhancers Sequence Comparison. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:628-637. [PMID: 26356333 DOI: 10.1109/tcbb.2014.2306830] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
UNLABELLED The cell-type diversity is to a large degree driven by transcription regulation, i.e., enhancers. It has been recently shown that in high-level eukaryotes enhancers rarely work alone, instead they collaborate by forming clusters of cis-regulatory modules (CRMs). Even if the binding of transcription factors is sequence-specific, the identification of functionally similar enhancers is very difficult. A similarity measure to detect related regulatory sequences is crucial to understand functional correlation between two enhancers. This will allow large-scale analyses, clustering and genome-wide classifications. In this paper we present Under2, a parameter-free alignment-free statistic based on variable-length words. As opposed to traditional alignment-free methods, which are based on fixed-length patterns or, in other words, tied to a fixed resolution, our statistic is built upon variable-length words, and thus multiple resolutions are allowed. This will capture the great variability of lengths of CRMs. We evaluate several alignment-free statistics on simulated data and real ChIP-seq sequences. The new statistic is highly successful in discriminating functionally related enhancers and, in almost all experiments, it outperforms fixed-resolution methods. Finally, experiments on mouse enhancers show that Under2 can separate enhancers active in different tissues. AVAILABILITY http://www.dei.unipd.it/~ciompin/main/UnderIICRMS.html.
Collapse
|
3
|
|
4
|
Panni S, Rombo SE. Searching for repetitions in biological networks: methods, resources and tools. Brief Bioinform 2013; 16:118-36. [PMID: 24300112 DOI: 10.1093/bib/bbt084] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
We present here a compact overview of the data, models and methods proposed for the analysis of biological networks based on the search for significant repetitions. In particular, we concentrate on three problems widely studied in the literature: 'network alignment', 'network querying' and 'network motif extraction'. We provide (i) details of the experimental techniques used to obtain the main types of interaction data, (ii) descriptions of the models and approaches introduced to solve such problems and (iii) pointers to both the available databases and software tools. The intent is to lay out a useful roadmap for identifying suitable strategies to analyse cellular data, possibly based on the joint use of different interaction data types or analysis techniques.
Collapse
|
5
|
|
6
|
Comin M, Verzotto D. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms Mol Biol 2012; 7:34. [PMID: 23216990 PMCID: PMC3549825 DOI: 10.1186/1748-7188-7-34] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2012] [Accepted: 11/29/2012] [Indexed: 11/24/2022] Open
Abstract
Background With the progress of modern sequencing technologies a large number of complete genomes are now available. Traditionally the comparison of two related genomes is carried out by sequence alignment. There are cases where these techniques cannot be applied, for example if two genomes do not share the same set of genes, or if they are not alignable to each other due to low sequence similarity, rearrangements and inversions, or more specifically to their lengths when the organisms belong to different species. For these cases the comparison of complete genomes can be carried out only with ad hoc methods that are usually called alignment-free methods. Methods In this paper we propose a distance function based on subword compositions called Underlying Approach (UA). We prove that the matching statistics, a popular concept in the field of string algorithms able to capture the statistics of common words between two sequences, can be derived from a small set of “independent” subwords, namely the irredundant common subwords. We define a distance-like measure based on these subwords, such that each region of genomes contributes only once, thus avoiding to count shared subwords a multiple number of times. In a nutshell, this filter discards subwords occurring in regions covered by other more significant subwords. Results The Underlying Approach (UA) builds a scoring function based on this set of patterns, called underlying. We prove that this set is by construction linear in the size of input, without overlaps, and can be efficiently constructed. Results show the validity of our method in the reconstruction of phylogenetic trees, where the Underlying Approach outperforms the current state of the art methods. Moreover, we show that the accuracy of UA is achieved with a very small number of subwords, which in some cases carry meaningful biological information. Availability http://www.dei.unipd.it/∼ciompin/main/underlying.html
Collapse
|
7
|
Cunial F, Apostolico A. Phylogeny Construction with Rigid Gapped Motifs. J Comput Biol 2012; 19:911-27. [DOI: 10.1089/cmb.2012.0060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Fabio Cunial
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| | - Alberto Apostolico
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
| |
Collapse
|
8
|
Zhao X, Sze SH. Motif finding in DNA sequences based on skipping nonconserved positions in background Markov chains. J Comput Biol 2011; 18:759-70. [PMID: 21554019 DOI: 10.1089/cmb.2010.0197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
One strategy to identify transcription factor binding sites is through motif finding in upstream DNA sequences of potentially co-regulated genes. Despite extensive efforts, none of the existing algorithms perform very well. We consider a string representation that allows arbitrary ignored positions within the nonconserved portion of single motifs, and use O(2(l)) Markov chains to model the background distributions of motifs of length l while skipping these positions within each Markov chain. By focusing initially on positions that have fixed nucleotides to define core occurrences, we develop an algorithm to identify motifs of moderate lengths. We compare the performance of our algorithm to other motif finding algorithms on a few benchmark data sets, and show that significant improvement in accuracy can be obtained when the sites are sufficiently conserved within a given sample, while comparable performance is obtained when the site conservation rate is low. A software program (PosMotif ) and detailed results are available online at http://faculty.cse.tamu.edu/shsze/posmotif.
Collapse
Affiliation(s)
- Xiaoyan Zhao
- Department of Computer Science & Engineering, Texas A&M University, College Station, Texas 77843, USA
| | | |
Collapse
|
9
|
Comin M, Verzotto D. The irredundant class method for remote homology detection of protein sequences. J Comput Biol 2011; 18:1819-29. [PMID: 21548811 DOI: 10.1089/cmb.2010.0171] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The automatic classification of protein sequences into families is of great help for the functional prediction and annotation of new proteins. In this article, we present a method called Irredundant Class that address the remote homology detection problem. The best performing methods that solve this problem are string kernels, that compute a similarity function between pairs of proteins based on their subsequence composition. We provide evidence that almost all string kernels are based on patterns that are not independent, and therefore the associated similarity scores are obtained using a set of redundant features, overestimating the similarity between the proteins. To specifically address this issue, we introduce the class of irredundant common patterns. Loosely speaking, the set of irredundant common patterns is the smallest class of independent patterns that can describe all common patterns in a pair of sequences. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that the Irredundant Class outperforms most of the string kernels previously proposed, and it achieves results as good as the current state-of-the-art method Local Alignment, but using the same pairwise information only once.
Collapse
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, Padua, Italy
| | | |
Collapse
|
10
|
Grossi R, Pietracaprina A, Pisanti N, Pucci G, Upfal E, Vandin F. MADMX: A Strategy for Maximal Dense Motif Extraction. J Comput Biol 2011; 18:535-45. [DOI: 10.1089/cmb.2010.0177] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Affiliation(s)
| | | | - Nadia Pisanti
- Dipartimento di Informatica, Università di Pisa, Italy
| | - Geppino Pucci
- Dipartimento di Ingegneria dell'Informazione, Università di Padova, Italy
| | - Eli Upfal
- Department of Computer Science, Brown University, Providence, Rhode Island
| | - Fabio Vandin
- Department of Computer Science, Brown University, Providence, Rhode Island
| |
Collapse
|
11
|
Apostolico A, Comin M, Parida L. VARUN: discovering extensible motifs under saturation constraints. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:752-26. [PMID: 21030741 DOI: 10.1109/tcbb.2008.123] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
The discovery of motifs in biosequences is frequently torn between the rigidity of the model on one hand and the abundance of candidates on the other hand. In particular, motifs that include wild cards or "don't cares" escalate exponentially with their number, and this gets only worse if a don't care is allowed to stretch up to some prescribed maximum length. In this paper, a notion of extensible motif in a sequence is introduced and studied, which tightly combines the structure of the motif pattern, as described by its syntactic specification, with the statistical measure of its occurrence count. It is shown that a combination of appropriate saturation conditions and the monotonicity of probabilistic scores over regions of constant frequency afford us significant parsimony in the generation and testing of candidate overrepresented motifs. A suite of software programs called Varun is described, implementing the discovery of extensible motifs of the type considered. The merits of the method are then documented by results obtained in a variety of experiments primarily targeting protein sequence families. Of equal importance seems the fact that the sets of all surprising motifs returned in each experiment are extracted faster and come in much more manageable sizes than would be obtained in the absence of saturation constraints.
Collapse
Affiliation(s)
- Alberto Apostolico
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30332-0280, USA.
| | | | | |
Collapse
|
12
|
Abstract
BACKGROUND The classification of protein sequences using string algorithms provides valuable insights for protein function prediction. Several methods, based on a variety of different patterns, have been previously proposed. Almost all string-based approaches discover patterns that are not "independent, " and therefore the associated scores overcount, a multiple number of times, the contribution of patterns that cover the same region of a sequence. RESULTS In this paper we use a class of patterns, called irredundant, that is specifically designed to address this issue. Loosely speaking the set of irredundant patterns is the smallest class of "independent" patterns that can describe all common patterns in two sequences, thus they avoid overcounting. We present a novel discriminative method, called Irredundant Class, based on the statistics of irredundant patterns combined with the power of support vector machines. CONCLUSION Tests on benchmark data show that Irredundant Class outperforms most of the string algorithms previously proposed, and it achieves results as good as current state-of-the-art methods. Moreover the footprints of the most discriminative irredundant patterns can be used to guide the identification of functional regions in protein sequences.
Collapse
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, Italy
| | - Davide Verzotto
- Department of Information Engineering, University of Padova, Italy
| |
Collapse
|
13
|
|
14
|
Grossi R, Pietracaprina A, Pisanti N, Pucci G, Upfal E, Vandin F. MADMX: A Novel Strategy for Maximal Dense Motif Extraction. LECTURE NOTES IN COMPUTER SCIENCE 2009. [DOI: 10.1007/978-3-642-04241-6_30] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/04/2023]
|
15
|
Zhang S, Su W, Yang J. ARCS-Motif: discovering correlated motifs from unaligned biological sequences. Bioinformatics 2008; 25:183-9. [PMID: 19073591 DOI: 10.1093/bioinformatics/btn609] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The goal of motif discovery is to detect novel, unknown, and important signals from biology sequences. In most models, the importance of a motif is equal to the sum of the similarity of every single position. In 2006, Song et al. introduced Aggregated Related Column Score (ARCS) measure which includes correlation information to the evaluation of motif importance. The paper showed that the ARCS measure is superior to other measures. Due to the complicated nature of the ARCS motif model, we cannot directly apply existing sequential motif discovery methods to find motifs with high ARCS values. RESULTS This article presents a novel mining algorithm, ARCS-Motif, to discover related sequential motifs in biological sequences. ARCS-Motif is applied to 400 PROSITE datasets and compared with five alternative methods (CONSENSUS, Gibbs sampler, MEME, SPLASH and DIALIGN-TX). ARCS-Motif outperforms all the methods in accuracy, and most of the methods in efficiency. Although SPLASH has better efficiency than ARCS-Motif, ARCS-Motif has much better accuracy than SPLASH. On average, ARCS-Motif is able to produce the motifs which are at least 10% better than the best of the alternative methods. Among the 400 PROSITE datasets, ARCS-Motif produces the best motifs for more than 200 families. Other than SPLASH, the execution time of ARCS-Motif is less than a third of that of the fastest alternative method and its execution time grows at the slowest rate with respect to the number of sequences and the average sequence among all methods.
Collapse
Affiliation(s)
- Shijie Zhang
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA
| | | | | |
Collapse
|
16
|
Abstract
Discovering topological motifs or common topologies in one or more graphs is an important as well as an interesting problem. It had been classically viewed as the subgraph isomorphism problem. This problem and its various flavors are known to be NP-Complete. However, this does not minimize the importance of solving this problem accurately in application areas such as bioinformatics or even larger network studies. The explosion in the size of the output is usually caused by isomorphisms in the motif or graph: we present a method to handle this without sacrificing the correct answers. In this paper, we apply the natural notion of maximality, used extensively in strings, to graphs and present a simple three-step approach to solving this problem completely and exactly (without resorting to heuristics). We handle the natural combinatorial explosion due to isomorphisms inherent in the problem (which could result in output size being exponential in the input size) by the use of "compact location lists." In other words, instead of enumerating k elements out of n, we use the ((n)(k)) form in an implicit manner (k immediate neighbors of a vertex out of n possible immediate neighbors). This drastically reduces the size of the output without any loss of information. The algorithm we present is linear in terms of the size of the output encoded as compact lists.
Collapse
Affiliation(s)
- Laxmi Parida
- Computational Biology Center, IBM T.J. Watson Research Center, Yorktown Heights, New York 10598, USA.
| |
Collapse
|
17
|
|
18
|
Zhang Y, Zaki MJ. EXMOTIF: efficient structured motif extraction. Algorithms Mol Biol 2006; 1:21. [PMID: 17109757 PMCID: PMC1698483 DOI: 10.1186/1748-7188-1-21] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2006] [Accepted: 11/16/2006] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND Extracting motifs from sequences is a mainstay of bioinformatics. We look at the problem of mining structured motifs, which allow variable length gaps between simple motif components. We propose an efficient algorithm, called EXMOTIF, that given some sequence(s), and a structured motif template, extracts all frequent structured motifs that have quorum q. Potential applications of our method include the extraction of single/composite regulatory binding sites in DNA sequences. RESULTS EXMOTIF is efficient in terms of both time and space and is shown empirically to outperform RISO, a state-of-the-art algorithm. It is also successful in finding potential single/composite transcription factor binding sites. CONCLUSION EXMOTIF is a useful and efficient tool in discovering structured motifs, especially in DNA sequences. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/exMotif/.
Collapse
Affiliation(s)
- Yongqiang Zhang
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| | - Mohammed J Zaki
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| |
Collapse
|
19
|
Bridging Lossy and Lossless Compression by Motif Pattern Discovery. LECTURE NOTES IN COMPUTER SCIENCE 2006. [DOI: 10.1007/11889342_51] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/15/2023]
|
20
|
Pisanti N, Crochemore M, Grossi R, Sagot MF. Bases of motifs for generating repeated patterns with wild cards. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2005; 2:40-50. [PMID: 17044163 DOI: 10.1109/tcbb.2005.5] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
Motif inference represents one of the most important areas of research in computational biology, and one of its oldest ones. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms, or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to get at a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and, thus, smaller) than previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope to efficiently compute such bases unless the quorum is fixed.
Collapse
Affiliation(s)
- Nadia Pisanti
- Dipartimento di Informatica, Université di Pisa, Italy.
| | | | | | | |
Collapse
|