1
|
Mohanty S, Pattnaik PK, Al-Absi AA, Kang DK. A Review on Planted ( l, d) Motif Discovery Algorithms for Medical Diagnose. SENSORS (BASEL, SWITZERLAND) 2022; 22:1204. [PMID: 35161949 PMCID: PMC8838483 DOI: 10.3390/s22031204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/01/2021] [Revised: 01/19/2022] [Accepted: 01/31/2022] [Indexed: 11/16/2022]
Abstract
Personalized diagnosis of chronic disease requires capturing the continual pattern across the biological sequence. This repeating pattern in medical science is called "Motif". Motifs are the short, recurring patterns of biological sequences that are supposed signify some health disorder. They identify the binding sites for transcription factors that modulate and synchronize the gene expression. These motifs are important for the analysis and interpretation of various health issues like human disease, gene function, drug design, patient's conditions, etc. Searching for these patterns is an important step in unraveling the mechanisms of gene expression properly diagnose and treat chronic disease. Thus, motif identification has a vital role in healthcare studies and attracts many researchers. Numerous approaches have been characterized for the motif discovery process. This article attempts to review and analyze fifty-four of the most frequently found motif discovery processes/algorithms from different approaches and summarizes the discussion with their strengths and weaknesses.
Collapse
Affiliation(s)
- Satarupa Mohanty
- School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar 751024, India; (S.M.); (P.K.P.)
| | - Prasant Kumar Pattnaik
- School of Computer Engineering, KIIT Deemed to be University, Bhubaneswar 751024, India; (S.M.); (P.K.P.)
| | | | - Dae-Ki Kang
- Department of Computer & Information Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Korea
| |
Collapse
|
2
|
|
3
|
Saidi R, Dhifli W, Maddouri M, Mephu Nguifo E. Efficiently Mining Recurrent Substructures from Protein Three-Dimensional Structure Graphs. J Comput Biol 2019; 26:561-571. [DOI: 10.1089/cmb.2018.0171] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
| | - Wajdi Dhifli
- University of Lille, Faculty of Pharmaceutical and Biological Sciences, EA2694, F-59000 Lille, France
| | - Mondher Maddouri
- University of Jeddah, School of Business, Jeddah, Kingdom of Saudi Arabia
| | | |
Collapse
|
4
|
Ayad LAK, Pissis SPP, Retha A. libFLASM: a software library for fixed-length approximate string matching. BMC Bioinformatics 2016; 17:454. [PMID: 27832739 PMCID: PMC5103500 DOI: 10.1186/s12859-016-1320-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2016] [Accepted: 11/03/2016] [Indexed: 01/06/2023] Open
Abstract
Background Approximate string matching is the problem of finding all factors of a given text that are at a distance at most k from a given pattern. Fixed-length approximate string matching is the problem of finding all factors of a text of length n that are at a distance at most k from any factor of length ℓ of a pattern of length m. There exist bit-vector techniques to solve the fixed-length approximate string matching problem in time \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$\mathcal {O}(m\lceil \ell /w \rceil n)$\end{document}O(m⌈ℓ/w⌉n) and space \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$\mathcal {O}(m\lceil \ell /w\rceil)$\end{document}O(m⌈ℓ/w⌉) under the edit and Hamming distance models, where w is the size of the computer word; as such these techniques are independent of the distance threshold k or the alphabet size. Fixed-length approximate string matching is a generalisation of approximate string matching and, hence, has numerous direct applications in computational molecular biology and elsewhere. Results We present and make available libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching under both the edit and the Hamming distance models. Moreover we describe how fixed-length approximate string matching is applied to solve real problems by incorporating libFLASM into established applications for multiple circular sequence alignment as well as single and structured motif extraction. Specifically, we describe how it can be used to improve the accuracy of multiple circular sequence alignment in terms of the inferred likelihood-based phylogenies; and we also describe how it is used to efficiently find motifs in molecular sequences representing regulatory or functional regions. The comparison of the performance of the library to other algorithms show how it is competitive, especially with increasing distance thresholds. Conclusions Fixed-length approximate string matching is a generalisation of the classic approximate string matching problem. We present libFLASM, a free open-source C++ software library for solving fixed-length approximate string matching. The extensive experimental results presented here suggest that other applications could benefit from using libFLASM, and thus further maintenance and development of libFLASM is desirable.
Collapse
Affiliation(s)
- Lorraine A K Ayad
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK
| | - Solon P P Pissis
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK.
| | - Ahmad Retha
- Department of Informatics, King's College London, The Strand, London, WC2R 2LS, UK
| |
Collapse
|
5
|
Pari A, Baraani A, Parseh S. NSAMD: A new approach to discover structured contiguous substrings in sequence datasets using Next-Symbol-Array. Comput Biol Chem 2016; 64:384-395. [PMID: 27620380 DOI: 10.1016/j.compbiolchem.2016.09.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2015] [Revised: 08/25/2016] [Accepted: 09/03/2016] [Indexed: 10/21/2022]
Abstract
In many sequence data mining applications, the goal is to find frequent substrings. Some of these applications like extracting motifs in protein and DNA sequences are looking for frequently occurring approximate contiguous substrings called simple motifs. By approximate we mean that some mismatches are allowed during similarity test between substrings, and it helps to discover unknown patterns. Structured motifs in DNA sequences are frequent structured contiguous substrings which contains two or more simple motifs. There are some works that have been done to find simple motifs but these works have problems such as low scalability, high execution time, no guarantee to find all patterns, and low flexibility in adaptation to other application. The Flame is the only algorithm that can find all unknown structured patterns in a dataset and has solved most of these problems but its scalability for very large sequences is still weak. In this research a new approach named Next-Symbol-Array based Motif Discovery (NSAMD) is represented to improve scalability in extracting all unknown simple and structured patterns. To reach this goal a new data structure has been presented called Next-Symbol-Array. This data structure makes change in how to find patterns by NSAMD in comparison with Flame and helps to find structured motif faster. Proposed algorithm is as accurate as Flame and extracts all existing patterns in dataset. Performance comparisons show that NSAMD outperforms Flame in extracting structured motifs in both execution time (51% faster) and memory usage (more than 99%). Proposed algorithm is slower in extracting simple motifs but considerable improvement in memory usage (more than 99%) makes NSAMD more scalable than Flame. This advantage of NSAMD is very important in biological applications in which very large sequences are applied.
Collapse
Affiliation(s)
- Abdolvahed Pari
- Department of Software Engineering, University of Isfahan, Isfahan, Iran.
| | - Ahmad Baraani
- Department of Software Engineering, University of Isfahan, Isfahan, Iran.
| | - Saeed Parseh
- Department of Software Engineering, University of Isfahan, Isfahan, Iran.
| |
Collapse
|
6
|
Leoncini M, Montangero M, Pellegrini M, Tillan KP. CMStalker: A Combinatorial Tool for Composite Motif Discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1123-1136. [PMID: 26451824 DOI: 10.1109/tcbb.2014.2359444] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Controlling the differential expression of many thousands different genes at any given time is a fundamental task of metazoan organisms and this complex orchestration is controlled by the so-called regulatory genome encoding complex regulatory networks: several Transcription Factors bind to precise DNA regions, so to perform in a cooperative manner a specific regulation task for nearby genes. The in silico prediction of these binding sites is still an open problem, notwithstanding continuous progress and activity in the last two decades. In this paper, we describe a new efficient combinatorial approach to the problem of detecting sets of cooperating binding sites in promoter sequences, given in input a database of Transcription Factor Binding Sites encoded as Position Weight Matrices. We present CMStalker, a software tool for composite motif discovery which embodies a new approach that combines a constraint satisfaction formulation with a parameter relaxation technique to explore efficiently the space of possible solutions. Extensive experiments with 12 data sets and 11 state-of-the-art tools are reported, showing an average value of the correlation coefficient of 0.54 (against a value 0.41 of the closest competitor). This improvements in output quality due to CMStalker is statistically significant.
Collapse
|
7
|
Federico M, Leoncini M, Montangero M, Valente P. Direct vs 2-stage approaches to structured motif finding. Algorithms Mol Biol 2012; 7:20. [PMID: 22908910 PMCID: PMC3564690 DOI: 10.1186/1748-7188-7-20] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2011] [Accepted: 07/25/2012] [Indexed: 12/03/2022] Open
Abstract
Background The notion of DNA motif is a mathematical abstraction used to model regions of the DNA (known as Transcription Factor Binding Sites, or TFBSs) that are bound by a given Transcription Factor to regulate gene expression or repression. In turn, DNA structured motifs are a mathematical counterpart that models sets of TFBSs that work in concert in the gene regulations processes of higher eukaryotic organisms. Typically, a structured motif is composed of an ordered set of isolated (or simple) motifs, separated by a variable, but somewhat constrained number of “irrelevant” base-pairs. Discovering structured motifs in a set of DNA sequences is a computationally hard problem that has been addressed by a number of authors using either a direct approach, or via the preliminary identification and successive combination of simple motifs. Results We describe a computational tool, named SISMA, for the de-novo discovery of structured motifs in a set of DNA sequences. SISMA is an exact, enumerative algorithm, meaning that it finds all the motifs conforming to the specifications. It does so in two stages: first it discovers all the possible component simple motifs, then combines them in a way that respects the given constraints. We developed SISMA mainly with the aim of understanding the potential benefits of such a 2-stage approach w.r.t. direct methods. In fact, no 2-stage software was available for the general problem of structured motif discovery, but only a few tools that solved restricted versions of the problem. We evaluated SISMA against other published tools on a comprehensive benchmark made of both synthetic and real biological datasets. In a significant number of cases, SISMA outperformed the competitors, exhibiting a good performance also in most of the cases in which it was inferior. Conclusions A reflection on the results obtained lead us to conclude that a 2-stage approach can be implemented with many advantages over direct approaches. Some of these have to do with greater modularity, ease of parallelization, and the possibility to perform adaptive searches of structured motifs. As another consideration, we noted that most hard instances for SISMA were easy to detect in advance. In these cases one may initially opt for a direct method; or, as a viable alternative in most laboratories, one could run both direct and 2-stage tools in parallel, halting the computations when the first halts.
Collapse
|
8
|
|
9
|
Saidi R, Maddouri M, Mephu Nguifo E. Protein sequences classification by means of feature extraction with substitution matrices. BMC Bioinformatics 2010; 11:175. [PMID: 20377887 PMCID: PMC2868007 DOI: 10.1186/1471-2105-11-175] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2009] [Accepted: 04/08/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND This paper deals with the preprocessing of protein sequences for supervised classification. Motif extraction is one way to address that task. It has been largely used to encode biological sequences into feature vectors to enable using well-known machine-learning classifiers which require this format. However, designing a suitable feature space, for a set of proteins, is not a trivial task. For this purpose, we propose a novel encoding method that uses amino-acid substitution matrices to define similarity between motifs during the extraction step. RESULTS In order to demonstrate the efficiency of such approach, we compare several encoding methods using some machine learning classifiers. The experimental results showed that our encoding method outperforms other ones in terms of classification accuracy and number of generated attributes. We also compared the classifiers in term of accuracy. Results indicated that SVM generally outperforms the other classifiers with any encoding method. We showed that SVM, coupled with our encoding method, can be an efficient protein classification system. In addition, we studied the effect of the substitution matrices variation on the quality of our method and hence on the classification quality. We noticed that our method enables good classification accuracies with all the substitution matrices and that the variances of the obtained accuracies using various substitution matrices are slight. However, the number of generated features varies from a substitution matrix to another. Furthermore, the use of already published datasets allowed us to carry out a comparison with several related works. CONCLUSIONS The outcomes of our comparative experiments confirm the efficiency of our encoding method to represent protein sequences in classification tasks.
Collapse
Affiliation(s)
- Rabie Saidi
- LIMOS - Blaise Pascal University - Clermont University, BP 10448, Clermont-Ferrand 63000, France
- LIMOS - CNRS UMR 6158, Aubière 63173, France
- Department of Computer Science - FSJ - University of Jendouba, UMA Street, Jendouba 8100, Tunisia
- URPAH - FST - University of Tunis El Manar, Academic Campus, Tunis 2092, Tunisia
| | - Mondher Maddouri
- URPAH - FST - University of Tunis El Manar, Academic Campus, Tunis 2092, Tunisia
- Department of Computer Science - FSG - University of Gafsa, Campus of Sidi Ahmed Zarroug, Gafsa 2112, Tunisia
| | - Engelbert Mephu Nguifo
- LIMOS - Blaise Pascal University - Clermont University, BP 10448, Clermont-Ferrand 63000, France
- LIMOS - CNRS UMR 6158, Aubière 63173, France
| |
Collapse
|
10
|
|
11
|
Zaki MJ, Karypis G, Yang J. Data Mining in Bioinformatics (BIOKDD). Algorithms Mol Biol 2007; 2:4. [PMID: 17428327 PMCID: PMC1852315 DOI: 10.1186/1748-7188-2-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2006] [Accepted: 04/11/2007] [Indexed: 11/10/2022] Open
Affiliation(s)
- Mohammed J Zaki
- Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
| | - George Karypis
- Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA
| | - Jiong Yang
- Electrical Engineering and Computer Science Department, Case Western Reserve University, Cleveland, OH 44106, USA
| |
Collapse
|
12
|
Zhang Y, Zaki MJ. SMOTIF: efficient structured pattern and profile motif search. Algorithms Mol Biol 2006; 1:22. [PMID: 17118189 PMCID: PMC1679804 DOI: 10.1186/1748-7188-1-22] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2006] [Accepted: 11/21/2006] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. RESULTS SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. CONCLUSION SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/sMotif/.
Collapse
Affiliation(s)
- Yongqiang Zhang
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| | - Mohammed J Zaki
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| |
Collapse
|