1
|
Theepalakshmi P, Reddy US. Freezing firefly algorithm for efficient planted (ℓ, d) motif search. Med Biol Eng Comput 2022; 60:511-530. [PMID: 35020123 DOI: 10.1007/s11517-021-02468-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2020] [Accepted: 11/06/2021] [Indexed: 10/19/2022]
Abstract
The detection of inimitable patterns (motif) occurring in a set of biological sequences could elevate new biological discoveries. Its application in recognition of transcription factors and their binding sites have demonstrated the necessity to attain knowledge of gene function, human diseases, and drug design. The literature identifies (ℓ, d) motif search as the widely studied problem in PMS (Planted Motif Search). This paper proposes an efficient optimization algorithm named "Freezing FireFly (FFF)" to solve (ℓ, d) motif search problem. The new strategy freezing such as local and global was added to increase the performance of the basic Firefly algorithm. It freezes the best possible out coming positions even in the lesser brighter one. The performance of the proposed algorithm is experienced on simulated and real datasets. The experimental results show that the proposed algorithm resolves the instance (50, 21) within 1.47 min in the simulated dataset. For real (such as ChIP-seq (Chromatin Immunoprecipitation)) and synthetic datasets, the proposed algorithm runs much faster in comparison to existing state-of-the-art optimization algorithms, including Samselect, TraverStringRef, PMS8, qPMS9, AlignACE, FMGA, and GSGA.
Collapse
Affiliation(s)
- P Theepalakshmi
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India.
| | - U Srinivasulu Reddy
- Machine Learning and Data Analytics Lab, Center of Excellence in Artificial Intelligence, Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu, India
| |
Collapse
|
2
|
Chen H, Chen W, Song Y, Sun L, Li X. EEG characteristics of children with attention-deficit/hyperactivity disorder. Neuroscience 2019; 406:444-456. [PMID: 30926547 DOI: 10.1016/j.neuroscience.2019.03.048] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2018] [Revised: 02/26/2019] [Accepted: 03/20/2019] [Indexed: 11/18/2022]
Abstract
The electroencephalogram (EEG) is an informative neuroimaging tool for studying attention-deficit/hyperactivity disorder (ADHD); one main goal is to characterize the EEG of children with ADHD. In this study, we employed the power spectrum, complexity and bicoherence, biomarker candidates for identifying ADHD children in a machine learning approach, to characterize resting-state EEG (rsEEG). We built support vector machine classifiers using a single type of feature, all features from a method (relative spectral power, spectral power ratio, complexity or bicoherence), or all features from all four methods. We evaluated effectiveness and performance of the classifiers using the permutation test and the area under the receiver operating characteristic curve (AUC). We analyzed the rsEEG from 50 ADHD children and 58 age-matched controls. The results show that though spectral features can be used to build a convincing model, the prediction accuracy of the model was unfortunately unstable. Bicoherence features had significant between-group differences, but classifier performance was sensitive to brain region used. rsEEG complexity of ADHD children was significantly lower than controls and may be a suitable biomarker candidate. Through a machine learning approach, 14 features from various brain regions using different methods were selected; the classifier based on these features had an AUC of 0.9158 and an accuracy of 84.59%. These findings strongly suggest that the combination of rsEEG characteristics obtained by various methods may be a tool for identifying ADHD.
Collapse
Affiliation(s)
- He Chen
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing 100875, China
| | - Wenqing Chen
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing 100875, China
| | - Yan Song
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing 100875, China
| | - Li Sun
- Peking University Sixth Hospital / Institute of Mental Health, Key Laboratory of Ministry of Health (Peking University), Beijing 100191, China
| | - Xiaoli Li
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing 100875, China.
| |
Collapse
|
3
|
Lee NK, Li X, Wang D. A comprehensive survey on genetic algorithms for DNA motif prediction. Inf Sci (N Y) 2018. [DOI: 10.1016/j.ins.2018.07.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2022]
|
4
|
Tapan S, Wang D. A Further Study on Mining DNA Motifs Using Fuzzy Self-Organizing Maps. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:113-124. [PMID: 26068877 DOI: 10.1109/tnnls.2015.2435155] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Self-organizing map (SOM)-based motif mining, despite being a promising approach for problem solving, mostly fails to offer a consistent interpretation of clusters with respect to the mixed composition of signal and noise in the nodes. The main reason behind this shortcoming comes from the similarity metrics used in data assignment, specially designed with the biological interpretation for this domain, which are not meant to consider the inevitable noise mixture in the clusters. This limits the explicability of the majority of clusters that are supposedly noise dominated, degrading the overall system clarity in motif discovery. This paper aims to improve the explicability aspect of learning process by introducing a composite similarity function (CSF) that is specially designed for the k -mer-to-cluster similarity measure with respect to the degree of motif properties and embedded noise in the cluster. Our proposed motif finding algorithm in this paper is built on our previous work robust elicitation algorithms for discovering (READ) [1] and termed READ Deoxyribonucleic acid motifs using CSFs (READ(csf)), which performs slightly better than READ and shows some remarkable improvements over SOM-based SOMBRERO and SOMEA tools in terms of F-measure on the testing data sets. A real data set containing multiple motifs is used to explore the potential of the READ(csf) for more challenging biological data mining tasks. Visual comparisons with the verified logos extracted from JASPAR database demonstrate that our algorithm is promising to discover multiple motifs simultaneously.
Collapse
|
5
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/09/2023] Open
Abstract
Background Motifs are regulatory elements that will activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important to understand the mechanisms of gene regulation. De novo discovery of regulatory elements, like transcription factor binding sites (TFBSs), has long been a major challenge to gain insight on mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it's difficult to evaluate the performance of these methods systematically. Result Here we present MOST+, a motif finder integrating genomic sequences and genome-wide signals such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve the prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mbs within a few minutes. Systematic comparison method has been established and MOST+ has been compared with existing methods. Conclusion MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.
Collapse
|
6
|
Tran NTL, Huang CH. A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data. Biol Direct 2014; 9:4. [PMID: 24555784 PMCID: PMC4022013 DOI: 10.1186/1745-6150-9-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2013] [Revised: 01/08/2014] [Accepted: 02/11/2014] [Indexed: 12/24/2022] Open
Abstract
Abstract ChIP-Seq (chromatin immunoprecipitation sequencing) has provided the advantage for finding motifs as ChIP-Seq experiments narrow down the motif finding to binding site locations. Recent motif finding tools facilitate the motif detection by providing user-friendly Web interface. In this work, we reviewed nine motif finding Web tools that are capable for detecting binding site motifs in ChIP-Seq data. We showed each motif finding Web tool has its own advantages for detecting motifs that other tools may not discover. We recommended the users to use multiple motif finding Web tools that implement different algorithms for obtaining significant motifs, overlapping resemble motifs, and non-overlapping motifs. Finally, we provided our suggestions for future development of motif finding Web tool that better assists researchers for finding motifs in ChIP-Seq data. Reviewers This article was reviewed by Prof. Sandor Pongor, Dr. Yuriy Gusev, and Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong).
Collapse
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT 06269, USA.
| | | |
Collapse
|
7
|
Wang D, Tapan S. A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2013; 24:1677-1688. [PMID: 24808603 DOI: 10.1109/tnnls.2013.2275733] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
It is important to identify DNA motifs in promoter regions to understand the mechanism of gene regulation. Computational approaches for finding DNA motifs are well recognized as useful tools to biologists, which greatly help in saving experimental time and cost in wet laboratories. Self-organizing maps (SOMs), as a powerful clustering tool, have demonstrated good potential for problem solving. However, the current SOM-based motif discovery algorithms unfairly treat data samples lying around the cluster boundaries by assigning them to one of the nodes, which may result in unreliable system performance. This paper aims to develop a robust framework for discovering DNA motifs, where fuzzy SOMs, with an integration of fuzzy c-means membership functions and a standard batch-learning scheme, are employed to extract putative motifs with varying length in a recursive manner. Experimental results on eight real datasets show that our proposed algorithm outperforms the other searching tools such as SOMBRERO, SOMEA, MEME, AlignACE, and WEEDER in terms of the F-measure and algorithm reliability. It is observed that a remarkable 24.6% improvement can be achieved compared to the state-of-the-art SOMBRERO. Furthermore, our algorithm can produce a 20% and 6.6% improvement over SOMBRERO and SOMEA, respectively, in finding multiple motifs on five artificial datasets.
Collapse
|
8
|
Wang D, Tapan S. MISCORE: a new scoring function for characterizing DNA regulatory motifs in promoter sequences. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 2:S4. [PMID: 23282090 PMCID: PMC3521183 DOI: 10.1186/1752-0509-6-s2-s4] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/31/2022]
Abstract
Background Computational approaches for finding DNA regulatory motifs in promoter sequences are useful to biologists in terms of reducing the experimental costs and speeding up the discovery process of de novo binding sites. It is important for rule-based or clustering-based motif searching schemes to effectively and efficiently evaluate the similarity between a k-mer (a k-length subsequence) and a motif model, without assuming the independence of nucleotides in motif models or without employing computationally expensive Markov chain models to estimate the background probabilities of k-mers. Also, it is interesting and beneficial to use a priori knowledge in developing advanced searching tools. Results This paper presents a new scoring function, termed as MISCORE, for functional motif characterization and evaluation. Our MISCORE is free from: (i) any assumption on model dependency; and (ii) the use of Markov chain model for background modeling. It integrates the compositional complexity of motif instances into the function. Performance evaluations with comparison to the well-known Maximum a Posteriori (MAP) score and Information Content (IC) have shown that MISCORE has promising capabilities to separate and recognize functional DNA motifs and its instances from non-functional ones. Conclusions MISCORE is a fast computational tool for candidate motif characterization, evaluation and selection. It enables to embed priori known motif models for computing motif-to-motif similarity, which is more advantageous than IC and MAP score. In addition to these merits mentioned above, MISCORE can automatically filter out some repetitive k-mers from a motif model due to the introduction of the compositional complexity in the function. Consequently, the merits of our proposed MISCORE in terms of both motif signal modeling power and computational efficiency will make it more applicable in the development of computational motif discovery tools.
Collapse
Affiliation(s)
- Dianhui Wang
- Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, Victoria 3086, Australia.
| | | |
Collapse
|
9
|
Peng Y, Li X, Wu M, Yang J, Liu M, Zhang W, Xiang B, Wang X, Li X, Li G, Shen S. New prognosis biomarkers identified by dynamic proteomic analysis of colorectal cancer. MOLECULAR BIOSYSTEMS 2012; 8:3077-88. [PMID: 22996014 DOI: 10.1039/c2mb25286d] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
Abstract
The initiation, promotion and progression of human cancer are complex, polygenic, multi-factored processes. Through systematic proteomic analysis, different stages of CRC (colorectal cancer) biopsies were examined, and 199 differentially expressed proteins were detected between TNM (the tumor, nodes, and metastasis) stages I-IV and normal tissue (One-Way Analysis of Variance, ANOVA; p≤ 0.05). Instead of looking for biomarkers to distinguish CRC from normal or identify metastatic tumors, we focused on the variation tendency of CRC carcinogenesis and the dynamic expression patterns of proteins among the different stages. Som (self-organizing map clustering) analysis revealed eight unique expression patterns and that the cancer-related proteins were dynamically expressed, and their expression levels changed continuously throughout tumorigenesis. Molecular evidence emerged much earlier than visible, clinical or histological changes, which shows the potential prospect of building molecular staging. Proteins identified by MALDI-TOF MS (Matrix-Assisted Laser Desorption/Ionization Time of Flight Mass Spectrometry) were mainly involved in energy metabolism, acetylation and signaling pathways. Validation experiments using immunoblotting and immunohistochemistry (IHC) agreed with the 2D-DIGE (two-dimensional difference in gel electrophoresis) data. After survival classifier and LOOCV (leave-one-out cross-validation) analyses, the new prognostic biomarkers (78 kDa Glucose-Regulated Protein precursor (GRP78), Fructose-bisphosphate Aldolase A (ALDOA), Carbonic Anhydrase I (CA1) and Peptidyl-prolyl cis-trans isomerase A or Cyclophilin A (PPIA)) provided good survival prediction for TNM stage I-IV patients. The new biomarkers derived from the dynamic patterns of these proteins' expression provide is a good supplementary method for determining prognosis for CRC, especially for the TNM stage III and IV patients.
Collapse
Affiliation(s)
- Ya Peng
- Department of Digestive Diseases, Hunan Provincial People's Hospital, Hunan Normal University, Changsha, Hunan 410005, China
| | | | | | | | | | | | | | | | | | | | | |
Collapse
|
10
|
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 93] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
Collapse
|