51
|
Al-Okaily A, Huang CH. ET-Motif: Solving the Exact (l, d)-Planted Motif Problem Using Error Tree Structure. J Comput Biol 2016; 23:615-23. [PMID: 27152692 DOI: 10.1089/cmb.2015.0238] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022]
Abstract
Motif finding is an important and challenging problem in many biological applications such as discovering promoters, enhancers, locus control regions, transcription factors, and more. The (l, d)-planted motif search, PMS, is one of several variations of the problem. In this problem, there are n given sequences over an alphabet of size [Formula: see text], each of length m, and two given integers l and d. The problem is to find a motif m of length l, where in each sequence there is at least one l-mer at a Hamming distance of [Formula: see text] of m. In this article, we propose ET-Motif, an algorithm that can solve the PMS problem in [Formula: see text] time and [Formula: see text] space. The time bound can be further reduced by a factor of m with [Formula: see text] space. In case the suffix tree that is built for the input sequences is balanced, the problem can be solved in [Formula: see text] time and [Formula: see text] space. Similarly, the time bound can be reduced by a factor of m using [Formula: see text] space. Moreover, the variations of the problem, namely the edit distance PMS and edited PMS (Quorum), can be solved using ET-Motif with simple modifications but with higher upper bounds on space and time. For edit distance PMS, the time and space bounds will be increased by [Formula: see text], while for edited PMS the increase will be of [Formula: see text] in the time bound.
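To make the problem statement in this abstract concrete, the following is a minimal brute-force sketch of (l, d)-planted motif search. It is not ET-Motif and uses no suffix or error tree; it simply enumerates every candidate string of length l. It assumes the elided distance bound is the usual "at most d", and the function names (`hamming`, `occurs_within`, `brute_force_pms`) and the default DNA alphabet are illustrative choices, not taken from the paper.

```python
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate, sequence, d):
    """True if some l-mer of `sequence` is within Hamming distance d of `candidate`."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def brute_force_pms(sequences, l, d, alphabet="ACGT"):
    """Naive baseline: test every string of length l over `alphabet`.

    Exponential in l, so only usable for tiny inputs (assumed illustration only).
    """
    motifs = []
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(occurs_within(candidate, s, d) for s in sequences):
            motifs.append(candidate)
    return motifs


if __name__ == "__main__":
    seqs = ["ACGTAC", "TTACGA", "GACGTT"]
    print(brute_force_pms(seqs, l=3, d=0))
```

With d = 0 the search reduces to finding l-mers shared exactly by all sequences; raising d enlarges the candidate set, which is why practical algorithms prune the search rather than enumerate it.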
Affiliation(s)
- Anas Al-Okaily
- Computer Science & Engineering Department, University of Connecticut, Storrs, Connecticut
- Chun-Hsi Huang
- Computer Science & Engineering Department, University of Connecticut, Storrs, Connecticut
|
52
|
Tangirala K, Herndon N, Caragea D. A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification. IEEE Trans Nanobioscience 2016; 15:84-92. [PMID: 26863669 PMCID: PMC6245644 DOI: 10.1109/tnb.2016.2523501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification. Our approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently used community detection to identify groups of k-mers that appear frequently in a set of sequences. Whereas this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
|
53
|
MOCCS: Clarifying DNA-binding motif ambiguity using ChIP-Seq data. Comput Biol Chem 2016; 63:62-72. [PMID: 26971251 DOI: 10.1016/j.compbiolchem.2016.01.014] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 01/14/2016] [Accepted: 01/25/2016] [Indexed: 11/21/2022]
Abstract
BACKGROUND: As a key mechanism of gene regulation, transcription factors (TFs) bind to DNA by recognizing specific short sequence patterns that are called DNA-binding motifs. A single TF can accept ambiguity within its DNA-binding motifs, which comprise both canonical (typical) and non-canonical motifs. Clarification of such DNA-binding motif ambiguity is crucial for revealing gene regulatory networks and evaluating mutations in cis-regulatory elements. Although chromatin immunoprecipitation sequencing (ChIP-seq) now provides abundant data on the genomic sequences to which a given TF binds, existing motif discovery methods are unable to directly answer whether a given TF can bind to a specific DNA-binding motif. RESULTS: Here, we report a method for clarifying the DNA-binding motif ambiguity, MOCCS. Given ChIP-Seq data of any TF, MOCCS comprehensively analyzes and describes every k-mer to which that TF binds. Analysis of simulated datasets revealed that MOCCS is applicable to various ChIP-Seq datasets, requiring only a few minutes per dataset. Application to the ENCODE ChIP-Seq datasets proved that MOCCS directly evaluates whether a given TF binds to each DNA-binding motif, even if known position weight matrix models do not provide sufficient information on DNA-binding motif ambiguity. Furthermore, users are not required to provide numerous parameters or background genomic sequence models that are typically unavailable. MOCCS is implemented in Perl and R and is freely available via https://github.com/yuifu/moccs. CONCLUSIONS: By complementing existing motif-discovery software, MOCCS will contribute to the basic understanding of how the genome controls diverse cellular processes via DNA-protein interactions.
|
54
|
Pavesi G. ChIP-Seq Data Analysis to Define Transcriptional Regulatory Networks. Adv Biochem Eng Biotechnol 2016; 160:1-14. [PMID: 28070596 DOI: 10.1007/10_2016_43] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Indexed: 12/22/2022]
Abstract
The first step in the definition of transcriptional regulatory networks is to establish correct relationships between transcription factors (TFs) and their target genes, together with the effect of their regulatory activity (activator or repressor). Fundamental advances in this direction have been made possible by the introduction of experimental techniques such as Chromatin Immunoprecipitation, which, coupled with next-generation sequencing technologies (ChIP-Seq), permit the genome-wide identification of TF binding sites. This chapter provides a survey on how data of this kind are to be processed and integrated with expression and other types of data to infer transcriptional regulatory rules and codes.
Affiliation(s)
- Giulio Pavesi
- Department of Biosciences, University of Milan, Via Celoria 26, 20133, Milan, Italy.
|
55
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 PMCID: PMC4821295 DOI: 10.12688/f1000research.7408.2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Accepted: 02/29/2016] [Indexed: 11/22/2022]
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
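The information content that this abstract refers to can be illustrated with a short calculation. The sketch below uses the common per-column definition (maximum entropy minus observed entropy, in bits) under an assumed uniform background over the four bases; it is not the scoring code used by the authors, and the function names and the example matrix are hypothetical.

```python
import math


def column_information(probs):
    """Information content (bits) of one motif column.

    Assumes a uniform background, so the maximum is log2(len(probs)),
    i.e. 2 bits for DNA. Zero-probability entries contribute nothing.
    """
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return math.log2(len(probs)) - entropy


def motif_information(pwm):
    """Total information content of a motif given as a list of columns."""
    return sum(column_information(col) for col in pwm)


if __name__ == "__main__":
    # Hypothetical 3-column motif: a fixed base, a two-way choice, no preference.
    pwm = [
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 0.5, 0.0, 0.0],
        [0.25, 0.25, 0.25, 0.25],
    ]
    print(motif_information(pwm))
```

A motif can score high on this measure simply because the factor binds a rigid sequence, which is consistent with the abstract's point that information content alone does not establish motif quality.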
Affiliation(s)
- Caleb Kipkurui Kibet
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
- Philip Machanick
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
|
56
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 DOI: 10.12688/f1000research.7408.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 11/19/2015] [Indexed: 03/26/2024]
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. Finally, we demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
Affiliation(s)
- Caleb Kipkurui Kibet
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
- Philip Machanick
- Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
|
57
|
Shreif Z, Striegel DA, Periwal V. The jigsaw puzzle of sequence phenotype inference: Piecing together Shannon entropy, importance sampling, and Empirical Bayes. J Theor Biol 2015; 380:399-413. [PMID: 26092377 DOI: 10.1016/j.jtbi.2015.06.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/04/2015] [Revised: 04/29/2015] [Accepted: 06/05/2015] [Indexed: 11/24/2022]
Abstract
A nucleotide sequence 35 base pairs long can take 1,180,591,620,717,411,303,424 possible values. An example of systems biology datasets, protein binding microarrays, contain activity data from about 40,000 such sequences. The discrepancy between the number of possible configurations and the available activities is enormous. Thus, albeit that systems biology datasets are large in absolute terms, they oftentimes require methods developed for rare events due to the combinatorial increase in the number of possible configurations of biological systems. A plethora of techniques for handling large datasets, such as Empirical Bayes, or rare events, such as importance sampling, have been developed in the literature, but these cannot always be simultaneously utilized. Here we introduce a principled approach to Empirical Bayes based on importance sampling, information theory, and theoretical physics in the general context of sequence phenotype model induction. We present the analytical calculations that underlie our approach. We demonstrate the computational efficiency of the approach on concrete examples, and demonstrate its efficacy by applying the theory to publicly available protein binding microarray transcription factor datasets and to data on synthetic cAMP-regulated enhancer sequences. As further demonstrations, we find transcription factor binding motifs, predict the activity of new sequences and extract the locations of transcription factor binding sites. In summary, we present a novel method that is efficient (requiring minimal computational time and reasonable amounts of memory), has high predictive power that is comparable with that of models with hundreds of parameters, and has a limited number of optimized parameters, proportional to the sequence length.
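The figure quoted at the start of this abstract follows from counting four possible bases at each of 35 positions. The short check below reproduces that arithmetic and the size of the gap to roughly 40,000 measured sequences; the variable names are illustrative only.

```python
# Four bases (A, C, G, T) at each of 35 positions.
SEQUENCE_LENGTH = 35
ALPHABET_SIZE = 4

possible_sequences = ALPHABET_SIZE ** SEQUENCE_LENGTH  # exact integer arithmetic
measured_sequences = 40_000  # approximate count quoted in the abstract

# Fraction of sequence space covered by the measurements.
coverage = measured_sequences / possible_sequences

if __name__ == "__main__":
    print(f"{possible_sequences:,}")
    print(f"{coverage:.2e}")
```

Since 4^35 equals 2^70, the measured sequences cover on the order of 10^-17 of the possible configurations, which is the discrepancy the abstract describes.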
Affiliation(s)
- Zeina Shreif
- Laboratory of Biological Modeling, National Institutes of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Building 12A, 12 South Drive, Bethesda, MD 20892, USA.
- Deborah A Striegel
- Laboratory of Biological Modeling, National Institutes of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Building 12A, 12 South Drive, Bethesda, MD 20892, USA.
- Vipul Periwal
- Laboratory of Biological Modeling, National Institutes of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Building 12A, 12 South Drive, Bethesda, MD 20892, USA.
|
58
|
Zhang Y, Wang P. A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets. Biomed Res Int 2015; 2015:218068. [PMID: 26236718 PMCID: PMC4509496 DOI: 10.1155/2015/218068] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 04/08/2015] [Accepted: 06/04/2015] [Indexed: 11/17/2022]
Abstract
The high-throughput technique ChIP-seq, which couples chromatin immunoprecipitation with high-throughput sequencing, has extended the identification of transcription factor binding locations to genome-wide regions. However, most existing motif discovery algorithms are time-consuming and limited in their ability to identify binding motifs in ChIP-seq data, which are typically large in scale. To improve efficiency, we propose a fast cluster motif finding algorithm, named FCmotif, to identify (l, d) motifs in large-scale ChIP-seq data sets. It is inspired by the emerging substrings mining strategy: it finds the enriched substrings and then searches their neighborhood instances to construct PWMs and cluster motifs of different lengths. FCmotif does not follow the OOPS model constraint and can find long motifs. The effectiveness of the proposed algorithm has been demonstrated by experiments on ChIP-seq data sets from mouse ES cells. Detection of the real binding motifs and processing of full-size data of several megabytes finished in a few minutes. The experimental results show that FCmotif is well suited to (l, d) motif finding in ChIP-seq data; it also demonstrates better performance than other widely used algorithms such as MEME, Weeder, ChIPMunk, and DREME.
Affiliation(s)
- Yipu Zhang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
- Ping Wang
- Department of Automation, School of Electronics and Control Engineering, Chang'An University, Xi'an 710064, China
|
59
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023]
Abstract
Background: Motifs are regulatory elements that will activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important to understand the mechanisms of gene regulation. De novo discovery of regulatory elements, like transcription factor binding sites (TFBSs), has long been a major challenge in gaining insight into mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it is difficult to evaluate the performance of these methods systematically. Result: Here we present MOST+, a motif finder integrating genomic sequences and genome-wide signals such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve the prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mbs within a few minutes. A systematic comparison method has been established and MOST+ has been compared with existing methods. Conclusion: MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.
|
60
|
Lihu A, Holban T. A review of ensemble methods for de novo motif discovery in ChIP-Seq data. Brief Bioinform 2015; 16:964-73. [DOI: 10.1093/bib/bbv022] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 01/15/2015] [Indexed: 01/17/2023]
|
61
|
Yu Q, Huo H, Chen X, Guo H, Vitter JS, Huan J. An Efficient Algorithm for Discovering Motifs in Large DNA Data Sets. IEEE Trans Nanobioscience 2015; 14:535-44. [PMID: 25872217 DOI: 10.1109/tnb.2015.2421340] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/09/2022]
Abstract
Planted (l,d) motif discovery has been successfully used to locate transcription factor binding sites in dozens of promoter sequences over the past decade. However, there has not been enough work done in identifying (l,d) motifs in next-generation sequencing (ChIP-seq) data sets, which contain thousands of input sequences and thereby bring new challenges for accurate identification in reasonable time. To meet this need, we propose a new planted (l,d) motif discovery algorithm named MCES, which identifies motifs by mining and combining emerging substrings. Specifically, to handle larger data sets, we design a MapReduce-based strategy to mine emerging substrings in a distributed manner. Experimental results on the simulated data show that i) MCES is able to identify (l,d) motifs efficiently and effectively in thousands to millions of input sequences, and runs faster than the state-of-the-art (l,d) motif discovery algorithms, such as F-motif and TraverStringsR; ii) MCES is able to identify motifs without known lengths, and has a better identification accuracy than the competing algorithm CisFinder. Also, the validity of MCES is tested on real data sets. MCES is freely available at http://sites.google.com/site/feqond/mces.
|
62
|
Suryamohan K, Halfon MS. Identifying transcriptional cis-regulatory modules in animal genomes. Wiley Interdiscip Rev Dev Biol 2015; 4:59-84. [PMID: 25704908 PMCID: PMC4339228 DOI: 10.1002/wdev.168] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 09/24/2014] [Revised: 11/04/2014] [Accepted: 11/16/2014] [Indexed: 11/08/2022]
Abstract
Gene expression is regulated through the activity of transcription factors (TFs) and chromatin-modifying proteins acting on specific DNA sequences, referred to as cis-regulatory elements. These include promoters, located at the transcription initiation sites of genes, and a variety of distal cis-regulatory modules (CRMs), the most common of which are transcriptional enhancers. Because regulated gene expression is fundamental to cell differentiation and acquisition of new cell fates, identifying, characterizing, and understanding the mechanisms of action of CRMs is critical for understanding development. CRM discovery has historically been challenging, as CRMs can be located far from the genes they regulate, have few readily identifiable sequence characteristics, and for many years were not amenable to high-throughput discovery methods. However, the recent availability of complete genome sequences and the development of next-generation sequencing methods have led to an explosion of both computational and empirical methods for CRM discovery in model and nonmodel organisms alike. Experimentally, CRMs can be identified through chromatin immunoprecipitation directed against TFs or histone post-translational modifications, identification of nucleosome-depleted 'open' chromatin regions, or sequencing-based high-throughput functional screening. Computational methods include comparative genomics, clustering of known or predicted TF-binding sites, and supervised machine-learning approaches trained on known CRMs. All of these methods have proven effective for CRM discovery, but each has its own considerations and limitations, and each is subject to a greater or lesser number of false-positive identifications. Experimental confirmation of predictions is essential, although shortcomings in current methods suggest that additional means of validation need to be developed.
CONFLICT OF INTEREST The authors have declared no conflicts of interest for this article.
Affiliation(s)
- Kushal Suryamohan
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Marc S. Halfon
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics and Life Sciences, Buffalo, NY 14203, USA
- Molecular and Cellular Biology Department and Program in Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
|
63
|
An adiabatic quantum algorithm and its application to DNA motif model discovery. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.10.057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
|
64
|
Zambelli F, Pesole G, Pavesi G. Using Weeder, Pscan, and PscanChIP for the Discovery of Enriched Transcription Factor Binding Site Motifs in Nucleotide Sequences. Curr Protoc Bioinformatics 2014; 47:2.11.1-31. [PMID: 25199791 DOI: 10.1002/0471250953.bi0211s47] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 01/29/2023]
Abstract
One of the greatest challenges facing modern molecular biology is understanding the complex mechanisms regulating gene expression. A fundamental step in this process requires the characterization of sequence motifs involved in the regulation of gene expression at transcriptional and post-transcriptional levels. In particular, transcription is modulated by the interaction of transcription factors (TFs) with their corresponding binding sites. Weeder, Pscan, and PscanChIP are software tools freely available for noncommercial users as stand-alone or Web-based applications for the automatic discovery of conserved motifs in a set of DNA sequences likely to be bound by the same TFs. Input for the tools can be promoter sequences from co-expressed or co-regulated genes (for which Weeder and Pscan are suitable), or regions identified through genome-wide ChIP-seq or similar experiments (Weeder and PscanChIP). The motifs are either found by a de novo approach (Weeder) or by using descriptors of the binding specificity of TFs (Pscan and PscanChIP).
Affiliation(s)
- Federico Zambelli
- Dipartimento di Bioscienze, Università di Milano, Italy; Istituto di Biomembrane e Bioenergetica, Consiglio Nazionale delle Ricerche, Bari, Italy
|
65
|
Tran NTL, Huang CH. A survey of motif finding Web tools for detecting binding site motifs in ChIP-Seq data. Biol Direct 2014; 9:4. [PMID: 24555784 PMCID: PMC4022013 DOI: 10.1186/1745-6150-9-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Received: 11/07/2013] [Revised: 01/08/2014] [Accepted: 02/11/2014] [Indexed: 12/24/2022]
Abstract
ChIP-Seq (chromatin immunoprecipitation sequencing) offers an advantage for motif finding, as ChIP-Seq experiments narrow the search to binding site locations. Recent motif finding tools facilitate motif detection by providing user-friendly Web interfaces. In this work, we reviewed nine motif finding Web tools that are capable of detecting binding site motifs in ChIP-Seq data. We showed that each motif finding Web tool has its own advantages for detecting motifs that other tools may not discover. We recommend that users apply multiple motif finding Web tools that implement different algorithms to obtain significant motifs, overlapping similar motifs, and non-overlapping motifs. Finally, we provided our suggestions for future development of motif finding Web tools that better assist researchers in finding motifs in ChIP-Seq data. Reviewers: This article was reviewed by Prof. Sandor Pongor, Dr. Yuriy Gusev, and Dr. Shyam Prabhakar (nominated by Prof. Limsoon Wong).
Affiliation(s)
- Ngoc Tam L Tran
- Department of Computer Science and Engineering, University of Connecticut, 371 Fairfield Way, Unit 4155, Storrs, CT 06269, USA.
|
66
|
Jia C, Carson MB, Wang Y, Lin Y, Lu H. A new exhaustive method and strategy for finding motifs in ChIP-enriched regions. PLoS One 2014; 9:e86044. [PMID: 24475069 PMCID: PMC3901781 DOI: 10.1371/journal.pone.0086044] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 06/03/2013] [Accepted: 12/04/2013] [Indexed: 12/22/2022]
Abstract
ChIP-seq, which combines chromatin immunoprecipitation (ChIP) with next-generation parallel sequencing, allows for the genome-wide identification of protein-DNA interactions. This technology poses new challenges for the development of novel motif-finding algorithms and methods for determining exact protein-DNA binding sites from ChIP-enriched sequencing data. State-of-the-art heuristic and exhaustive search algorithms are largely limited to the identification of short (l, d) motifs (l ≤ 10, d ≤ 2) contained in ChIP-enriched regions. In this work we have developed a more powerful exhaustive method (FMotif) for finding long (l, d) motifs in DNA sequences. In conjunction with our method, we have adopted a simple ChIP-enriched sampling strategy for finding these motifs in large-scale ChIP-enriched regions. Empirical studies on synthetic samples and applications using several ChIP data sets including 16 TF (transcription factor) ChIP-seq data sets and five TF ChIP-exo data sets have demonstrated that our proposed method is capable of finding these motifs with high efficiency and accuracy. The source code for FMotif is available at http://211.71.76.45/FMotif/.
Affiliation(s)
- Caiyan Jia
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, Illinois, United States of America
- Matthew B. Carson
- Center for Healthcare Studies, Institute for Public Health and Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
- Yang Wang
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Youfang Lin
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, China
- Hui Lu
- Department of Bioengineering/Bioinformatics, University of Illinois at Chicago, Chicago, Illinois, United States of America
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai JiaoTong University, Shanghai, China
|
67
|
Gan Y, Guan J, Zhou S, Zhang W. Identifying Cis-Regulatory Elements and Modules Using Conditional Random Fields. IEEE/ACM Trans Comput Biol Bioinform 2014; 11:73-82. [PMID: 26355509 DOI: 10.1109/tcbb.2013.131] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/05/2023]
Abstract
Accurate identification of cis-regulatory elements and their correlated modules is essential for analysis of transcriptional regulation, which is a challenging problem in computational biology. Unsupervised learning has the advantage of compensating for missing annotated data, and thus shows promise for identifying cis-regulatory elements and modules. We introduced a Conditional Random Fields model, referred to as CRFEM, to integrate sequence features and long-range dependency of genomic sequences such as epigenetic features to identify cis-regulatory elements and modules at the same time. The proposed method is able to automatically learn model parameters with no labeled data and explicitly optimize the predictive probability of cis-regulatory elements and modules. In comparison with existing methods, our method is more accurate and can be used for genome-wide studies of gene regulation.
|
68
|
Wang D, Tapan S. A robust elicitation algorithm for discovering DNA motifs using fuzzy self-organizing maps. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2013; 24:1677-1688. [PMID: 24808603 DOI: 10.1109/tnnls.2013.2275733] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/03/2023]
Abstract
It is important to identify DNA motifs in promoter regions to understand the mechanism of gene regulation. Computational approaches for finding DNA motifs are well recognized as useful tools for biologists, saving considerable experimental time and cost in wet laboratories. Self-organizing maps (SOMs), as a powerful clustering tool, have demonstrated good potential for problem solving. However, current SOM-based motif discovery algorithms treat data samples lying around the cluster boundaries unfairly by assigning each of them to a single node, which may result in unreliable system performance. This paper aims to develop a robust framework for discovering DNA motifs, where fuzzy SOMs, integrating fuzzy c-means membership functions with a standard batch-learning scheme, are employed to extract putative motifs of varying length in a recursive manner. Experimental results on eight real datasets show that our proposed algorithm outperforms other search tools such as SOMBRERO, SOMEA, MEME, AlignACE, and WEEDER in terms of the F-measure and algorithm reliability. A 24.6% improvement is observed compared to the state-of-the-art SOMBRERO. Furthermore, our algorithm produces a 20% and 6.6% improvement over SOMBRERO and SOMEA, respectively, in finding multiple motifs on five artificial datasets.
|
69
|
Saha S, Lindeberg M. Bound to Succeed: transcription factor binding-site prediction and its contribution to understanding virulence and environmental adaptation in bacterial plant pathogens. MOLECULAR PLANT-MICROBE INTERACTIONS : MPMI 2013; 26:1123-1130. [PMID: 23802990 DOI: 10.1094/mpmi-04-13-0090-cr] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 06/02/2023]
Abstract
Bacterial plant pathogens rely on a battalion of transcription factors to fine-tune their response to changing environmental conditions and to marshal the genetic resources required for successful pathogenesis. Prediction of transcription factor binding sites (TFBS) represents an important tool for elucidating regulatory networks and has been conducted in multiple genera of plant-pathogenic bacteria for the purpose of better understanding mechanisms of survival and pathogenesis. The major categories of TFBS that have been characterized are reviewed here, with emphasis on in silico methods used for site identification and challenges therein, their applicability to different types of sequence datasets, and insights into mechanisms of virulence and survival that have been gained through binding-site mapping. An improved strategy for establishing E-value cutoffs when using existing models to screen uncharacterized genomes is also discussed.
|
70
|
Pettengill EA, Pettengill JB, Coleman GD. Elucidating the evolutionary history and expression patterns of nucleoside phosphorylase paralogs (vegetative storage proteins) in Populus and the plant kingdom. BMC PLANT BIOLOGY 2013; 13:118. [PMID: 23957885 PMCID: PMC3751785 DOI: 10.1186/1471-2229-13-118] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 04/03/2013] [Accepted: 08/15/2013] [Indexed: 05/09/2023]
Abstract
BACKGROUND: Nucleoside phosphorylases (NPs) have been extensively investigated in human and bacterial systems for their role in metabolic nucleotide salvaging and links to oncogenesis. In plants, NP-like proteins have not been comprehensively studied, likely because there is no evidence of a metabolic function in nucleoside salvage. However, in the forest tree genus Populus, a family of NP-like proteins functions as an important ecophysiological adaptation for inter- and intra-seasonal nitrogen storage and cycling. RESULTS: We conducted phylogenetic analyses to determine the distribution and evolution of NP-like proteins in plants. These analyses revealed two major clusters of NP-like proteins in plants. Group I proteins were encoded by genes across a wide range of plant taxa while proteins encoded by Group II genes were dominated by species belonging to the order Malpighiales and included the Populus Bark Storage Protein (BSP) and WIN4-like proteins. Additionally, we evaluated the NP-like genes in Populus by examining the transcript abundance of the 13 NP-like genes found in the Populus genome in various tissues of plants exposed to long-day (LD) and short-day (SD) photoperiods. We found that all 13 of the Populus NP-like genes belonging to either Group I or II are expressed in various tissues in both LD and SD conditions. Tests of natural selection and expression evolution analysis of the Populus genes suggest that divergence in gene expression may have occurred recently during the evolution of Populus, which supports the adaptive maintenance models. Lastly, in silico analysis of cis-regulatory elements in the promoters of the 13 NP-like genes in Populus revealed common regulatory elements known to be involved in light regulation, stress/pathogenesis and phytohormone responses. CONCLUSION: In Populus, the evolution of the NP-like protein and gene family has been shaped by duplication events and natural selection. Expression data suggest that previously uncharacterized NP-like proteins may function in nutrient sensing and/or signaling. These proteins are members of Group I NP-like proteins, which are widely distributed in many plant taxa. We conclude that NP-like proteins may function in plants, although this function is undefined.
Affiliation(s)
- Emily A Pettengill
- Department of Plant Science and Landscape Architecture, University of Maryland, Plant Science Building, College Park, Maryland, 20742, USA
- James B Pettengill
- Department of Plant Science and Landscape Architecture, University of Maryland, Takoma Park, Maryland, USA
- Gary D Coleman
- Department of Plant Science and Landscape Architecture, University of Maryland, Plant Science Building, College Park, Maryland, 20742, USA
|
71
|
Zambelli F, Pesole G, Pavesi G. PscanChIP: Finding over-represented transcription factor-binding site motifs and their correlations in sequences from ChIP-Seq experiments. Nucleic Acids Res 2013; 41:W535-43. [PMID: 23748563 PMCID: PMC3692095 DOI: 10.1093/nar/gkt448] [Citation(s) in RCA: 62] [Impact Index Per Article: 5.6] [Indexed: 11/30/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing with next-generation technologies (ChIP-Seq) has become the de facto standard for building genome-wide maps of regions bound by a given transcription factor (TF). The regions identified, however, have to be further analyzed to determine the actual DNA-binding sites for the TF, as well as sites for other TFs belonging to the same TF complex or in general co-operating or interacting with it in transcription regulation. PscanChIP is a web server that, starting from a collection of genomic regions derived from a ChIP-Seq experiment, scans them using motif descriptors like JASPAR or TRANSFAC position-specific frequency matrices, or descriptors uploaded by users, and it evaluates both motif enrichment and positional bias within the regions according to different measures and criteria. PscanChIP can successfully identify not only the actual binding sites for the TF investigated by a ChIP-Seq experiment but also secondary motifs corresponding to other TFs that tend to bind the same regions, and, if present, precise positional correlations among their respective sites. The web interface is free for use, and there is no login requirement. It is available at http://www.beaconlab.it/pscan_chip_dev.
Affiliation(s)
- Federico Zambelli
- Dipartimento di Bioscienze, Università di Milano, Via Celoria 26, 20133 Milano, Italy
|
72
|
Abstract
By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE.
Affiliation(s)
- Kevin Y Yip
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Chao Cheng
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Institute for Quantitative Biomedical Sciences, Norris Cotton Cancer Center, Geisel School of Medicine at Dartmouth, Lebanon, NH 03766, USA
- Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT 06511, USA
|