1
|
Maseko NN, Steenkamp ET, Wingfield BD, Wilken PM. An in Silico Approach to Identifying TF Binding Sites: Analysis of the Regulatory Regions of BUSCO Genes from Fungal Species in the Ceratocystidaceae Family. Genes (Basel) 2023; 14:genes14040848. [PMID: 37107606 PMCID: PMC10137650 DOI: 10.3390/genes14040848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 03/26/2023] [Accepted: 03/27/2023] [Indexed: 04/03/2023] Open
Abstract
Transcriptional regulation controls gene expression through regulatory promoter regions that contain conserved sequence motifs. These motifs, also known as regulatory elements, are critically important to expression, which is driving research efforts to identify and characterize them. Yeasts have been the focus of such studies in fungi, including in several in silico approaches. This study aimed to determine whether in silico approaches could be used to identify motifs in the Ceratocystidaceae family, and if present, to evaluate whether these correspond to known transcription factors. This study targeted the 1000 base-pair region upstream of the start codon of 20 single-copy genes from the BUSCO dataset for motif discovery. Using the MEME and Tomtom analysis tools, conserved motifs at the family level were identified. The results show that such in silico approaches could identify known regulatory motifs in the Ceratocystidaceae and other unrelated species. This study provides support to ongoing efforts to use in silico analyses for motif discovery.
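The 1000 base-pair upstream window described in this abstract can be illustrated with a short sketch. It is not the authors' pipeline: the function names `upstream_region` and `reverse_complement`, the 0-based coordinate convention, and the toy sequences are assumptions made purely for illustration.

```python
# Hypothetical helper: extract the region upstream of a start codon.
# Coordinates are 0-based indices into the forward-strand sequence (an assumption).

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome, start_codon_index, strand="+", length=1000):
    """Return up to `length` bases upstream of a start codon.

    For a '+' gene, `start_codon_index` is the index of the A in ATG.
    For a '-' gene, it is the forward-strand index that pairs with that A,
    i.e. the last forward-strand base covered by the start codon.
    The window is truncated at the sequence ends.
    """
    if strand == "+":
        begin = max(0, start_codon_index - length)
        return chromosome[begin:start_codon_index]
    if strand == "-":
        end = min(len(chromosome), start_codon_index + 1 + length)
        # Read the window in the gene's own 5'-to-3' orientation.
        return reverse_complement(chromosome[start_codon_index + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy examples with a 3 bp window instead of 1000 bp.
plus_window = upstream_region("CCCCCATGAAA", 5, "+", length=3)
minus_window = upstream_region("TTTCATGGGGA", 5, "-", length=3)
```

The resulting windows would then be written to a FASTA file and passed to a motif discovery tool; that step is omitted here because it depends on the tool's own input requirements.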
|
2
|
Sequence graph transform (SGT): a feature embedding function for sequence data mining. Data Min Knowl Discov 2022. [DOI: 10.1007/s10618-021-00813-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
|
3
|
Vahed M, Vahed M, Garmire LX. BML: a versatile web server for bipartite motif discovery. Brief Bioinform 2021; 23:6490318. [PMID: 34974623 PMCID: PMC8769915 DOI: 10.1093/bib/bbab536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2021] [Revised: 11/18/2021] [Accepted: 11/19/2021] [Indexed: 11/28/2022] Open
Abstract
Motif discovery and characterization are important for gene regulation analysis. The lack of intuitive and integrative web servers impedes the effective use of motifs. Most motif discovery web tools are either not designed for non-expert users or lack optimization steps when using default settings. Here we describe bipartite motifs learning (BML), a parameter-free web server that provides a user-friendly portal for online discovery and analysis of sequence motifs, using high-throughput sequencing data as the input. BML utilizes both a position weight matrix and a dinucleotide weight matrix, the latter of which enables the expression of the interdependencies of neighboring bases. When input parameters concerning the motifs are given, BML achieves significantly higher accuracy than other available tools for motif finding. When no parameters are given by non-expert users, unlike other tools, BML employs a learning method to identify motifs automatically and achieves accuracy comparable to the scenario where the parameters are set. The BML web server is freely available at http://motif.t-ridership.com/ (https://github.com/Mohammad-Vahed/BML).
Affiliation(s)
- Mohammad Vahed
  - Department of Pathology & Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles (UCLA), California, USA; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, 48105, USA
- Majid Vahed
  - Pharmaceutical Sciences Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Lana X Garmire
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, 48105, USA
|
4
|
Menzel M, Hurka S, Glasenhardt S, Gogol-Döring A. NoPeak: k-mer-based motif discovery in ChIP-Seq data without peak calling. Bioinformatics 2021; 37:596-602. [PMID: 32991679 DOI: 10.1093/bioinformatics/btaa845] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/05/2020] [Accepted: 09/14/2020] [Indexed: 01/30/2023] Open
Abstract
Motivation: The discovery of sequence motifs mediating DNA-protein binding usually implies the determination of binding sites using high-throughput sequencing and peak calling. The determination of peaks, however, depends strongly on data quality and is susceptible to noise. Results: Here, we present a novel approach to reliably identify transcription factor-binding motifs from ChIP-Seq data without peak detection. By evaluating the distributions of sequencing reads around the different k-mers in the genome, we are able to identify binding motifs in ChIP-Seq data that yield no results in traditional pipelines. Availability and implementation: NoPeak is published under the GNU General Public License and available as a standalone console-based Java application at https://github.com/menzel/nopeak. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michael Menzel
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
- Sabine Hurka
  - Institute for Insect Biotechnology, Justus Liebig University, Giessen 35392, Germany
- Stefan Glasenhardt
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
- Andreas Gogol-Döring
  - MNI, Technische Hochschule Mittelhessen, University of Applied Sciences, Giessen 35390, Germany
|
5
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
DNA/RNA motif mining is the foundation of gene function research. It plays an extremely important role in identifying DNA- or RNA-protein binding sites, which helps to understand the mechanism of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for mining motifs. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods have made great progress, and deep learning algorithms in particular have achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing, the features of existing deep learning architectures, and the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that more complex models tend to perform better than simple ones when data are sufficient, and that the current methods are relatively simple compared with those in other fields such as computer vision, natural language processing (NLP), and computer games. Therefore, it is necessary to summarize motif mining by deep learning, which can help researchers understand this field.
Affiliation(s)
- Ying He
  - computer science and technology at Tongji University, China
- Zhen Shen
  - computer science and technology at Tongji University, China
- Qinhu Zhang
  - computer science and technology at Tongji University, China
- Siguo Wang
  - computer science and technology at Tongji University, China
- De-Shuang Huang
  - Institute of Machines Learning and Systems Biology, Tongji University
|
6
|
Sultan I, Fromion V, Schbath S, Nicolas P. Statistical modelling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes. J R Soc Interface 2020; 17:20200600. [PMID: 33023397 PMCID: PMC7653377 DOI: 10.1098/rsif.2020.0600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2020] [Accepted: 09/10/2020] [Indexed: 11/12/2022] Open
Abstract
Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for Listeria monocytogenes. The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.
Affiliation(s)
- Ibrahim Sultan
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- Pierre Nicolas
  - Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
|
7
|
Chahal G, Tyagi S, Ramialison M. Navigating the non-coding genome in heart development and Congenital Heart Disease. Differentiation 2019; 107:11-23. [PMID: 31102825 DOI: 10.1016/j.diff.2019.05.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/03/2018] [Revised: 01/14/2019] [Accepted: 05/06/2019] [Indexed: 12/12/2022]
Abstract
Congenital Heart Disease (CHD) is characterised by a wide range of cardiac defects, from mild to life-threatening, which occur in babies worldwide. To date, there is no cure for CHD; however, progress in surgery has reduced its mortality, allowing children affected by CHD to reach adulthood. In an effort to understand its genetic basis, several studies involving whole-genome sequencing (WGS) of patients with CHD have been undertaken and have generated a great wealth of information. The majority of putative causative mutations identified in WGS studies fall into the non-coding part of the genome. Unfortunately, because the function of these non-coding mutations is poorly understood, it is challenging to establish a causal link between a non-coding mutation and the disease. Thus, here we review the state-of-the-art approaches to interpret non-coding mutations in the context of CHD and address the following questions: What are the non-coding sequences important for cardiac function? Which technologies are used to identify them? Which resources are available to analyse them? What mutations are expected in these non-coding sequences? Learning from developmental processes, what is their expected role in CHD?
Affiliation(s)
- Gulrez Chahal
  - Australian Regenerative Medicine Institute (ARMI), 15 Innovation Walk, Monash University, Wellington Road, Clayton, 3800, VIC, Australia; Systems Biology Institute (SBI), Wellington Road, Clayton, 3800, VIC, Australia
- Sonika Tyagi
  - School of Biological Sciences, Monash University, Wellington Road, Clayton, 3800, VIC, Australia; Australian Genome Research Facility, 305 Grattan Street, Melbourne, VIC, 3000, Australia
- Mirana Ramialison
  - Australian Regenerative Medicine Institute (ARMI), 15 Innovation Walk, Monash University, Wellington Road, Clayton, 3800, VIC, Australia; Systems Biology Institute (SBI), Wellington Road, Clayton, 3800, VIC, Australia
|
8
|
Lebatteux D, Remita AM, Diallo AB. Toward an Alignment-Free Method for Feature Extraction and Accurate Classification of Viral Sequences. J Comput Biol 2019; 26:519-535. [PMID: 31050550 DOI: 10.1089/cmb.2018.0239] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022] Open
Abstract
The classification of pathogens in emerging and re-emerging viruses is of major interest in taxonomic studies, functional genomics, host-pathogen interplay, prevention, and disease treatments. It consists of assigning a given sequence to its related group of known sequences sharing similar characteristics and traits. The challenges to such classification could be associated with several virus properties including recombination, mutation rate, multiplicity of motifs, and diversity. In domains such as pathogen monitoring and surveillance, it is important to detect and quantify known and novel taxa without exploiting the full and accurate alignments or virus family profiles. In this study, we propose an alignment-free method, CASTOR-KRFE, to detect discriminating subsequences within known pathogen sequences in order to accurately classify unknown pathogen sequences. This method includes three major steps: (1) vectorization of known viral genomic sequences based on k-mers to constitute the potential features, (2) efficient pattern extraction and evaluation maximizing classification performance, and (3) prediction of the minimal set of features fitting a given criterion (threshold of performance metric and maximum number of features). We assessed this method through a jackknife data partitioning on a dozen virus data sets, covering the seven major virus groups and including influenza virus, Ebola virus, human immunodeficiency virus 1, hepatitis C virus, hepatitis B virus, and human papillomavirus. CASTOR-KRFE provides a weighted average F-measure >0.96 over a wide range of viruses. Our method also shows better performance on complex virus data sets than multiple subsequences extractor for classification (MISSEL), a subsequence extraction method, and the Discriminative mode of the MEME pattern extraction tool.
Affiliation(s)
- Dylan Lebatteux
  - Department of Computer Science, Université du Québec à Montréal, Montreal, Canada
- Amine M Remita
  - Department of Computer Science, Université du Québec à Montréal, Montreal, Canada
|
9
|
Djordjevic M, Rodic A, Graovac S. From biophysics to 'omics and systems biology. Eur Biophys J 2019; 48:413-424. [PMID: 30972433 DOI: 10.1007/s00249-019-01366-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/12/2018] [Revised: 02/12/2019] [Accepted: 04/03/2019] [Indexed: 01/03/2023]
Abstract
Recent decades brought a revolution to biology, driven mainly by exponentially increasing amounts of data coming from "'omics" sciences. To handle these data, bioinformatics often has to combine biologically heterogeneous signals, for which methods from statistics and engineering (e.g. machine learning) are often used. While such an approach is sometimes necessary, it effectively treats the underlying biological processes as a black box. Similarly, systems biology deals with inherently complex systems, characterized by a large number of degrees of freedom, and interactions that are highly non-linear. To deal with this complexity, the underlying physical interactions are often (over)simplified, such as in Boolean modelling of network dynamics. In this review, we argue for the utility of applying a biophysical approach in bioinformatics and systems biology, including discussion of two examples from our research which address sequence analysis and understanding intracellular gene expression dynamics.
Affiliation(s)
- Marko Djordjevic
  - Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia
- Andjela Rodic
  - Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia; Interdisciplinary PhD Program in Biophysics, University of Belgrade, Belgrade, Serbia
- Stefan Graovac
  - Faculty of Biology, Institute of Physiology and Biochemistry, University of Belgrade, Belgrade, Serbia; Interdisciplinary PhD Program in Biophysics, University of Belgrade, Belgrade, Serbia
|
10
|
Bioinformatics Approaches to Gain Insights into cis-Regulatory Motifs Involved in mRNA Localization. Adv Exp Med Biol 2019; 1203:165-194. [PMID: 31811635 DOI: 10.1007/978-3-030-31434-7_7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/21/2022]
Abstract
Messenger RNA (mRNA) is a fundamental intermediate in the expression of proteins. As an integral part of this important process, protein production can be localized by the targeting of mRNA to a specific subcellular compartment. The subcellular destination of mRNA is suggested to be governed by a region of its primary sequence or secondary structure, which consequently dictates the recruitment of trans-acting factors, such as RNA-binding proteins or regulatory RNAs, to form a messenger ribonucleoprotein particle. This molecular ensemble is requisite for precise spatiotemporal control of gene expression. In the context of RNA localization, the description of the binding preferences of an RNA-binding protein defines a motif, and one or more instances of a given motif are defined as a localization element (zip code). In this chapter, we first discuss the cis-regulatory motifs previously identified as mRNA localization elements. We then describe motif representation in terms of entropy and information content and offer an overview of motif databases and search algorithms. Finally, we provide an outline of the motif topology of asymmetrically localized mRNA molecules.
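To make "motif representation in terms of entropy and information content" concrete, the following sketch computes the standard per-column information content, log2 of the alphabet size minus the Shannon entropy of the column. The function names and the three-column RNA motif are hypothetical, and no small-sample correction or background model is applied, which is a simplifying assumption.

```python
import math


def column_information(freqs, alphabet_size=4):
    """Information content in bits of one motif column.

    `freqs` holds the base frequencies of the column (summing to 1).
    Uniform background and no small-sample correction are assumed.
    """
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(alphabet_size) - entropy


def motif_information(columns):
    """Total information content of a motif: the sum over its columns."""
    return sum(column_information(col) for col in columns)


# Hypothetical 3-position RNA motif; each column lists frequencies for A, C, G, U.
example_motif = [
    [1.0, 0.0, 0.0, 0.0],      # fully conserved position
    [0.5, 0.5, 0.0, 0.0],      # two equally likely bases
    [0.25, 0.25, 0.25, 0.25],  # uninformative position
]
total_bits = motif_information(example_motif)
```

A fully conserved column contributes 2 bits, a column split evenly between two bases contributes 1 bit, and a uniform column contributes nothing, which is the quantity that sequence logos display as letter-stack height.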
|
11
|
Saad C, Noé L, Richard H, Leclerc J, Buisine MP, Touzet H, Figeac M. DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data. BMC Bioinformatics 2018; 19:223. [PMID: 29890948 PMCID: PMC5996464 DOI: 10.1186/s12859-018-2215-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/31/2017] [Accepted: 05/21/2018] [Indexed: 12/30/2022] Open
Abstract
Background: Discovering over-represented approximate motifs in DNA sequences is an essential part of bioinformatics. This topic has been studied extensively because of the increasing number of potential applications. However, it remains a difficult challenge, especially with the huge quantity of data generated by high-throughput sequencing technologies. To overcome this problem, existing tools use greedy algorithms and probabilistic approaches to find motifs in reasonable time. Nevertheless, these approaches lack sensitivity and have difficulties coping with rare and subtle motifs. Results: We developed DiNAMO (for DNA MOtif), a new software tool based on an exhaustive and efficient algorithm for IUPAC motif discovery. We evaluated DiNAMO on synthetic and real datasets with two different applications, namely ChIP-seq peaks and Systematic Sequencing Error analysis. DiNAMO compares favorably with other existing methods and is robust to noise. Conclusions: We show that the DiNAMO software can serve as a tool to search for degenerate motifs in an exact manner using IUPAC models. DiNAMO can be used in scanning mode with sliding windows or in fixed position mode, which makes it suitable for numerous potential applications. Availability: https://github.com/bonsai-team/DiNAMO. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2215-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Chadi Saad
  - Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France; Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Laurent Noé
  - Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France
- Hugues Richard
  - Sorbonne Université, UMR7238, Laboratory Computational and Quantitative Biology, LCQB, Paris, F-75005, France
- Julie Leclerc
  - Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Marie-Pierre Buisine
  - Univ. Lille, Inserm, Lille University Hospital, UMR-S 1172 - JPARC - Centre de Recherche Jean-Pierre AUBERT, Lille, F-59000, France
- Hélène Touzet
  - Univ. Lille, CNRS, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, Lille, France
- Martin Figeac
  - Univ. Lille. Plateau de génomique fonctionnelle et structurale, Lille, F-59000, France
|
12
|
Caldonazzo Garbelini JM, Kashiwabara AY, Sanches DS. Sequence motif finder using memetic algorithm. BMC Bioinformatics 2018; 19:4. [PMID: 29298679 PMCID: PMC5751424 DOI: 10.1186/s12859-017-2005-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 08/08/2017] [Accepted: 12/18/2017] [Indexed: 11/10/2022] Open
Abstract
Background: De novo prediction of Transcription Factor Binding Sites (TFBS) using computational methods is a difficult task and an important problem in Bioinformatics. The correct recognition of TFBS plays an important role in understanding the mechanisms of gene regulation and helps to develop new drugs. Results: We here present Memetic Framework for Motif Discovery (MFMD), an algorithm that uses semi-greedy constructive heuristics as a local optimizer. In addition, we used a hybridization of the classic genetic algorithm as a global optimizer to refine the solutions initially found. MFMD can find and classify overrepresented patterns in DNA sequences and predict their respective initial positions. MFMD performance was assessed using ChIP-seq data retrieved from the JASPAR site, promoter sequences extracted from the ABS site, and artificially generated synthetic data. MFMD was evaluated and compared with two well-known approaches in the literature, MEME and Gibbs Motif Sampler, achieving a higher F-score on most of the datasets used in this work. Conclusions: We have developed an approach for detecting motifs in biopolymer sequences. MFMD is freely available software that can be promising as an alternative to the development of new tools for de novo motif discovery. Its open-source software can be downloaded at https://github.com/jadermcg/mfmd. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-2005-1) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jader M Caldonazzo Garbelini
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- André Y Kashiwabara
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
- Danilo S Sanches
  - Department of Computer Science, Bioinformatics Graduate Program, Federal University of Technology - Paraná, Cornélio Procópio, PR, Brazil
|
13
|
Triska M, Ivliev A, Nikolsky Y, Tatarinova TV. Analysis of cis-Regulatory Elements in Gene Co-expression Networks in Cancer. Methods Mol Biol 2017; 1613:291-310. [PMID: 28849565 DOI: 10.1007/978-1-4939-7027-8_11] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 01/22/2023]
Abstract
Analysis of gene co-expression networks is a powerful "data-driven" tool, invaluable for understanding cancer biology and mechanisms of tumor development. Yet, despite the completion of thousands of studies on cancer gene expression, there have been few attempts to normalize and integrate co-expression data from scattered sources in a concise "meta-analysis" framework. Here we describe an integrated approach to cancer expression meta-analysis, which combines generation of "data-driven" co-expression networks with detailed statistical detection of promoter sequence motifs within the co-expression clusters. First, we applied the Weighted Gene Co-Expression Network Analysis (WGCNA) workflow and Pearson's correlation to generate a comprehensive set of over 3000 co-expression clusters in 82 normalized microarray datasets from nine cancers of different origin. Next, we designed a genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. The approach, realized as the cisExpress software module, was specifically designed for analysis of very large data sets such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We discovered that although co-expression modules are populated with different sets of genes, they share distinct stable patterns of co-regulation based on promoter sequence analysis. The number of motifs per co-expression cluster varies widely in accordance with cancer tissue of origin, with the largest number in colon (68 motifs) and the lowest in ovary (18 motifs). The top scored motifs are typically shared between several tissues; they define sets of target genes responsible for certain functionality of carcinogenesis. Both the co-expression modules and a database of precalculated motifs are publicly available and accessible for further studies.
Affiliation(s)
- Martin Triska
  - Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Yuri Nikolsky
  - Prosapia Genetics, Solana Beach, CA, USA; School of Systems Biology, George Mason University, Fairfax, VA, USA
- Tatiana V Tatarinova
  - Spatial Sciences Institute, University of Southern California, Los Angeles, CA, USA; Center for Personalized Medicine, Children's Hospital Los Angeles, 4640 Hollywood Blvd, Los Angeles, CA, 90027, USA; A.A. Kharkevich Institute for Information Transmission Problems RAS, Moscow, Russia
|
14
|
Czeizler E, Hirvola T, Karhu K. A graph-theoretical approach for motif discovery in protein sequences. IEEE/ACM Trans Comput Biol Bioinform 2017; 14:121-130. [PMID: 28055896 DOI: 10.1109/tcbb.2015.2511750] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/06/2023]
Abstract
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, and are thus either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph-theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods, MEME and GLAM2, analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available on GitHub: https://github.com/hirvolt1/debruijn-motif.
|
15
|
Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics 2016; 17:547. [PMID: 27806697 PMCID: PMC6889335 DOI: 10.1186/s12859-016-1298-9] [Citation(s) in RCA: 58] [Impact Index Per Article: 7.3] [Received: 06/19/2016] [Accepted: 10/20/2016] [Indexed: 12/21/2022] Open
Abstract
Background: Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. Results: We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best in these two classes, respectively. Conclusions: Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and in evaluating the effects of single nucleotide variations in causing disease. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1298-9) contains supplementary material, which is available to authorized users.
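The idea of "scanning a position weight matrix (PWM) against DNA" can be sketched in a few lines. This is a generic illustration, not the algorithm of FIMO, MCAST, or any other tool named in the abstract: the function names, the pseudocount of 1, the uniform background of 0.25, the toy count matrix, and the score threshold are all assumptions.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds PWM.

    `counts` is a list of dicts mapping bases to observed counts.
    A uniform background and a flat pseudocount are assumed.
    """
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((column.get(base, 0) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Slide the PWM along the forward strand and report windows scoring
    at or above `threshold` as (start, window, score) tuples.

    The sequence is assumed to contain only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical count matrix for a TATA-like site built from 10 aligned sites.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_pwm = log_odds_pwm(toy_counts)
hits = scan("GGTATAGG", toy_pwm, threshold=5.0)
```

Real scanners usually also score the reverse strand and convert scores to p-values before thresholding; both steps are left out here to keep the sketch short.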
Affiliation(s)
- Narayan Jayaram: Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Daniel Usvyat: Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Andrew C R Martin: Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
|
16
|
Tangirala K, Herndon N, Caragea D. A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification. IEEE Trans Nanobioscience 2016; 15:84-92. [PMID: 26863669 PMCID: PMC6245644 DOI: 10.1109/tnb.2016.2523501] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022]
Abstract
Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification. Our approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently used community detection to identify groups of k-mers that appear frequently in a set of sequences. Whereas this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
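The network construction described in the abstract above can be sketched roughly as follows. This is an assumption-laden illustration, not the authors' pipeline: k-mers are linked when their Hamming distance is at most 1 (the threshold is an assumption), and plain connected components stand in for a real community detection algorithm. The function names are hypothetical.

```python
from itertools import combinations

def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def kmer_groups(sequences, k=4, max_distance=1):
    """Link k-mers within max_distance and return connected components.

    Connected components are a simplified stand-in for community detection.
    """
    nodes = sorted(set().union(*(kmers(s, k) for s in sequences)))
    neighbours = {node: set() for node in nodes}
    for a, b in combinations(nodes, 2):
        if hamming(a, b) <= max_distance:
            neighbours[a].add(b)
            neighbours[b].add(a)
    groups, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups

groups = kmer_groups(["ACGTAC", "ACGAAC", "TTTTTT"], k=4)
```

For protein sequences, the abstract's extension would correspond to replacing `hamming` with a substitution-score comparison; that variant is not sketched here.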
|
17
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 PMCID: PMC4821295 DOI: 10.12688/f1000research.7408.2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Accepted: 02/29/2016] [Indexed: 11/22/2022] Open
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. We also demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
Affiliation(s)
- Caleb Kipkurui Kibet: Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
- Philip Machanick: Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
|
18
|
Kibet CK, Machanick P. Transcription factor motif quality assessment requires systematic comparative analysis. F1000Res 2015; 4:ISCB Comm J-1429. [PMID: 27092243 DOI: 10.12688/f1000research.7408.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Accepted: 11/19/2015] [Indexed: 03/26/2024] Open
Abstract
Transcription factor (TF) binding site prediction remains a challenge in gene regulatory research due to degeneracy and potential variability in binding sites in the genome. Dozens of algorithms designed to learn binding models (motifs) have generated many motifs available in research papers with a subset making it to databases like JASPAR, UniPROBE and Transfac. The presence of many versions of motifs from the various databases for a single TF and the lack of a standardized assessment technique makes it difficult for biologists to make an appropriate choice of binding model and for algorithm developers to benchmark, test and improve on their models. In this study, we review and evaluate the approaches in use, highlight differences and demonstrate the difficulty of defining a standardized motif assessment approach. We review scoring functions, motif length, test data and the type of performance metrics used in prior studies as some of the factors that influence the outcome of a motif assessment. We show that the scoring functions and statistics used in motif assessment influence ranking of motifs in a TF-specific manner. We also show that TF binding specificity can vary by source of genomic binding data. Finally, we demonstrate that information content of a motif is not in isolation a measure of motif quality but is influenced by TF binding behaviour. We conclude that there is a need for an easy-to-use tool that presents all available evidence for a comparative analysis.
Affiliation(s)
- Caleb Kipkurui Kibet: Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
- Philip Machanick: Department of Computer Science and Research Unit in Bioinformatics (RUBi), Rhodes University, Grahamstown, South Africa
|
19
|
Maynou J, Pairó E, Marco S, Perera A. Sequence information gain based motif analysis. BMC Bioinformatics 2015; 16:377. [PMID: 26553056 PMCID: PMC4640167 DOI: 10.1186/s12859-015-0811-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/22/2014] [Accepted: 10/30/2015] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND The detection of regulatory regions in candidate sequences is essential for the understanding of the regulation of a particular gene and the mechanisms involved. This paper proposes a novel methodology based on information theoretic metrics for finding regulatory sequences in promoter regions. RESULTS This methodology (SIGMA) has been tested on genomic sequence data for Homo sapiens and Mus musculus. SIGMA has been compared with different publicly available alternatives for motif detection, such as MEME/MAST, Biostrings (Bioconductor package), MotifRegressor, and previous work such as Q-residuals projections or information theoretic based detectors. Comparative results, in the form of Receiver Operating Characteristic curves, show how, in 70% of the studied Transcription Factor Binding Sites, the SIGMA detector has a better performance and behaves more robustly than the methods compared, while having a similar computational time. The performance of SIGMA can be explained by its parametric simplicity in the modelling of the non-linear co-variability in the binding motif positions. CONCLUSIONS Sequence Information Gain based Motif Analysis is a generalisation of a non-linear model of the cis-regulatory sequences detection based on Information Theory. This generalisation allows us to detect transcription factor binding sites with maximum performance disregarding the covariability observed in the positions of the training set of sequences. SIGMA is freely available to the public at http://b2slab.upc.edu.
Affiliation(s)
- Joan Maynou: Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, Pau Gargallo, 5, Barcelona, 08028, Spain; CIBER de Bioingeniería, Biomateriales y Biomedicina, Spain
- Erola Pairó: Institute for BioEngineering of Catalonia, Baldiri Reixach 4-6, Barcelona, 08028, Spain; Electronics Department in the University of Barcelona (UB), Martí i Franquès, 1, Barcelona, 08028, Spain
- Santiago Marco: Institute for BioEngineering of Catalonia, Baldiri Reixach 4-6, Barcelona, 08028, Spain; Electronics Department in the University of Barcelona (UB), Martí i Franquès, 1, Barcelona, 08028, Spain
- Alexandre Perera: Departament d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, Pau Gargallo, 5, Barcelona, 08028, Spain; CIBER de Bioingeniería, Biomateriales y Biomedicina, Spain
|
20
|
Zhang Y, He Y, Zheng G, Wei C. MOST+: A de novo motif finding approach combining genomic sequence and heterogeneous genome-wide signatures. BMC Genomics 2015; 16 Suppl 7:S13. [PMID: 26099518 PMCID: PMC4474412 DOI: 10.1186/1471-2164-16-s7-s13] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/09/2023] Open
Abstract
Background Motifs are regulatory elements that will activate or inhibit the expression of related genes when proteins (such as transcription factors, TFs) bind to them. Therefore, motif finding is important to understand the mechanisms of gene regulation. De novo discovery of regulatory elements, like transcription factor binding sites (TFBSs), has long been a major challenge to gain insight on mechanisms of gene regulation. Recent advances in experimental profiling of genome-wide signals such as histone modifications and DNase I hypersensitivity sites allow scientists to develop better computational methods to enhance motif discovery. However, existing methods for motif finding suffer from high false positive rates and slow speed, and it's difficult to evaluate the performance of these methods systematically. Result Here we present MOST+, a motif finder integrating genomic sequences and genome-wide signals such as intensity and shape features from histone modification marks and DNase I hypersensitivity sites, to improve the prediction accuracy. MOST+ can detect motifs from a large input sequence of about 100 Mbs within a few minutes. Systematic comparison method has been established and MOST+ has been compared with existing methods. Conclusion MOST+ is a fast and accurate de novo method for motif finding by integrating genomic sequence and experimental signals as clues.
|
21
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 332] [Impact Index Per Article: 33.2] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery: Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA
- Tianyin Zhou: Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang: Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Ana Carolina Dantas Machado: Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Raluca Gordân: Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA
- Remo Rohs: Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
|
22
|
Wong AKC, Lee ESA. Aligning and Clustering Patterns to Reveal the Protein Functionality of Sequences. IEEE/ACM Trans Comput Biol Bioinform 2014; 11:548-560. [PMID: 26356022 DOI: 10.1109/tcbb.2014.2306840] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 06/05/2023]
Abstract
Discovering sequence patterns with variations unveils significant functions of a protein family. Existing combinatorial methods of discovering patterns with variations are computationally expensive, and probabilistic methods require more elaborate probabilistic representation of the amino acid associations. To overcome these shortcomings, this paper presents a new computationally efficient method for representing patterns with variations in a compact representation called Aligned Pattern Cluster (AP Cluster). To tackle the runtime, our method discovers a shortened list of non-redundant statistically significant sequence associations based on our previous work. To address the representation of protein functional regions, our pattern alignment and clustering step, presented in this paper, captures the conservations and variations of the aligned patterns. We further refine our solution to allow more coverage of sequences via extending the AP Clusters containing only statistically significant patterns to Weak and Conserved AP Clusters. When applied to the cytochrome c, the ubiquitin, and the triosephosphate isomerase protein families, our algorithm identifies the binding segments as well as the binding residues. When compared to other methods, ours discovers all binding sites in the AP Clusters with superior entropy and coverage. The identification of patterns with variations helps biologists to avoid time-consuming simulations and experimentations. (Software available upon request).
|
23
|
Carvalho L. Bayesian centroid estimation for motif discovery. PLoS One 2013; 8:e80511. [PMID: 24324603 PMCID: PMC3855595 DOI: 10.1371/journal.pone.0080511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2013] [Accepted: 10/03/2013] [Indexed: 11/29/2022] Open
Abstract
Biological sequences may contain patterns that signal important biomolecular functions; a classical example is regulation of gene expression by transcription factors that bind to specific patterns in genomic promoter regions. In motif discovery we are given a set of sequences that share a common motif and aim to identify not only the motif composition, but also the binding sites in each sequence of the set. We propose a new centroid estimator that arises from a refined and meaningful loss function for binding site inference. We discuss the main advantages of centroid estimation for motif discovery, including computational convenience, and how its principled derivation offers further insights about the posterior distribution of binding site configurations. We also illustrate, using simulated and real datasets, that the centroid estimator can differ from the traditional maximum a posteriori or maximum likelihood estimators.
Affiliation(s)
- Luis Carvalho: Department of Mathematics and Statistics, Boston University, Boston, Massachusetts, United States of America
|
24
|
Transcription of Tnfaip3 is regulated by NF-κB and p38 via C/EBPβ in activated macrophages. PLoS One 2013; 8:e73153. [PMID: 24023826 PMCID: PMC3759409 DOI: 10.1371/journal.pone.0073153] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 02/19/2013] [Accepted: 07/17/2013] [Indexed: 11/19/2022] Open
Abstract
Macrophages play a pivotal role in the immune system through recognition and elimination of microbial pathogens. Toll-like receptors (TLRs) on macrophages interact with microbial substances and initiate signal transduction through intracellular adapters. TLR4, which recognizes the lipopolysaccharides (LPS) on Gram-negative bacteria, triggers downstream signaling mediators and eventually activates IκB kinase (IKK) complex and mitogen-activated protein kinases (MAPKs) such as p38. Previous reports revealed that, in addition to NF-κB, a core transcription factor of the innate immune response, the induction of some LPS-induced genes in macrophages required another transcription factor whose activity depends on p38. However, these additional transcription factors remain to be identified. In order to identify p38-activated transcription factors that cooperate with NF-κB in response to LPS stimulation, microarrays were used to identify genes regulated by both NF-κB and p38 using wild-type, IKK-depleted, and p38 inhibitor-treated mouse bone marrow-derived macrophages (BMDMs). In silico analysis of transcription factor binding sites was used to predict the potential synergistic transcription factors from the co-expressed genes. Among these genes, NF-κB and C/EBPβ, a p38 downstream transcription factor, were predicted to co-regulate genes in LPS-stimulated BMDMs. Based on the subsequent results of a chromatin immunoprecipitation assay and TNFAIP3 expression in C/EBPβ-ablated macrophages, we demonstrated that Tnfaip3 is regulated by both NF-κB and p38-dependent C/EBPβ. These results identify a novel regulatory mechanism in TLR4-mediated innate immunity.
|
25
|
Triska M, Grocutt D, Southern J, Murphy DJ, Tatarinova T. cisExpress: motif detection in DNA sequences. Bioinformatics 2013; 29:2203-5. [PMID: 23793750 DOI: 10.1093/bioinformatics/btt366] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Indexed: 11/14/2022]
Abstract
MOTIVATION One of the major challenges for contemporary bioinformatics is the analysis and accurate annotation of genomic datasets to enable extraction of useful information about the functional role of DNA sequences. This article describes a novel genome-wide statistical approach to the detection of specific DNA sequence motifs based on similarities between the promoters of similarly expressed genes. This new tool, cisExpress, is especially designed for use with large datasets, such as those generated by publicly accessible whole genome and transcriptome projects. cisExpress uses a task farming algorithm to exploit all available computational cores within a shared memory node. We demonstrate the robust nature and validity of the proposed method. It is applicable for use with a wide range of genomic databases for any species of interest. AVAILABILITY cisExpress is available at www.cisexpress.org.
Affiliation(s)
- Martin Triska: Genomics and Computational Biology Research Group, Faculty of Computing, Engineering and Science, University of South Wales, Pontypridd, UK
|
26
|
Leibovich L, Paz I, Yakhini Z, Mandel-Gutfreund Y. DRIMust: a web server for discovering rank imbalanced motifs using suffix trees. Nucleic Acids Res 2013; 41:W174-9. [PMID: 23685432 PMCID: PMC3692051 DOI: 10.1093/nar/gkt407] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/12/2022] Open
Abstract
Cellular regulation mechanisms that involve proteins and other active molecules interacting with specific targets often involve the recognition of sequence patterns. Short sequence elements on DNA, RNA and proteins play a central role in mediating such molecular recognition events. Studies that focus on measuring and investigating sequence-based recognition processes make use of statistical and computational tools that support the identification and understanding of sequence motifs. We present a new web application, named DRIMust, freely accessible through the website http://drimust.technion.ac.il for de novo motif discovery services. The DRIMust algorithm is based on the minimum hypergeometric statistical framework and uses suffix trees for an efficient enumeration of motif candidates. DRIMust takes as input ranked lists of sequences in FASTA format and returns motifs that are over-represented at the top of the list, where the determination of the threshold that defines top is data driven. The resulting motifs are presented individually with an accurate P-value indication and as a Position Specific Scoring Matrix. Comparing DRIMust with other state-of-the-art tools demonstrated significant advantage to DRIMust, both in result accuracy and in short running times. Overall, DRIMust is unique in combining efficient search on large ranked lists with rigorous P-value assessment for the detected motifs.
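To make the "minimum hypergeometric" idea in the abstract above concrete, here is a minimal sketch of the raw mHG score for a ranked list. It is not DRIMust's implementation: it assumes the motif occurrences are already given as a 0/1 flag per ranked sequence, it omits the suffix-tree enumeration of candidates, and the returned minimum is a score that still needs correction for multiple cutoffs before it can be read as a P-value. The function names are hypothetical.

```python
from math import comb

def hypergeom_tail(N, B, n, b):
    """P(X >= b) when drawing n items from N, of which B are marked."""
    return sum(
        comb(B, i) * comb(N - B, n - i) for i in range(b, min(n, B) + 1)
    ) / comb(N, n)

def mhg_score(flags):
    """Minimum hypergeometric tail over all top-n cutoffs of a ranked 0/1 list.

    Returns (score, cutoff); the cutoff is chosen by the data, as in the
    abstract's "data driven" threshold. The score is uncorrected.
    """
    N, B = len(flags), sum(flags)
    best, best_n, hits = 1.0, 0, 0
    for n, flag in enumerate(flags, start=1):
        hits += flag  # motif occurrences among the top n sequences
        tail = hypergeom_tail(N, B, n, hits)
        if tail < best:
            best, best_n = tail, n
    return best, best_n

# 1 = the ranked sequence contains the candidate motif (invented data).
score, cutoff = mhg_score([1, 1, 1, 0, 0, 0, 0, 0])
```

In this toy list all three occurrences sit at the top, so the minimum is reached at the cutoff of three sequences.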
Affiliation(s)
- Limor Leibovich: Department of Computer Science, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel
|
27
|
Orenstein Y, Linhart C, Shamir R. Assessment of algorithms for inferring positional weight matrix motifs of transcription factor binding sites using protein binding microarray data. PLoS One 2012; 7:e46145. [PMID: 23029415 PMCID: PMC3460961 DOI: 10.1371/journal.pone.0046145] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 07/23/2012] [Accepted: 08/27/2012] [Indexed: 01/05/2023] Open
Abstract
The new technology of protein binding microarrays (PBMs) allows simultaneous measurement of the binding intensities of a transcription factor to tens of thousands of synthetic double-stranded DNA probes, covering all possible 10-mers. A key computational challenge is inferring the binding motif from these data. We present a systematic comparison of four methods developed specifically for reconstructing a binding site motif represented as a positional weight matrix from PBM data. The reconstructed motifs were evaluated in terms of three criteria: concordance with reference motifs from the literature and ability to predict in vivo and in vitro bindings. The evaluation encompassed over 200 transcription factors and some 300 assays. The results show a tradeoff between how the methods perform according to the different criteria, and a dichotomy of method types. Algorithms that construct motifs with low information content predict PBM probe ranking more faithfully, while methods that produce highly informative motifs match reference motifs better. Interestingly, in predicting high-affinity binding, all methods give far poorer results for in vivo assays compared to in vitro assays.
Affiliation(s)
- Yaron Orenstein: Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Chaim Linhart: Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
- Ron Shamir: Blavatnik School of Computer Science, Tel-Aviv University, Tel-Aviv, Israel
|
28
|
Lee C, Huang CH. Searching for transcription factor binding sites in vector spaces. BMC Bioinformatics 2012; 13:215. [PMID: 23244338 PMCID: PMC3543194 DOI: 10.1186/1471-2105-13-215] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 01/29/2012] [Accepted: 08/16/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Computational approaches to transcription factor binding site identification have been actively researched in the past decade. Learning from known binding sites, new binding sites of a transcription factor in unannotated sequences can be identified. A number of search methods have been introduced over the years. However, one can rarely find one single method that performs the best on all the transcription factors. Instead, to identify the best method for a particular transcription factor, one usually has to compare a handful of methods. Hence, it is highly desirable for a method to perform automatic optimization for individual transcription factors. RESULTS We proposed to search for transcription factor binding sites in vector spaces. This framework allows us to identify the best method for each individual transcription factor. We further introduced two novel methods, the negative-to-positive vector (NPV) and optimal discriminating vector (ODV) methods, to construct query vectors to search for binding sites in vector spaces. Extensive cross-validation experiments showed that the proposed methods significantly outperformed the ungapped likelihood under positional background method, a state-of-the-art method, and the widely-used position-specific scoring matrix method. We further demonstrated that motif subtypes of a TF can be readily identified in this framework and two variants called the kNPV and kODV methods benefited significantly from motif subtype identification. Finally, independent validation on ChIP-seq data showed that the ODV and NPV methods significantly outperformed the other compared methods. CONCLUSIONS We conclude that the proposed framework is highly flexible. It enables the two novel methods to automatically identify a TF-specific subspace to search for binding sites. Implementations are available as source code at: http://biogrid.engr.uconn.edu/tfbs_search/.
Affiliation(s)
- Chih Lee: Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA
- Chun-Hsi Huang: Department of Computer Science and Engineering, University of Connecticut, Fairfield Road, Storrs, CT 06269, USA
|
29
|
Prosperi MCF, Prosperi L, Gray RR, Salemi M. On counting the frequency distribution of string motifs in molecular sequences. Int J Biomath 2012. [DOI: 10.1142/s1793524512500556] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
This work investigates frequency distributions of strings within a text. The mathematical derivation accounts for variable alphabet size, character probabilities, and string/text lengths, under both the Bernoullian and the Markovian model for string generation. The analysis is limited to the set of non-clumpable strings, which cannot overlap with themselves. Two formulae (exact and approximated) are derived, calculating the frequency distribution of a string of length m found inside a text of length n (with m < n). The approximated formula has a constant complexity (in contrast to the exponential complexity of the exact formula), which makes it applicable to very long texts. The proposed formulae were applied to analyze string frequencies in a portion of the human genome, and to recalculate frequencies of known repeated motifs within genes associated with genetic diseases. A comparison with state-of-the-art methods was provided. The formulae presented here can be of use in the statistical evaluation of specific motif frequencies within very long texts (e.g. genes or genomes) and help in characterizing motifs in pathologic conditions.
Affiliation(s)
- Mattia C. F. Prosperi: Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, P. O. Box 103633, 2055 Mowry Road, Gainesville, FL 32610-3633, USA
- Marco Salemi: Department of Pathology, Immunology and Laboratory Medicine, College of Medicine, Emerging Pathogens Institute, University of Florida, P. O. Box 103633, 2055 Mowry Road, Gainesville, FL 32610-3633, USA
|
30
|
Deyneko IV, Weiss S, Leschner S. An integrative computational approach to effectively guide experimental identification of regulatory elements in promoters. BMC Bioinformatics 2012; 13:202. [PMID: 22897887 PMCID: PMC3465240 DOI: 10.1186/1471-2105-13-202] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 01/23/2012] [Accepted: 08/01/2012] [Indexed: 01/22/2023] Open
Abstract
Background Transcriptional activity of genes depends on many factors, such as DNA motifs, conformational characteristics of DNA and melting, and there are computational approaches for their identification. However, in real applications, the number of predicted elements, for example DNA motifs, may be very large. In cases when various computational programs are applied, systematic experimental knockout of each of the potential elements obviously becomes nonproductive. Hence, one needs an approach that is able to integrate many heterogeneous computational methods and upon that suggest selected regulatory elements for experimental verification. Results Here, we present an integrative bioinformatic approach aimed at the discovery of regulatory modules that can be effectively verified experimentally. It is based on combinatorial analysis of known and novel binding motifs, as well as of any other known features of promoters. The goal of this method is the identification of a collection of modules that are specific for an established dataset and at the same time are optimal for experimental verification. The method is particularly effective on small datasets, where most statistical approaches fail. We apply it to promoters that drive tumor-specific gene expression in tumor-colonizing Gram-negative bacteria. The method successfully identified a number of potential modules, which required only a few experiments to be verified. The resulting minimal functional bacterial promoter exhibited high specificity of expression in cancerous tissue. Conclusions Experimental analysis of promoter structures guided by bioinformatics has proved to be efficient. The developed computational method is able to include heterogeneous features of promoters and suggest combinatorial modules for experimental testing. Expansibility and robustness of the methodology implemented in the approach ensures good results for a wide range of problems.
Affiliation(s)
- Igor V Deyneko
- Molecular Immunology, Helmholtz Centre for Infection Research, Inhoffenstr. 7, 38124 Braunschweig, Germany.
|
31
|
Cserháti M, Turóczy Z, Dudits D, Györgyey J. The rice word landscape: a detailed catalogue of the rice motif content in the non-coding regions. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2012; 16:334-42. [PMID: 22702246 DOI: 10.1089/omi.2011.0056] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022]
Abstract
Among the different areas of molecular biology concerning the detailed study of different parts of the cell, such as genomics, proteomics, and metabolomics, different new areas of study are emerging which entail the analysis of different parts of the genome, such as the prediction of genes or different kinds of transcription factor binding sites (TFBSs). The goal of this study was to construct and analyze a catalogue of all statistically relevant putative functional octamer words or motifs (which we have termed the "motifome" of a given organism) found within first introns, promoters, the 5' and 3' untranslated regions (UTRs), and the entire genome of japonica rice, and compare them to results attained from a previous analysis performed on the Arabidopsis genome. We found a number of novel motifs in different sets of non-coding rice sequence sets. The diversity of motifs in rice was higher in Arabidopsis, implicating a higher mutation turnover. While common motifs were found between the two species, motif pairs were missing, showing the difference between the regulatory machinery between rice and Arabidopsis.
Affiliation(s)
- Mátyás Cserháti
- Institute of Plant Biology, Biological Research Center, Szeged, Hungary.
|
32
|
Cserháti M, Turóczy Z, Dudits D, Györgyey J. The rice word landscape--a detailed catalog of the rice motif content in the noncoding regions. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2012; 15:819-28. [PMID: 22122670 DOI: 10.1089/omi.2011.0132] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 12/20/2022]
Abstract
Among the different areas of molecular biology concerning the detailed study of different parts of the cell such as genomics, proteomics, or metabolomics, different new areas of study are emerging that entail the analysis of different parts of the genome such as the prediction of genes or different kinds of transcription factor binding sites (TFBSs). The goal of this study is to draw up and analyze a catalog of all statistically relevant putative functional octamer words or motifs found within first introns, promoters, the 5' and 3' UTRs, and the entire genome of japonica rice and compare them to results attained from a previous analysis performed on the Arabidopsis genome. We found a number of novel motifs in different sets of noncoding rice sequence sets. The diversity of motifs in rice was higher in Arabidopsis, implicating a higher mutation turnover. Although common motifs were found between the two species, motif pairs were missing, showing the difference between the regulatory machinery between rice and Arabidopsis.
Affiliation(s)
- Mátyás Cserháti
- Institute of Plant Biology, Biological Research Center, Szeged, Hungary.
|
33
|
Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform 2012; 14:225-37. [PMID: 22517426 PMCID: PMC3603212 DOI: 10.1093/bib/bbs016] [Citation(s) in RCA: 93] [Impact Index Per Article: 7.8] [Indexed: 02/07/2023] Open
Abstract
Motif discovery has been one of the most widely studied problems in bioinformatics ever since genomic and protein sequences have been available. In particular, its application to the de novo prediction of putative over-represented transcription factor binding sites in nucleotide sequences has been, and still is, one of the most challenging flavors of the problem. Recently, novel experimental techniques like chromatin immunoprecipitation (ChIP) have been introduced, permitting the genome-wide identification of protein-DNA interactions. ChIP, applied to transcription factors and coupled with genome tiling arrays (ChIP on Chip) or next-generation sequencing technologies (ChIP-Seq) has opened new avenues in research, as well as posed new challenges to bioinformaticians developing algorithms and methods for motif discovery.
|
34
|
Pairó E, Maynou J, Marco S, Perera A. A subspace method for the detection of transcription factor binding sites. Bioinformatics 2012; 28:1328-35. [DOI: 10.1093/bioinformatics/bts147] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022] Open
|
35
|
Leibovich L, Yakhini Z. Efficient motif search in ranked lists and applications to variable gap motifs. Nucleic Acids Res 2012; 40:5832-47. [PMID: 22416066 PMCID: PMC3401424 DOI: 10.1093/nar/gks206] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Indexed: 11/23/2022] Open
Abstract
Sequence elements, at all levels—DNA, RNA and protein, play a central role in mediating molecular recognition and thereby molecular regulation and signaling. Studies that focus on measuring and investigating sequence-based recognition make use of statistical and computational tools, including approaches to searching sequence motifs. State-of-the-art motif searching tools are limited in their coverage and ability to address large motif spaces. We develop and present statistical and algorithmic approaches that take as input ranked lists of sequences and return significant motifs. The efficiency of our approach, based on suffix trees, allows searches over motif spaces that are not covered by existing tools. This includes searching variable gap motifs—two half sites with a flexible length gap in between—and searching long motifs over large alphabets. We used our approach to analyze several high-throughput measurement data sets and report some validation results as well as novel suggested motifs and motif refinements. We suggest a refinement of the known estrogen receptor 1 motif in humans, where we observe gaps other than three nucleotides that also serve as significant recognition sites, as well as a variable length motif related to potential tyrosine phosphorylation.
Affiliation(s)
- Limor Leibovich
- Department of Computer Science, Technion-Israel Institute of Technology, Haifa, 32000, Israel
|
36
|
Wang D, Do HT. Computational localization of transcription factor binding sites using extreme learning machines. Soft Comput 2012. [DOI: 10.1007/s00500-012-0820-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/27/2023]
|
37
|
Vijayvargiya S, Shukla P. A niched Pareto genetic algorithm for finding variable length regulatory motifs in DNA sequences. 3 Biotech 2011. [PMCID: PMC3376862 DOI: 10.1007/s13205-011-0040-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022] Open
Abstract
Transcription factor binding sites, also called motifs, are short, recurring patterns in DNA sequences that are presumed to have a biological function. Identifying motifs in the promoter regions of genes is an important and unsolved problem, particularly in eukaryotic genomes. In this paper, we present a niched Pareto genetic algorithm to identify regulatory motifs. The approach maximizes two objectives: the motif length and the consensus similarity score. A longer motif is less likely to be a false motif. The similarity score represents a motif's probability of conservation in a given set of sequences. The proposed method can find multiple, variable-length motifs. In this method, we represent a candidate motif as a combination of the length and starting position of the motif in each sequence of the co-regulated genes. This enables the algorithm to identify multiple motifs of variable length. We applied this approach to various data sets, and the results show that it can find multiple motifs of variable length in co-regulated genes.
|
38
|
Ichinose N, Yada T, Gotoh O. Large-scale motif discovery using DNA Gray code and equiprobable oligomers. Bioinformatics 2011; 28:25-31. [PMID: 22057160 PMCID: PMC3244767 DOI: 10.1093/bioinformatics/btr606] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.0] [Indexed: 11/25/2022]
Abstract
Motivation: How to find motifs from genome-scale functional sequences, such as all the promoters in a genome, is a challenging problem. Word-based methods count the occurrences of oligomers to detect excessively represented ones. This approach is known to be fast and accurate compared with other methods. However, two problems have hampered the application of such methods to large-scale data. One is the computational cost necessary for clustering similar oligomers, and the other is the bias in the frequency of fixed-length oligomers, which complicates the detection of significant words. Results: We introduce a method that uses a DNA Gray code and equiprobable oligomers, which solve the clustering problem and the oligomer bias, respectively. Our method can analyze 18 000 sequences of ~1 kbp long in 30 s. We also show that the accuracy of our method is superior to that of a leading method, especially for large-scale data and small fractions of motif-containing sequences. Availability: The online and stand-alone versions of the application, named Hegma, are available at our website: http://www.genome.ist.i.kyoto-u.ac.jp/~ichinose/hegma/ Contact: ichinose@i.kyoto-u.ac.jp; o.gotoh@i.kyoto-u.ac.jp
Affiliation(s)
- Natsuhiro Ichinose
- Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo-ku, Kyoto 606-8501, Japan
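The word-based idea summarized in the abstract above (counting oligomer occurrences and flagging over-represented ones) can be illustrated with a minimal Python sketch. This is not the Hegma method: it uses neither the DNA Gray code nor equiprobable oligomers, and it assumes a uniform background, which is exactly the kind of simplification that the abstract says causes frequency bias. The function names `count_kmers` and `overrepresented` and the `min_ratio` parameter are hypothetical.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer (word) across all input sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def overrepresented(sequences, k, min_ratio=2.0):
    """Return k-mers whose observed/expected ratio is at least min_ratio.

    Assumption: the expected count comes from a uniform background
    (each of the 4**k words equally likely), a deliberately naive model.
    """
    counts = count_kmers(sequences, k)
    total = sum(counts.values())
    expected = total / 4 ** k
    hits = [w for w, c in counts.items() if c / expected >= min_ratio]
    # Most frequent words first; ties broken alphabetically.
    return sorted(hits, key=lambda w: (-counts[w], w))
```

Calling `overrepresented(promoters, 8)` on a list of promoter strings would return candidate octamers; a realistic analysis would replace the uniform background with one estimated from the genome.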
|
39
|
Linhart C, Halperin Y, Darom A, Kidron S, Broday L, Shamir R. A novel candidate cis-regulatory motif pair in the promoters of germline and oogenesis genes in C. elegans. Genome Res 2011; 22:76-83. [PMID: 21930893 DOI: 10.1101/gr.115626.110] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 02/03/2023]
Abstract
In this study we report on a novel pair of cis-regulatory motifs in promoter sequences of the nematode Caenorhabditis elegans. The motif pair exhibits extraordinary genomic traits: The order and the orientation of the two motifs are highly specific, and the distance between them is almost always one of two frequent distances. In contrast, the sequence between the motifs is variable across occurrences. Thus, the motif pair constitutes a nearly combinatorial sequence configuration. We further show that this module is conserved among, and unique to, the entire Caenorhabditis genus. By analyzing several gene expression data sets, our data suggest that this motif pair may function in germline development, oogenesis, and early embryogenesis. Finally, we verify that the motifs are indeed functional cis-regulatory elements using reporter constructs in transgenic C. elegans.
Affiliation(s)
- Chaim Linhart
- School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
|
40
|
Tree-based position weight matrix approach to model transcription factor binding site profiles. PLoS One 2011; 6:e24210. [PMID: 21912677 PMCID: PMC3166302 DOI: 10.1371/journal.pone.0024210] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 03/22/2011] [Accepted: 08/02/2011] [Indexed: 11/30/2022] Open
Abstract
Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependency structure and the accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM-based Discriminative Approach), to construct the TPWM from ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, has consistently better performance than existing tools in detecting TFBSs. The improved accuracy is the result of modelling the complete dependency structure of the motifs and better prediction of the true positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
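The abstract above states that the score cutoff is chosen using the Matthews correlation coefficient. A minimal Python sketch of that general idea follows, assuming we already have motif scores for a positive and a negative set of sequences. It is not the TPD implementation; the function names `mcc` and `best_cutoff` and the rule "call a sequence positive when its score is at or above the cutoff" are assumptions made for illustration.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_cutoff(pos_scores, neg_scores):
    """Try every observed score as a cutoff and keep the one maximizing MCC.

    Assumption: a sequence is predicted positive when score >= cutoff.
    Returns (cutoff, mcc_value); ties keep the lowest cutoff.
    """
    best_value, best_cut = float("-inf"), None
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        value = mcc(tp, fp, tn, fn)
        if value > best_value:
            best_value, best_cut = value, cutoff
    return best_cut, best_value
```

MCC is a reasonable selection criterion here because it stays informative when the positive and negative sets differ greatly in size, which is typical for ChIP-Seq peaks versus background regions.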
|
41
|
Zhang S, Li S, Niu M, Pham PT, Su Z. MotifClick: prediction of cis-regulatory binding sites via merging cliques. BMC Bioinformatics 2011; 12:238. [PMID: 21679436 PMCID: PMC3225181 DOI: 10.1186/1471-2105-12-238] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 12/23/2010] [Accepted: 06/16/2011] [Indexed: 11/21/2022] Open
Abstract
Background: Although dozens of algorithms and tools have been developed to find a set of cis-regulatory binding sites called a motif in a set of intergenic sequences using various approaches, most of these tools focus on identifying binding sites that are significantly different from their background sequences. However, some motifs may have a similar nucleotide distribution to that of their background sequences. Therefore, such binding sites can be missed by these tools. Results: Here, we present a graph-based polynomial-time algorithm, MotifClick, for the prediction of cis-regulatory binding sites, in particular, those that have a similar nucleotide distribution to that of their background sequences. To find binding sites with length k, we construct a graph using some 2(k-1)-mers in the input sequences as the vertices, and connect two vertices by an edge if the maximum number of matches of the local gapless alignments between the two 2(k-1)-mers is greater than a cutoff value. We identify a motif as a set of similar k-mers from a merged group of maximum cliques associated with some vertices. Conclusions: When evaluated on both synthetic and real datasets of prokaryotes and eukaryotes, MotifClick outperforms existing leading motif-finding tools for prediction accuracy and balancing the prediction sensitivity and specificity in general. In particular, when the distribution of nucleotides of binding sites is similar to that of their background sequences, MotifClick is more likely to identify the binding sites than the other tools.
Affiliation(s)
- Shaoqiang Zhang
- Department of Bioinformatics and Genomics, Center for Bioinformatics Research, the University of North Carolina at Charlotte, 28223, USA
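The graph construction described in the MotifClick abstract above (2(k-1)-mers as vertices, an edge when the best gapless alignment exceeds a cutoff) can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: it uses all 2(k-1)-mers rather than "some" of them, it reads "maximum number of matches of the local gapless alignments" as the best match count over every ungapped offset, and it omits the clique-finding and merging steps entirely. The names `max_gapless_matches` and `build_graph` are hypothetical.

```python
from itertools import combinations


def max_gapless_matches(a, b):
    """Largest number of identical positions over all ungapped offsets of b against a."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        matches = sum(
            1
            for i, ch in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == ch
        )
        best = max(best, matches)
    return best


def build_graph(sequences, k, cutoff):
    """Vertices are all distinct 2(k-1)-mers; edges join pairs scoring above cutoff."""
    width = 2 * (k - 1)
    vertices = sorted(
        {s[i:i + width] for s in sequences for i in range(len(s) - width + 1)}
    )
    edges = {
        (u, v)
        for u, v in combinations(vertices, 2)
        if max_gapless_matches(u, v) > cutoff
    }
    return vertices, edges
```

Note that overlapping windows from the same sequence are trivially similar under this reading, so a practical version would need a rule for excluding or down-weighting them; the abstract does not say how the original tool handles this.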
|
42
|
Zheng X, Liu T, Yang Z, Wang J. Large cliques in Arabidopsis gene coexpression network and motif discovery. JOURNAL OF PLANT PHYSIOLOGY 2011; 168:611-618. [PMID: 21044807 DOI: 10.1016/j.jplph.2010.09.010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/29/2010] [Revised: 08/31/2010] [Accepted: 09/06/2010] [Indexed: 05/30/2023]
Abstract
Identification of cis-regulatory elements in Arabidopsis is a key step to understanding its transcriptional regulation scheme. In this study, the Arabidopsis gene coexpression network was constructed using the ATTED-II data, and thereafter a subgraph-induced approach and clique-finding algorithm were used to extract gene coexpression groups from the gene coexpression network. A total of 23 large coexpression gene groups were obtained, with each consisting of more than 100 highly correlated genes. Four classical tools were used to predict motifs in the promoter regions of coexpressed genes. Consequently, we detected a large number of candidate biologically relevant regulatory elements, and many of them are consistent with known cis-regulatory elements from AGRIS and AthaMap. Experiments on coexpressed groups, including E2Fa target genes, showed that our method had a high probability of returning the real binding motif. Our study provides the basis for future cis-regulatory module analysis and creates a starting point to unravel regulatory networks of Arabidopsis thaliana.
Affiliation(s)
- Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai 200234, China
|
43
|
Cserháti M, Turóczy Z, Zombori Z, Cserzo M, Dudits D, Pongor S, Györgyey J. Prediction of new abiotic stress genes in Arabidopsis thaliana and Oryza sativa according to enumeration-based statistical analysis. Mol Genet Genomics 2011; 285:375-91. [PMID: 21437642 DOI: 10.1007/s00438-011-0605-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/03/2010] [Accepted: 01/31/2011] [Indexed: 10/18/2022]
Abstract
Plants undergo an extensive change in gene regulation during abiotic stress. It is of great agricultural importance to know which genes are affected during stress response. The genome sequences of a number of plant species have been determined, among them Arabidopsis and Oryza sativa, whose genomes have been annotated most completely to date and which are well-known organisms widely used as experimental systems. This paper applies a statistical algorithm for predicting new stress-induced motifs and genes by analyzing promoter sets co-regulated by abiotic stress in these two species. After identifying characteristic putative regulatory motif sequence pairs (dyads) in the promoters of 125 stress-regulated Arabidopsis genes and 87 O. sativa genes, these dyads were used to screen the entire Arabidopsis and O. sativa promoteromes to find related stress-induced genes whose promoters contained a large number of the dyads found by our algorithm. We were able to predict a number of putative dyads characteristic of a large number of stress-regulated genes, some of them newly discovered by our algorithm, which may serve as putative transcription factor binding sites. Our new motif prediction algorithm comes complete with a stand-alone program. This algorithm may be used for motif discovery in other species in the future. The more than 1,200 Arabidopsis and 1,700 Oryza sativa genes found by our algorithm are good candidates for further experimental studies in abiotic stress.
Affiliation(s)
- Mátyás Cserháti
- Biological Research Center, Institute of Plant Biology, Hungarian Academy of Sciences, P.O. BOX 521, Temesvári Krt. 62, 6701 Szeged, Hungary.
|
44
|
Chen RM, Hou MT, Chang NW, Chen YT, Tsai JJP. Cumulative Spectral Repeat Finder (CSRF): a spectral approach for identifying the length of repeats in DNA sequences. INT J ARTIF INTELL T 2011. [DOI: 10.1142/s0218213011000073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Repetitive sequences of DNA are meaningful and of great importance to human functions. Previous researchers have proposed various methods to discover repetitive sequences in DNA. However, the unknown lengths of repetitive sequences are usually guessed at random or determined by rules of thumb rather than by a systematic criterion. We propose a new algorithm based on the cumulative Fourier spectral content of a DNA sequence to identify the candidate lengths of repetitive sequences, or repeats, in DNA sequences. After the candidate lengths of repeats are known, one can identify the repeats and their copy numbers using an exact method. Both simulated and real datasets are used to illustrate the performance of the proposed algorithm. The results are also compared with those of two well-known methods, Spectral Repeat Finder (SRF) and the Gibbs sampler. Furthermore, we demonstrate the use of CSRF in some well-known repeat-finding methods such as SRF, the Gibbs sampler, and MEME.
Affiliation(s)
- R. M. CHEN
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- M. T. HOU
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- N. W. CHANG
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- Y. T. CHEN
- Department of Computer Science and Information Engineering, National University of Tainan, Tainan, Taiwan 70005, Taiwan
- JEFFREY J. P. TSAI
- Department of Computer Science, University of Illinois, Chicago, Chicago, IL 60607, USA
- Department of Bioinformatics, Asia University, Taichung, Taiwan 41354, Taiwan
|
45
|
When needles look like hay: how to find tissue-specific enhancers in model organism genomes. Dev Biol 2010; 350:239-54. [PMID: 21130761 DOI: 10.1016/j.ydbio.2010.11.026] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Received: 04/14/2010] [Revised: 11/11/2010] [Accepted: 11/22/2010] [Indexed: 01/22/2023]
Abstract
A major prerequisite for the investigation of tissue-specific processes is the identification of cis-regulatory elements. No generally applicable technique is available to distinguish them from any other type of genomic non-coding sequence. Therefore, researchers often have to identify these elements by elaborate in vivo screens, testing individual regions until the right one is found. Here, based on many examples from the literature, we summarize how functional enhancers have been isolated from other elements in the genome and how they have been characterized in transgenic animals. Covering computational and experimental studies, we provide an overview of the global properties of cis-regulatory elements, like their specific interactions with promoters and target gene distances. We describe conserved non-coding elements (CNEs) and their internal structure, nucleotide composition, binding site clustering and overlap, with a special focus on developmental enhancers. Conflicting data and unresolved questions on the nature of these elements are highlighted. Our comprehensive overview of the experimental shortcuts that have been found in the different model organism communities and the new field of high-throughput assays should help during the preparation phase of a screen for enhancers. The review is accompanied by a list of general guidelines for such a project.
|
46
|
Oberto J. FITBAR: a web tool for the robust prediction of prokaryotic regulons. BMC Bioinformatics 2010; 11:554. [PMID: 21070640 PMCID: PMC3098098 DOI: 10.1186/1471-2105-11-554] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 06/29/2010] [Accepted: 11/11/2010] [Indexed: 11/24/2022] Open
Abstract
Background: The binding of regulatory proteins to their specific DNA targets determines the accurate expression of the neighboring genes. The in silico prediction of new binding sites in completely sequenced genomes is a key aspect in the deeper understanding of gene regulatory networks. Several algorithms have been described to discriminate against false positives in the prediction of new binding targets; however, none of them has been implemented so far to assist the detection of binding sites at the genomic scale. Results: FITBAR (Fast Investigation Tool for Bacterial and Archaeal Regulons) is a web service designed to identify new protein binding sites on fully sequenced prokaryotic genomes. This tool consists of a workbench where the significance of the predictions can be compared using different statistical methods, a feature not found in existing resources. The Local Markov Model and the Compound Importance Sampling algorithms have been implemented to compute the P-value of newly discovered binding sites. In addition, FITBAR provides two optimized genomic scanning algorithms using either log-odds or entropy-weighted position-specific scoring matrices. Other significant features include the production of a detailed genomic context map for each detected binding site and the export of the search results in spreadsheet and portable document formats. FITBAR discovery of a high affinity Escherichia coli NagC binding site was validated experimentally in vitro as well as in vivo and published. Conclusions: FITBAR was developed in order to allow fast, accurate and statistically robust predictions of prokaryotic regulons. This feature constitutes the main advantage of this web tool over other matrix search programs and does not impair its performance. The web service is available at http://archaea.u-psud.fr/fitbar.
Affiliation(s)
- Jacques Oberto
- Université Paris-Sud 11, Centre National de la Recherche Scientifique, UMR 8621, Institut de Génétique et Microbiologie, Orsay, France.
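The FITBAR abstract above mentions genomic scanning with log-odds position-specific scoring matrices. A minimal Python sketch of that general technique follows. It is not FITBAR's code: the pseudocount, the uniform background, forward-strand-only scanning, and the function names `build_log_odds_pssm` and `scan` are all assumptions made for illustration, and the P-value computations named in the abstract are not attempted.

```python
import math


def build_log_odds_pssm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds matrix from equal-length aligned binding sites.

    Assumptions: DNA alphabet ACGT, uniform background unless given,
    and an additive pseudocount to avoid log(0).
    """
    alphabet = "ACGT"
    background = background or {base: 0.25 for base in alphabet}
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    pssm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pssm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background[base])
            for base in alphabet
        })
    return pssm


def scan(sequence, pssm):
    """Score every window on the forward strand; returns (start, score) pairs.

    Characters outside ACGT (for example N) are not handled in this sketch.
    """
    width = len(pssm)
    return [
        (i, sum(col[sequence[i + j]] for j, col in enumerate(pssm)))
        for i in range(len(sequence) - width + 1)
    ]
```

For example, `max(scan(genome, pssm), key=lambda hit: hit[1])` would give the best-scoring window; a real regulon search would also scan the reverse complement and attach a significance estimate to each hit.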
|
47
|
Li G, Chan TM, Leung KS, Lee KH. A cluster refinement algorithm for motif discovery. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2010; 7:654-668. [PMID: 21030733 DOI: 10.1109/tcbb.2009.25] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Indexed: 05/30/2023]
Abstract
Finding Transcription Factor Binding Sites, i.e., motif discovery, is crucial for understanding the gene regulatory relationship. Motifs are weakly conserved and motif discovery is an NP-hard problem. We propose a new approach called Cluster Refinement Algorithm for Motif Discovery (CRMD). CRMD employs a flexible statistical motif model allowing a variable number of motifs and motif instances. CRMD first uses a novel entropy-based clustering to find complete and good starting candidate motifs from the DNA sequences. CRMD then employs an effective greedy refinement to search for optimal motifs from the candidate motifs. The refinement is fast, and it changes the number of motif instances based on the adaptive thresholds. The performance of CRMD is further enhanced if the problem has one occurrence of motif instance per sequence. Using an appropriate similarity test of motifs, CRMD is also able to find multiple motifs. CRMD has been tested extensively on synthetic and real data sets. The experimental results verify that CRMD usually outperforms four other state-of-the-art algorithms in terms of the qualities of the solutions with competitive computing time. It finds a good balance between finding true motif instances and screening false motif instances, and is robust on problems of various levels of difficulty.
Affiliation(s)
- Gang Li
- Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong.
|
48
|
Mason MJ, Plath K, Zhou Q. Identification of context-dependent motifs by contrasting ChIP binding data. Bioinformatics 2010; 26:2826-32. [PMID: 20870645 DOI: 10.1093/bioinformatics/btq546] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Indexed: 11/12/2022]
Abstract
Motivation: DNA binding proteins play crucial roles in the regulation of gene expression. Transcription factors (TFs) activate or repress genes directly while other proteins influence chromatin structure for transcription. Binding sites of a TF exhibit a similar sequence pattern called a motif. However, a one-to-one map does not exist between each TF and motif. Many TFs in a protein family may recognize the same motif with subtle nucleotide differences leading to different binding affinities. Additionally, a particular TF may bind different motifs under certain conditions, for example in the presence of different co-regulators. The availability of genome-wide binding data of multiple collaborative TFs makes it possible to detect such context-dependent motifs. Results: We developed a contrast motif finder (CMF) for the de novo identification of motifs that are differentially enriched in two sets of sequences. Applying this method to a number of TF binding datasets from mouse embryonic stem cells, we demonstrate that CMF achieves substantially higher accuracy than several well-known motif finding methods. By contrasting sequences bound by distinct sets of TFs, CMF identified two different motifs that may be recognized by Oct4 dependent on the presence of another co-regulator and detected subtle motif signals that may be associated with potential competitive binding between Sox2 and Tcf3. Availability: The software CMF is freely available for academic use at www.stat.ucla.edu/~zhou/CMF.
Affiliation(s)
- Mike J Mason
- Department of Statistics, University of California, Los Angeles, CA 90095, USA
|
49
|
Klepper K, Drabløs F. PriorsEditor: a tool for the creation and use of positional priors in motif discovery. Bioinformatics 2010; 26:2195-7. [PMID: 20628076 PMCID: PMC2922893 DOI: 10.1093/bioinformatics/btq357] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022] Open
Abstract
Summary: Computational methods designed to discover transcription factor binding sites in DNA sequences often have a tendency to make a lot of false predictions. One way to improve accuracy in motif discovery is to rely on positional priors to focus the search to parts of a sequence that are considered more likely to contain functional binding sites. We present here a program called PriorsEditor that can be used to create such positional priors tracks based on a combination of several features, including phylogenetic conservation, nucleosome occupancy, histone modifications, physical properties of the DNA helix and many more. Availability: PriorsEditor is available as a web start application and downloadable archive from http://tare.medisin.ntnu.no/priorseditor (requires Java 1.6). The web site also provides tutorials, screenshots and example protocol scripts. Contact: kjetil.klepper@ntnu.no
Affiliation(s)
- Kjetil Klepper
- Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway.
|
50
|
Ma PC, Chan KC. Discovering Interesting Motif-Sets for Multi-Class Protein Sequence Classification. J Comput Biol 2010; 17:733-43. [DOI: 10.1089/cmb.2008.0213] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Affiliation(s)
- Patrick C.H. Ma
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
- Keith C.C. Chan
- Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China
|