1
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans. BMC Genomics 2023; 24:597. [PMID: 37805453 PMCID: PMC10560430 DOI: 10.1186/s12864-023-09692-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2023] [Accepted: 09/21/2023] [Indexed: 10/09/2023] Open
Abstract
BACKGROUND Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored.
RESULTS Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family-dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples for the same TF but different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and to propose the TFs and cell types likely to be responsible.
CONCLUSIONS Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.
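The idea of scoring a variant by the difference in k-mer binding specificity scores between alleles can be illustrated with a minimal sketch. This is not the MOCCS2 implementation: the score table, the choice of taking the best-scoring overlapping k-mer, and the function names `best_kmer_score` and `allele_score_difference` are all assumptions made for illustration.

```python
from typing import Dict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def best_kmer_score(seq: str, snp_pos: int, k: int, scores: Dict[str, float]) -> float:
    """Highest score among k-mers that overlap snp_pos (0-based).

    Assumes len(seq) >= k. K-mers missing from the (hypothetical) score
    table count as 0.0; each k-mer is looked up on both strands.
    """
    best = float("-inf")
    first = max(0, snp_pos - k + 1)
    last = min(snp_pos, len(seq) - k)
    for start in range(first, last + 1):
        kmer = seq[start:start + k]
        score = max(scores.get(kmer, 0.0), scores.get(revcomp(kmer), 0.0))
        best = max(best, score)
    return best


def allele_score_difference(ref_seq: str, snp_pos: int, alt_base: str,
                            k: int, scores: Dict[str, float]) -> float:
    """Alternative-allele score minus reference-allele score."""
    alt_seq = ref_seq[:snp_pos] + alt_base + ref_seq[snp_pos + 1:]
    return (best_kmer_score(alt_seq, snp_pos, k, scores)
            - best_kmer_score(ref_seq, snp_pos, k, scores))


# Toy score table (made up): only the E-box-like 6-mer has a non-zero score.
toy_scores = {"CACGTG": 5.0}
delta = allele_score_difference("AACACGTGAA", 4, "T", 6, toy_scores)
```

In this toy case the alternative allele destroys the only scored 6-mer, so `delta` is negative, which under these assumptions would be read as a predicted loss of binding.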
Affiliation(s)
- Saeko Tahara
  - Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
  - School of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Takaho Tsuchiya
  - Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
  - Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Hirotaka Matsumoto
  - School of Information and Data Sciences, Nagasaki University, 1-14, Bunkyo-Machi, Nagasaki City, Nagasaki, 852-8521, Japan
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
- Haruka Ozaki
  - Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
  - Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
  - Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
2
Jiang Z, Li X, Guo L. Binning Metagenomic Contigs Using Unsupervised Clustering and Reference Databases. Interdiscip Sci 2022; 14:795-803. [PMID: 35639335 DOI: 10.1007/s12539-022-00526-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2021] [Revised: 04/23/2022] [Accepted: 04/27/2022] [Indexed: 06/15/2023]
Abstract
Metagenomics extracts the genetic material of all microorganisms directly from the environment, yielding metagenomic samples with a large number of unknown DNA sequences. Binning of metagenomic contigs is a hot topic in metagenomics research. Current unsupervised metagenomic clustering algorithms face two key challenges. First, they rarely use reference databases, leaving those resources unexploited. Second, they are restricted by the characteristics of the sequences and of the clustering algorithms, which limits the binning effect. To address these challenges, a new binning method for metagenomic contigs is proposed that combines unsupervised clustering with reference databases constructed by scientists, making full use of the advantages of both to improve the overall binning effect. This method uses an integrated SVM classification model to further bin the parts that unsupervised clustering does not handle well. Our proposed method was tested on simulated datasets and a real dataset and compared with other state-of-the-art metagenomic clustering methods, including CONCOCT, Metabin2.0, Autometa, and MetaBAT. The results show that our method can achieve a higher precision rate and improve the binning effect.
Affiliation(s)
- Zhongjun Jiang
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
- Xiaobo Li
  - College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, 321004, China
- Lijun Guo
  - College of Information Science and Technology, Ningbo University, Ningbo, 315211, China
3
Toivonen J, Das PK, Taipale J, Ukkonen E. MODER2: first-order Markov modeling and discovery of monomeric and dimeric binding motifs. Bioinformatics 2020; 36:2690-2696. [PMID: 31999322 PMCID: PMC7203737 DOI: 10.1093/bioinformatics/btaa045] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/21/2019] [Revised: 12/23/2019] [Accepted: 01/23/2020] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers have been missing. RESULTS We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. AVAILABILITY AND IMPLEMENTATION Software implementation is available from https://github.com/jttoivon/moder2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
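The difference between a PPM and a first-order Markov model (ADM) can be made concrete with a small sketch that scores one site under both. This is only an illustration of the two model families, not MODER2's mixture model or EM procedure; the function names and the toy probabilities are assumptions.

```python
import math
from typing import Dict, List, Tuple


def ppm_log_prob(site: str, ppm: List[Dict[str, float]]) -> float:
    """Log-probability of a site under a PPM: positions are independent."""
    return sum(math.log(ppm[i][base]) for i, base in enumerate(site))


def adm_log_prob(site: str,
                 initial: Dict[str, float],
                 transitions: List[Dict[Tuple[str, str], float]]) -> float:
    """Log-probability under a position-specific first-order Markov model.

    initial[b] is P(b at position 0); transitions[i][(prev, cur)] is
    P(cur at position i+1 | prev at position i).
    """
    log_p = math.log(initial[site[0]])
    for i in range(1, len(site)):
        log_p += math.log(transitions[i - 1][(site[i - 1], site[i])])
    return log_p


# Toy two-position motif (made-up numbers) where T strongly follows A.
toy_ppm = [
    {"A": 0.4, "C": 0.4, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.4, "T": 0.4},
]
toy_initial = toy_ppm[0]
toy_transitions = [{("A", "T"): 0.7, ("A", "G"): 0.1, ("A", "A"): 0.1, ("A", "C"): 0.1}]

ppm_lp = ppm_log_prob("AT", toy_ppm)                       # log(0.4 * 0.4)
adm_lp = adm_log_prob("AT", toy_initial, toy_transitions)  # log(0.4 * 0.7)
```

With these toy numbers the ADM assigns "AT" a higher probability than the PPM because it captures the dependency between the two adjacent positions.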
Affiliation(s)
- Jarkko Toivonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
- Pratyush K Das
  - Applied Tumor Genomics, Research Programs Unit, University of Helsinki, Helsinki FI-00014, Finland
- Jussi Taipale
  - Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, SE 141 83 Stockholm, Sweden
  - Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
  - Genome-Scale Biology Program, University of Helsinki, Helsinki FI-00014, Finland
- Esko Ukkonen
  - Department of Computer Science, University of Helsinki, Helsinki FI-00014, Finland
4
Ruan X, Zhou D, Nie R, Hou R, Cao Z. Prediction of apoptosis protein subcellular location based on position-specific scoring matrix and isometric mapping algorithm. Med Biol Eng Comput 2019; 57:2553-2565. [PMID: 31621050 DOI: 10.1007/s11517-019-02045-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/29/2018] [Accepted: 09/04/2019] [Indexed: 01/04/2023]
Abstract
Apoptosis proteins are related to many diseases. Obtaining the subcellular localization information of apoptosis proteins helps to understand the mechanisms of diseases and to develop new drugs. At present, researchers mainly focus on the primary protein sequences, so there is still room for improvement in the prediction accuracy of the subcellular localization of apoptosis proteins. In this paper, a new method named ERT-ECT-PSSM-IS is proposed to predict apoptosis proteins based on the position-specific scoring matrix (PSSM). First, the local and global features of different directions are extracted by evolutionary row transformation (ERT) and cross-covariance of evolutionary column transformation (ECT) based on PSSM (ERT-ECT-PSSM). Second, an improved isometric mapping algorithm (I-SMA) is used to eliminate redundant features. Finally, we adopt a support vector machine (SVM) to classify our results, and the prediction accuracy is evaluated by jackknife cross-validation tests. The experimental results show that the proposed method not only extracts a richer feature representation but also has better predictive performance and robustness for the subcellular localization of apoptosis proteins in the ZD98, ZW225, and CL317 databases.
Graphical abstract: Framework of the proposed prediction model.
Affiliation(s)
- Xiaoli Ruan
  - Information College, Yunnan University, Kunming, 650504, China
- Dongming Zhou
  - Information College, Yunnan University, Kunming, 650504, China
- Rencan Nie
  - Information College, Yunnan University, Kunming, 650504, China
- Ruichao Hou
  - Information College, Yunnan University, Kunming, 650504, China
- Zicheng Cao
  - School of Public Health, Sun Yat-sen University, Shenzhen, 510080, China
5
Toivonen J, Kivioja T, Jolma A, Yin Y, Taipale J, Ukkonen E. Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets. Nucleic Acids Res 2018; 46:e44. [PMID: 29385521 PMCID: PMC5934673 DOI: 10.1093/nar/gky027] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 12/01/2017] [Accepted: 01/12/2018] [Indexed: 01/06/2023] Open
Abstract
In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in a modular fashion such that deviations from the expected motif would become explicit and the noise from dimeric occurrences would not corrupt monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as the software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations. Given training data and seeds for monomeric motifs, the algorithm learns, within a single probabilistic framework, a mixture model which represents monomeric motifs as standard position-specific probability matrices (PPMs), and dimeric motifs as pairs of monomeric PPMs, with associated orientation and spacing preferences. For dimers the model represents deviations from a purely modular model of two independent monomers, thus making co-operative binding effects explicit. MODER can analyze tens of Mbps of training data in reasonable time. We validated the tool on HT-SELEX and ChIP-seq data. Our findings include some TFs whose expected model has palindromic symmetry but whose observed model is directional.
Affiliation(s)
- Jarkko Toivonen
  - Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Helsinki, Finland
- Teemu Kivioja
  - Genome-Scale Biology Program, P.O. Box 63, FI-00014 University of Helsinki, Helsinki, Finland
- Arttu Jolma
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
- Yimeng Yin
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
- Jussi Taipale
  - Genome-Scale Biology Program, P.O. Box 63, FI-00014 University of Helsinki, Helsinki, Finland
  - Division of Functional Genomics and Systems Biology, Department of Medical Biochemistry and Biophysics, and Department of Biosciences and Nutrition, Karolinska Institutet, SE 141 83 Stockholm, Sweden
  - Department of Biochemistry, University of Cambridge, CB2 1GA Cambridge, UK
- Esko Ukkonen
  - Department of Computer Science, P.O. Box 68, FI-00014 University of Helsinki, Helsinki, Finland
  - Helsinki Institute for Information Technology HIIT, University of Helsinki & Aalto University, Helsinki, Finland
6
Chapman MP, Risom T, Aswani AJ, Langer EM, Sears RC, Tomlin CJ. Modeling differentiation-state transitions linked to therapeutic escape in triple-negative breast cancer. PLoS Comput Biol 2019; 15:e1006840. [PMID: 30856168 PMCID: PMC6428348 DOI: 10.1371/journal.pcbi.1006840] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 07/20/2018] [Revised: 03/21/2019] [Accepted: 02/05/2019] [Indexed: 11/18/2022] Open
Abstract
Drug resistance in breast cancer cell populations has been shown to arise through phenotypic transition of cancer cells to a drug-tolerant state, for example through epithelial-to-mesenchymal transition or transition to a cancer stem cell state. However, many breast tumors are a heterogeneous mixture of cell types with numerous epigenetic states in addition to stem-like and mesenchymal phenotypes, and the dynamic behavior of this heterogeneous mixture in response to drug treatment is not well-understood. Recently, we showed that plasticity between differentiation states, as identified with intracellular markers such as cytokeratins, is linked to resistance to specific targeted therapeutics. Understanding the dynamics of differentiation-state transitions in this context could facilitate the development of more effective treatments for cancers that exhibit phenotypic heterogeneity and plasticity. In this work, we develop computational models of a drug-treated, phenotypically heterogeneous triple-negative breast cancer (TNBC) cell line to elucidate the feasibility of differentiation-state transition as a mechanism for therapeutic escape in this tumor subtype. Specifically, we use modeling to predict the changes in differentiation-state transitions that underlie specific therapy-induced changes in differentiation-state marker expression that we recently observed in the HCC1143 cell line. We report several statistically significant therapy-induced changes in transition rates between basal, luminal, mesenchymal, and non-basal/non-luminal/non-mesenchymal differentiation states in HCC1143 cell populations. Moreover, we validate model predictions on cell division and cell death empirically, and we test our models on an independent data set. 
Overall, we demonstrate that changes in differentiation-state transition rates induced by targeted therapy can provoke distinct differentiation-state aggregations of drug-resistant cells, which may be fundamental to the design of improved therapeutic regimens for cancers with phenotypic heterogeneity.
Affiliation(s)
- Margaret P. Chapman
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, United States of America
- Tyler Risom
  - Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, United States of America
- Anil J. Aswani
  - Department of Industrial Engineering and Operations Research, University of California Berkeley, Berkeley, California, United States of America
- Ellen M. Langer
  - Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, United States of America
- Rosalie C. Sears
  - Department of Molecular and Medical Genetics, Oregon Health and Science University, Portland, Oregon, United States of America
  - Knight Cancer Institute, Oregon Health and Science University, Portland, Oregon, United States of America
  - Center for Spatial Systems Biomedicine, Oregon Health and Science University, Portland, Oregon, United States of America
- Claire J. Tomlin
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, California, United States of America
7
Shahmuradov IA, Mohamad Razali R, Bougouffa S, Radovanovic A, Bajic VB. bTSSfinder: a novel tool for the prediction of promoters in cyanobacteria and Escherichia coli. Bioinformatics 2017; 33:334-340. [PMID: 27694198 PMCID: PMC5408793 DOI: 10.1093/bioinformatics/btw629] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.7] [Received: 07/21/2016] [Accepted: 09/27/2016] [Indexed: 12/01/2022] Open
Abstract
Motivation The computational search for promoters in prokaryotes remains an attractive problem in bioinformatics. Despite the attention it has received for many years, the problem has not been addressed satisfactorily. In any bacterial genome, the transcription start site is chosen mostly by the sigma (σ) factor proteins, which control the gene activation. The majority of published bacterial promoter prediction tools target σ70 promoters in Escherichia coli. Moreover, no σ-specific classification of promoters is available for prokaryotes other than E. coli. Results Here, we introduce bTSSfinder, a novel tool that predicts putative promoters for five classes of σ factors in Cyanobacteria (σA, σC, σH, σG and σF) and for five classes of σ factors in E. coli (σ70, σ38, σ32, σ28 and σ24). Compared with currently available tools, bTSSfinder achieves higher accuracy (MCC = 0.86, F1-score = 0.93, versus MCC = 0.59, F1-score = 0.79 for the next best tool) and covers multiple classes of promoters. Availability and Implementation bTSSfinder is available standalone and online at http://www.cbrc.kaust.edu.sa/btssfinder. Supplementary information Supplementary data are available at Bioinformatics online.
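For readers unfamiliar with the two accuracy measures quoted above, the following sketch computes the Matthews correlation coefficient (MCC) and F1-score from confusion-matrix counts using their standard definitions. The counts in the example are invented and are not taken from the bTSSfinder evaluation.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Invented counts: 90 true promoters found, 10 missed, 10 false alarms.
example_mcc = mcc(tp=90, fp=10, tn=90, fn=10)
example_f1 = f1_score(tp=90, fp=10, fn=10)
```

Unlike F1, MCC also uses the true negatives, which is why promoter-prediction papers often report both.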
Affiliation(s)
- Ilham Ayub Shahmuradov
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Rozaimi Mohamad Razali
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Salim Bougouffa
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Aleksandar Radovanovic
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
- Vladimir B Bajic
  - Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), 4700 King Abdullah University of Science and Technology (KAUST), Thuwal, Kingdom of Saudi Arabia
8
Shahmuradov IA, Umarov RK, Solovyev VV. TSSPlant: a new tool for prediction of plant Pol II promoters. Nucleic Acids Res 2017; 45:e65. [PMID: 28082394 PMCID: PMC5416875 DOI: 10.1093/nar/gkw1353] [Citation(s) in RCA: 38] [Impact Index Per Article: 5.4] [Received: 09/18/2016] [Revised: 12/16/2016] [Accepted: 12/27/2016] [Indexed: 11/22/2022] Open
Abstract
Our current knowledge of eukaryotic promoters indicates that they have a complex architecture, often composed of numerous functional motifs. Most known promoters include multiple and in some cases mutually exclusive transcription start sites (TSSs). Moreover, TSS selection depends on cell or tissue type, developmental stage and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. Here, we present TSSPlant, a novel tool that predicts both TATA and TATA-less promoters in sequences of a wide spectrum of plant genomes. The tool was developed by using large promoter collections from ppdb and PlantProm DB. It utilizes eighteen significant compositional and signal features of plant promoter sequences selected in this study, which feed an artificial neural network-based model trained by the backpropagation algorithm. TSSPlant achieves significantly higher accuracy compared to the next best promoter prediction program for both TATA promoters (MCC≃0.84 and F1-score≃0.91 versus MCC≃0.51 and F1-score≃0.71) and TATA-less promoters (MCC≃0.80 and F1-score≃0.89 versus MCC≃0.29 and F1-score≃0.50). TSSPlant is available to download as a standalone program at http://www.cbrc.kaust.edu.sa/download/.
Affiliation(s)
- Ilham A. Shahmuradov
  - King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
  - Institute of Molecular Biology and Biotechnologies, ANAS, 2 Matbuat strasse, Baku AZ1073, Azerbaijan
- Ramzan Kh. Umarov
  - King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
9
Jayaram N, Usvyat D, Martin ACR. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics 2016; 17:547. [PMID: 27806697 PMCID: PMC6889335 DOI: 10.1186/s12859-016-1298-9] [Citation(s) in RCA: 56] [Impact Index Per Article: 7.0] [Received: 06/19/2016] [Accepted: 10/20/2016] [Indexed: 12/21/2022] Open
Abstract
Background Binding of transcription factors to transcription factor binding sites (TFBSs) is key to the mediation of transcriptional regulation. Information on experimentally validated functional TFBSs is limited and consequently there is a need for accurate prediction of TFBSs for gene annotation and in applications such as evaluating the effects of single nucleotide variations in causing disease. TFBSs are generally recognized by scanning a position weight matrix (PWM) against DNA using one of a number of available computer programs. Thus we set out to evaluate the best tools that can be used locally (and are therefore suitable for large-scale analyses) for creating PWMs from high-throughput ChIP-Seq data and for scanning them against DNA. Results We evaluated a set of de novo motif discovery tools that could be downloaded and installed locally using ENCODE-ChIP-Seq data and showed that rGADEM was the best-performing tool. TFBS prediction tools used to scan PWMs against DNA fall into two classes: those that predict individual TFBSs and those that identify clusters. Our evaluation showed that FIMO and MCAST performed best in these two classes, respectively. Conclusions Selection of the best-performing tools for generating PWMs from ChIP-Seq data and for scanning PWMs against DNA has the potential to improve prediction of precise transcription factor binding sites within regions identified by ChIP-Seq experiments for gene finding, understanding regulation and in evaluating the effects of single nucleotide variations in causing disease. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1298-9) contains supplementary material, which is available to authorized users.
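Scanning a PWM against DNA, as described above, amounts to sliding the matrix along the sequence and summing per-position log-odds scores. The sketch below shows that basic operation only; it is not how FIMO, MCAST or rGADEM work internally, and the pseudocount, uniform background, threshold and function names are assumptions chosen for illustration.

```python
import math
from typing import Dict, List, Tuple


def log_odds_pwm(counts: List[Dict[str, int]],
                 pseudocount: float = 1.0,
                 background: float = 0.25) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def scan(seq: str, pwm: List[Dict[str, float]], threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every forward-strand window scoring >= threshold.

    Assumes seq contains only A, C, G and T.
    """
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy TATA-like matrix built from ten made-up aligned sites.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_pwm = log_odds_pwm(toy_counts)
toy_hits = scan("GGTATAGG", toy_pwm, threshold=4.0)
```

A real scanner would also check the reverse strand and convert scores to p-values; both are omitted here to keep the sketch short.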
Affiliation(s)
- Narayan Jayaram
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Daniel Usvyat
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
- Andrew C R Martin
  - Institute of Structural and Molecular Biology, Division of Biosciences, University College London, Darwin Building, Gower Street, London, WC1E 6BT, UK
10
O'Neill PK, Erill I. Parametric bootstrapping for biological sequence motifs. BMC Bioinformatics 2016; 17:406. [PMID: 27716039 PMCID: PMC5052923 DOI: 10.1186/s12859-016-1246-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/20/2016] [Accepted: 09/08/2016] [Indexed: 11/10/2022] Open
Abstract
Background Biological sequence motifs drive the specific interactions of proteins and nucleic acids. Accordingly, the effective computational discovery and analysis of such motifs is a central theme in bioinformatics. Many practical questions about the properties of motifs can be recast as random sampling problems. In this light, the task is to determine for a given motif whether a certain feature of interest is statistically unusual among relevantly similar alternatives. Despite the generality of this framework, its use has been frustrated by the difficulties of defining an appropriate reference class of motifs for comparison and of sampling from it effectively. Results We define two distributions over the space of all motifs of given dimension. The first is the maximum entropy distribution subject to mean information content, and the second is the truncated uniform distribution over all motifs having information content within a given interval. We derive exact sampling algorithms for each. As a proof of concept, we employ these sampling methods to analyze a broad collection of prokaryotic and eukaryotic transcription factor binding site motifs. In addition to positional information content, we consider the informational Gini coefficient of the motif, a measure of the degree to which information is evenly distributed throughout a motif’s positions. We find that both prokaryotic and eukaryotic motifs tend to exhibit higher informational Gini coefficients (IGC) than would be expected by chance under either reference distribution. As a second application, we apply maximum entropy sampling to the motif p-value problem and use it to give elementary derivations of two new estimators. Conclusions Despite the historical centrality of biological sequence motif analysis, this study constitutes to our knowledge the first use of principled null hypotheses for sequence motifs given information content. 
Through their use, we are able to characterize for the first time differences in global motif statistics between biological motifs and their null distributions. In particular, we observe that biological sequence motifs show an unusual distribution of IGC, presumably due to biochemical constraints on the mechanisms of direct read-out. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1246-8) contains supplementary material, which is available to authorized users.
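The two motif statistics discussed in this abstract, positional information content and a Gini coefficient over it, can be sketched as follows. The code uses the textbook definitions (2 bits minus Shannon entropy per column, and the mean-absolute-difference form of the Gini coefficient); whether this matches the authors' exact IGC definition, including any small-sample correction, is an assumption.

```python
import math
from typing import Dict, List


def column_information(column: Dict[str, float]) -> float:
    """Information content in bits of one motif column of base probabilities."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


def gini(values: List[float]) -> float:
    """Gini coefficient: 0 when all values are equal, larger when concentrated."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in values for y in values)
    return diff_sum / (2 * n * total)


# Toy motif: two fully conserved positions followed by two uniform ones.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_motif = [conserved, conserved, uniform, uniform]

toy_ic = [column_information(col) for col in toy_motif]
toy_igc = gini(toy_ic)
```

Here all of the information sits in half of the positions, which gives a Gini coefficient of 0.5 under this definition; a motif with identical information at every position would give 0.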
Affiliation(s)
- Patrick K O'Neill
  - Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, 21250, US
- Ivan Erill
  - Department of Biological Sciences, University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, 21250, US
11
Bahrami-Samani E, Vo DT, de Araujo PR, Vogel C, Smith AD, Penalva LOF, Uren PJ. Computational challenges, tools, and resources for analyzing co- and post-transcriptional events in high throughput. Wiley Interdiscip Rev RNA 2015; 6:291-310. [PMID: 25515586 PMCID: PMC4397117 DOI: 10.1002/wrna.1274] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 06/16/2014] [Revised: 10/24/2014] [Accepted: 10/29/2014] [Indexed: 11/10/2022]
Abstract
Co- and post-transcriptional regulation of gene expression is complex and multifaceted, spanning the complete RNA lifecycle from genesis to decay. High-throughput profiling of the constituent events and processes is achieved through a range of technologies that continue to expand and evolve. Fully leveraging the resulting data is nontrivial, and requires the use of computational methods and tools carefully crafted for specific data sources and often intended to probe particular biological processes. Drawing upon databases of information pre-compiled by other researchers can further elevate analyses. Within this review, we describe the major co- and post-transcriptional events in the RNA lifecycle that are amenable to high-throughput profiling. We place specific emphasis on the analysis of the resulting data, in particular the computational tools and resources available, as well as looking toward future challenges that remain to be addressed.
Affiliation(s)
- Emad Bahrami-Samani
  - Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA
- Dat T. Vo
  - Children’s Cancer Research Institute and Department of Cellular and Structural Biology, University of Texas Health Science Center, San Antonio, TX
- Patricia Rosa de Araujo
  - Children’s Cancer Research Institute and Department of Cellular and Structural Biology, University of Texas Health Science Center, San Antonio, TX
- Christine Vogel
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY
- Andrew D. Smith
  - Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA
- Luiz O. F. Penalva
  - Children’s Cancer Research Institute and Department of Cellular and Structural Biology, University of Texas Health Science Center, San Antonio, TX
- Philip J. Uren
  - Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA
12
|
Huang WL, Tung CW, Liaw C, Huang HL, Ho SY. Rule-based knowledge acquisition method for promoter prediction in human and Drosophila species. ScientificWorldJournal 2014; 2014:327306. [PMID: 24955394 PMCID: PMC3927563 DOI: 10.1155/2014/327306] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/31/2013] [Accepted: 10/10/2013] [Indexed: 01/08/2023] Open
Abstract
The rapid and reliable identification of promoter regions is important now that the number of genomes to be sequenced is increasing rapidly. Various methods have been developed, but few investigate the effectiveness of sequence-based features in promoter prediction. This study proposes a knowledge acquisition method (named PromHD) based on if-then rules for promoter prediction in human and Drosophila species. PromHD utilizes an effective feature-mining algorithm and a reference feature set of 167 DNA sequence descriptors (DNASDs), comprising three descriptors of physicochemical properties (absorption maxima, molecular weight, and molar absorption coefficient), 128 top-ranked descriptors of 4-mer motifs, and 36 global sequence descriptors. PromHD identifies two feature subsets with 99 and 74 DNASDs and yields test accuracies of 96.4% and 97.5% in human and Drosophila species, respectively. Based on the 99- and 74-dimensional feature vectors, PromHD generates several if-then rules by using the decision tree mechanism for promoter prediction. The top-ranked informative rules with high certainty grades reveal that the global sequence descriptor, the length of nucleotide A at the first position of the sequence, and two physicochemical properties, absorption maxima and molecular weight, are effective in distinguishing promoters from non-promoters in human and Drosophila species, respectively.
Affiliation(s)
- Wen-Lin Huang
  - Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli 351, Taiwan
- Chun-Wei Tung
  - School of Pharmacy, College of Pharmacy, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Chyn Liaw
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
- Hui-Ling Huang
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
- Shinn-Ying Ho
  - Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu 300, Taiwan
  - Department of Biological Science and Technology, National Chiao Tung University, Hsinchu 300, Taiwan
|
13
|
Liu W, Chen H, Chen L. An ant colony optimization based algorithm for identifying gene regulatory elements. Comput Biol Med 2013; 43:922-32. [PMID: 23746735 DOI: 10.1016/j.compbiomed.2013.04.008] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/15/2011] [Revised: 04/10/2013] [Accepted: 04/11/2013] [Indexed: 11/15/2022]
Abstract
Identifying the regulatory elements in gene sequences is one of the most important tasks in bioinformatics. Most of the existing algorithms for identifying regulatory elements tend to converge to a local optimum and have high time complexity. Ant Colony Optimization (ACO) is a meta-heuristic method based on swarm intelligence and is derived from a model inspired by the collective foraging behavior of real ants. Taking advantage of ACO traits such as self-organization and robustness, this paper designs and implements an ACO-based algorithm named ACRI (ant-colony-regulatory-identification) for identifying all possible transcription factor binding sites in the upstream regions of co-expressed genes. To accelerate the ants' searching process, a local optimization strategy is presented to adjust the ants' start positions on the searched sequences. By exploiting the powerful optimization ability of ACO, the ACRI algorithm can not only improve the precision of the results, but also achieve a very high speed. Experimental results on real-world datasets show that ACRI can outperform other traditional algorithms in terms of speed and solution quality.
Affiliation(s)
- Wei Liu
  - Department of Computer Science and Engineering, Southeast University, Nanjing 210096, China.
|
14
|
Ju L, Wang YD, Hung Y, Wu CFJ, Zhu C. An HMM-based algorithm for evaluating rates of receptor-ligand binding kinetics from thermal fluctuation data. Bioinformatics 2013; 29:1511-8. [PMID: 23599504 DOI: 10.1093/bioinformatics/btt180] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Abrupt reduction/resumption of thermal fluctuations of a force probe has been used to identify association/dissociation events of protein-ligand bonds. We show that the off-rate of molecular dissociation can be estimated by analyzing the bond lifetime, while the on-rate of molecular association can be estimated by analyzing the waiting time between two neighboring bond events. However, the analysis relies heavily on subjective judgments and is time-consuming. To automate the process of mapping out bond events from thermal fluctuation data, we develop a hidden Markov model (HMM)-based method. RESULTS The HMM method represents the bond state by a hidden variable with two values: bound and unbound. The bond association/dissociation is visualized and pinpointed. We apply the method to analyze a key receptor-ligand interaction in the early stage of hemostasis and thrombosis: the von Willebrand factor (VWF) binding to platelet glycoprotein Ibα (GPIbα). The numbers of bond lifetime and waiting time events estimated by the HMM are far greater than those estimated by a descriptive statistical method from the same set of raw data. The kinetic parameters estimated by the HMM are in excellent agreement with those obtained by descriptive statistical analysis, but have much smaller errors for both wild-type and two mutant VWF-A1 domains. Thus, the computerized analysis allows us to speed up the analysis and improve the quality of estimates of receptor-ligand binding kinetics.
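The two-state idea in this abstract can be illustrated with a small Viterbi decoder. The sketch below is not the authors' implementation: it assumes zero-mean Gaussian emissions whose standard deviation is smaller in the bound state, and the parameter values (`sigmas`, `stay`) and function names are hypothetical choices made only for illustration.

```python
import math

def log_gauss(x, sigma):
    """Log density of a zero-mean Gaussian with standard deviation sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - x * x / (2 * sigma ** 2)

def viterbi_two_state(obs, sigmas=(1.0, 0.2), stay=0.95):
    """Most likely state path; state 0 = unbound (large fluctuation), 1 = bound.

    sigmas and stay are assumed, illustrative parameters.
    """
    log_stay, log_switch = math.log(stay), math.log(1 - stay)
    # Uniform prior over the two states at the first sample.
    score = [math.log(0.5) + log_gauss(obs[0], s) for s in sigmas]
    back = []
    for x in obs[1:]:
        new_score, pointers = [], []
        for state in (0, 1):
            candidates = [score[prev] + (log_stay if prev == state else log_switch)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_score.append(candidates[best_prev] + log_gauss(x, sigmas[state]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def dwell_times(path, state):
    """Lengths (in samples) of consecutive runs spent in `state`."""
    runs, current = [], 0
    for s in path:
        if s == state:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

Under these assumptions, the run lengths returned by `dwell_times(path, 1)` play the role of bond lifetimes and those from `dwell_times(path, 0)` the role of waiting times, from which off- and on-rates would then be estimated.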
Affiliation(s)
- Lining Ju
  - Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta 30318, USA
|
15
|
Abstract
The specificity of protein-DNA interactions is most commonly modeled using position weight matrices (PWMs). First introduced in 1982, they have been adapted to many new types of data and many different approaches have been developed to determine the parameters of the PWM. New high-throughput technologies provide a large amount of data rapidly and offer an unprecedented opportunity to determine accurately the specificities of many transcription factors (TFs). But taking full advantage of the new data requires advanced algorithms that take into account the biophysical processes involved in generating the data. The new large datasets can also aid in determining when the PWM model is inadequate and must be extended to provide accurate predictions of binding sites. This article provides a general mathematical description of a PWM and how it is used to score potential binding sites, a brief history of the approaches that have been developed and the types of data that are used with an emphasis on algorithms that we have developed for analyzing high-throughput datasets from several new technologies. It also describes extensions that can be added when the simple PWM model is inadequate and further enhancements that may be necessary. It briefly describes some applications of PWMs in the discovery and modeling of in vivo regulatory networks.
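As a rough illustration of how a PWM scores candidate sites, the sketch below builds a log-odds matrix from a handful of aligned sites and slides it along a sequence. It is a minimal example under stated assumptions, not the algorithm from the article: the uniform background of 0.25, the pseudocount of 1, the toy sites, and all function names are hypothetical.

```python
import math

def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in "ACGT"})
    return pwm

def score(pwm, seq):
    """Additive score: sum the per-position weights of the observed bases."""
    return sum(column[base] for column, base in zip(pwm, seq))

def best_hit(pwm, sequence):
    """Slide the matrix along a sequence and return (best_score, start index)."""
    width = len(pwm)
    return max((score(pwm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Toy aligned sites, for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
pwm = pwm_from_sites(sites)
```

The additive form of `score` is exactly the independence assumption between positions; the extensions mentioned in the abstract address cases where that assumption is inadequate.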
|
16
|
Tan M, Yu D, Jin Y, Dou L, Li B, Wang Y, Yue J, Liang L. An information transmission model for transcription factor binding at regulatory DNA sites. Theor Biol Med Model 2012; 9:19. [PMID: 22672438 PMCID: PMC3442977 DOI: 10.1186/1742-4682-9-19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/08/2012] [Accepted: 05/17/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Computational identification of transcription factor binding sites (TFBSs) is a rapid, cost-efficient way to locate unknown regulatory elements. With increased potential for high-throughput genome sequencing, the availability of accurate computational methods for TFBS prediction has never been as important as it currently is. To date, identifying TFBSs with high sensitivity and specificity is still an open challenge, necessitating the development of novel models for predicting transcription factor-binding regulatory DNA elements. RESULTS Based on information theory, we propose a model for transcription factor binding of regulatory DNA sites. Our model incorporates position interdependencies in effective ways. The model computes the information transferred (TI) between the transcription factor and the TFBS during the binding process and uses TI as the criterion to determine whether the sequence motif is a possible TFBS. Based on this model, we developed a computational method to identify TFBSs. By theoretically proving and testing our model using both real and artificial data, we found that our model provides highly accurate predictive results. CONCLUSIONS In this study, we present a novel model for transcription factor binding of regulatory DNA sites. The model can provide an increased ability to detect TFBSs.
Affiliation(s)
- Mingfeng Tan
  - Beijing Institute of Biotechnology, Beijing 100071, China
|
17
|
Finding Transcription Factor Binding Motifs for Coregulated Genes by Combining Sequence Overrepresentation with Cross-Species Conservation. Journal of Probability and Statistics 2012. [DOI: 10.1155/2012/830575] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/17/2022] Open
Abstract
Novel computational methods for finding transcription factor binding motifs have long been sought because identifying them experimentally is tedious. However, the current prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcriptional factor binding sites (TFBSs). We propose here a method that combines sequence overrepresentation and cross-species sequence conservation to detect TFBSs in upstream regions of a given set of coregulated genes. We applied the method to 35 S. cerevisiae transcriptional factors with known DNA binding motifs (with the support of orthologous sequences from the genomes of S. mikatae, S. bayanus, and S. paradoxus), and the proposed method outperformed the single-genome-based motif finding methods MEME and AlignACE as well as the multiple-genome-based methods PHYME and Footprinter for the majority of these transcriptional factors. Compared with the prevailing motif finding software, our method has some advantages in finding transcriptional factor binding motifs for potential coregulated genes if the gene upstream sequences of multiple closely related species are available. Although we used yeast genomes to assess our method in this study, it might also be applied to other organisms if suitable related species are available and the upstream sequences of coregulated genes can be obtained for the multiple closely related species.
|
18
|
Ma Q, Wang JTL. Biological data mining using Bayesian neural networks: a case study. Int J Artif Intell T 2011. [DOI: 10.1142/s0218213099000294] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/18/2022]
Abstract
Biological data mining is the activity of finding significant information in biomolecular data. The significant information may refer to motifs, clusters, genes, and protein signatures. This paper presents an example of biological data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first-level classifiers include three Bayesian neural networks that learn from three different feature sets. The outputs of the first-level classifiers are combined at the second level to give the final result. An empirical study shows that a precision rate of 92.2% is achieved, indicating excellent performance of the proposed approach.
Affiliation(s)
- Qicheng Ma
  - Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
- Jason T. L. Wang
  - Department of Computer and Information Science, New Jersey Institute of Technology, University Heights, Newark, NJ 07102, USA
|
19
|
Mukherjee S, Mitra S. Hidden Markov models, grammars, and biology: a tutorial. J Bioinform Comput Biol 2011; 3:491-526. [PMID: 15852517 DOI: 10.1142/s0219720005001077] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 04/23/2004] [Revised: 01/05/2004] [Accepted: 01/06/2005] [Indexed: 11/18/2022]
Abstract
Biological sequences and structures have been modelled using various machine learning techniques and abstract mathematical concepts. This article surveys methods that use hidden Markov models and functional grammars for this purpose. We provide a formal introduction to hidden Markov models and grammars, stressing a comprehensive mathematical description of the methods and their natural continuity. The basic algorithms and their application to analyzing biological sequences and modelling structures of biomolecules such as proteins and nucleic acids are discussed. The different approaches are compared, and possible areas of work and open problems are highlighted. Related databases and software available on the internet are also mentioned.
Affiliation(s)
- Shibaji Mukherjee
  - Association for Studies in Computational Biology, Kolkata 700 018, India.
|
20
|
Bi C. SEAM: a stochastic EM-type algorithm for motif-finding in biopolymer sequences. J Bioinform Comput Biol 2011; 5:47-77. [PMID: 17477491 DOI: 10.1142/s0219720007002527] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 06/09/2006] [Revised: 08/22/2006] [Accepted: 10/14/2006] [Indexed: 12/21/2022]
Abstract
Position weight matrix-based statistical modeling for the identification and characterization of motif sites in a set of unaligned biopolymer sequences is presented. This paper describes and implements a new algorithm, the Stochastic EM-type Algorithm for Motif-finding (SEAM), and redesigns and implements the EM-based motif-finding algorithm called deterministic EM (DEM) for comparison with SEAM, its stochastic counterpart. The gold standard example, cyclic adenosine monophosphate receptor protein (CRP) binding sequences, together with other biological sequences, is used to illustrate the performance of the new algorithm and compare it with other popular motif-finding programs. The convergence of the new algorithm is shown by simulation. The in silico experiments using simulated and biological examples illustrate the power and robustness of the new algorithm SEAM in de novo motif discovery.
Affiliation(s)
- Chengpeng Bi
  - Children's Mercy Hospitals and Clinics, 2401 Gillham Road, Pediatrics Research Building, Third Floor, Kansas City, Missouri 64108, USA.
|
21
|
Akhtar MN, Bukhari SA, Fazal Z, Qamar R, Shahmuradov IA. POLYAR, a new computer program for prediction of poly(A) sites in human sequences. BMC Genomics 2010; 11:646. [PMID: 21092114 PMCID: PMC3053588 DOI: 10.1186/1471-2164-11-646] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 03/24/2010] [Accepted: 11/19/2010] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND mRNA polyadenylation is an essential step of pre-mRNA processing in eukaryotes. Accurate prediction of the pre-mRNA 3'-end cleavage/polyadenylation sites is important for defining the gene boundaries and understanding gene expression mechanisms. RESULTS 28761 human mapped poly(A) sites have been classified into three classes containing different known forms of polyadenylation signal (PAS) or none of them (PAS-strong, PAS-weak and PAS-less, respectively), and a new computer program, POLYAR, for the prediction of poly(A) sites of each class was developed. In comparison with polya_svm (to date the most accurate computer program for prediction of poly(A) sites), when searching for PAS-strong poly(A) sites in human sequences, POLYAR had a significantly higher prediction sensitivity (80.8% versus 65.7%) and specificity (66.4% versus 51.7%). However, when a similar search was conducted for PAS-weak and PAS-less poly(A) sites, both programs had a very low prediction accuracy, which indicates that our knowledge about factors involved in the determination of the poly(A) sites is not sufficient to identify such polyadenylation regions. CONCLUSIONS We present a new classification of polyadenylation sites into three classes and a novel computer program, POLYAR, for prediction of poly(A) sites/regions of each class. In tests, POLYAR shows high accuracy in predicting the PAS-strong poly(A) sites; its efficiency in searching for PAS-weak and PAS-less poly(A) sites is not very high but is comparable to that of other available programs. These findings suggest that additional characteristics of such poly(A) sites remain to be elucidated. The POLYAR program, with a stand-alone version for downloading, is available at http://cub.comsats.edu.pk/polyapredict.htm.
Affiliation(s)
- Malik Nadeem Akhtar
  - Department of Biosciences, COMSATS Institute of Information Technology, Islamabad, Pakistan
|
24
|
Sahota G, Stormo GD. Novel sequence-based method for identifying transcription factor binding sites in prokaryotic genomes. Bioinformatics 2010; 26:2672-7. [PMID: 20807838 DOI: 10.1093/bioinformatics/btq501] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Indexed: 11/14/2022]
Abstract
MOTIVATION Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the human microbiome project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be studied experimentally. Most related computational work has been focused on sequence assembly, gene annotation and metabolic network reconstruction. We have developed a method that primarily uses available sequence data to determine prokaryotic transcription factor (TF) binding specificities. RESULTS Specificity determining residues (critical residues) were identified from crystal structures of DNA-protein complexes, and TFs with the same critical residues were grouped into specificity classes. The putative binding regions for each class were defined as the set of promoters for each TF itself (autoregulatory) and the immediately upstream and downstream operons. MEME was used to find putative motifs within each separate class. Tests on the LacI and TetR TF families, using RegulonDB annotated sites, showed prediction sensitivities of 86% and 80%, respectively. AVAILABILITY http://ural.wustl.edu/~gsahota/HTHmotif/
Affiliation(s)
- Gurmukh Sahota
  - Department of Genetics, Washington University School of Medicine, Saint Louis, MO 63108, USA
|
25
|
Valen E, Sandelin A, Winther O, Krogh A. Discovery of regulatory elements is improved by a discriminatory approach. PLoS Comput Biol 2009; 5:e1000562. [PMID: 19911049 PMCID: PMC2770120 DOI: 10.1371/journal.pcbi.1000562] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 04/03/2009] [Accepted: 10/13/2009] [Indexed: 01/01/2023] Open
Abstract
A major goal in post-genome biology is the complete mapping of the gene regulatory networks for every organism. Identification of regulatory elements is a prerequisite for realizing this ambitious goal. A common problem is finding regulatory patterns in promoters of a group of co-expressed genes, but contemporary methods are challenged by the size and diversity of regulatory regions in higher metazoans. Two key issues are the small amount of information contained in a pattern compared to the large promoter regions and the repetitive characteristics of genomic DNA, which both lead to "pattern drowning". We present a new computational method for identifying transcription factor binding sites in promoters using a discriminatory approach with a large negative set encompassing a significant sample of the promoters from the relevant genome. The sequences are described by a probabilistic model and the most discriminatory motifs are identified by maximizing the probability of the sets given the motif model and prior probabilities of motif occurrences in both sets. Due to the large number of promoters in the negative set, an enhanced suffix array is used to improve speed and performance. Using our method, we demonstrate higher accuracy than the best of contemporary methods, high robustness when extending the length of the input sequences and a strong correlation between our objective function and the correct solution. Using a large background set of real promoters instead of a simplified model leads to higher discriminatory power and markedly reduces the need for repeat masking; a common pre-processing step for other pattern finders.
Affiliation(s)
- Eivind Valen
  - The Bioinformatics Centre, Department of Biology and the Biotech Research and Innovation Centre (BRIC), University of Copenhagen, Copenhagen, Denmark.
|
26
|
van Hijum SAFT, Medema MH, Kuipers OP. Mechanisms and evolution of control logic in prokaryotic transcriptional regulation. Microbiol Mol Biol Rev 2009; 73:481-509, Table of Contents. [PMID: 19721087 PMCID: PMC2738135 DOI: 10.1128/mmbr.00037-08] [Citation(s) in RCA: 96] [Impact Index Per Article: 6.4] [Indexed: 11/20/2022] Open
Abstract
A major part of organismal complexity and versatility of prokaryotes resides in their ability to fine-tune gene expression to adequately respond to internal and external stimuli. Evolution has been very innovative in creating intricate mechanisms by which different regulatory signals operate and interact at promoters to drive gene expression. The regulation of target gene expression by transcription factors (TFs) is governed by control logic brought about by the interaction of regulators with TF binding sites (TFBSs) in cis-regulatory regions. A factor that in large part determines the strength of the response of a target to a given TF is motif stringency, the extent to which the TFBS fits the optimal TFBS sequence for a given TF. Advances in high-throughput technologies and computational genomics allow reconstruction of transcriptional regulatory networks in silico. To optimize the prediction of transcriptional regulatory networks, i.e., to separate direct regulation from indirect regulation, a thorough understanding of the control logic underlying the regulation of gene expression is required. This review summarizes the state of the art of the elements that determine the functionality of TFBSs by focusing on the molecular biological mechanisms and evolutionary origins of cis-regulatory regions.
Affiliation(s)
- Sacha A F T van Hijum
  - Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Kerklaan 30, 9751 NN Haren, The Netherlands.
|
27
|
Narlikar L, Ovcharenko I. Identifying regulatory elements in eukaryotic genomes. Briefings in Functional Genomics and Proteomics 2009; 8:215-30. [PMID: 19498043 DOI: 10.1093/bfgp/elp014] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.9] [Indexed: 12/30/2022]
Abstract
Proper development and functioning of an organism depends on precise spatial and temporal expression of all its genes. These coordinated expression-patterns are maintained primarily through the process of transcriptional regulation. Transcriptional regulation is mediated by proteins binding to regulatory elements on the DNA in a combinatorial manner, where particular combinations of transcription factor binding sites establish specific regulatory codes. In this review, we survey experimental and computational approaches geared towards the identification of proximal and distal gene regulatory elements in the genomes of complex eukaryotes. Available approaches that decipher the genetic structure and function of regulatory elements by exploiting various sources of information like gene expression data, chromatin structure, DNA-binding specificities of transcription factors, cooperativity of transcription factors, etc. are highlighted. We also discuss the relevance of regulatory elements in the context of human health through examples of mutations in some of these regions having serious implications in misregulation of genes and being strongly associated with human disorders.
Affiliation(s)
- Leelavati Narlikar
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
28
|
Dekhtyar M, Morin A, Sakanyan V. Triad pattern algorithm for predicting strong promoter candidates in bacterial genomes. BMC Bioinformatics 2008; 9:233. [PMID: 18471287 PMCID: PMC2412878 DOI: 10.1186/1471-2105-9-233] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 11/19/2007] [Accepted: 05/09/2008] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Bacterial promoters, which increase the efficiency of gene expression, differ from other promoters by several characteristics. This difference, not yet widely exploited in bioinformatics, looks promising for the development of relevant computational tools to search for strong promoters in bacterial genomes. RESULTS We describe a new triad pattern algorithm that predicts strong promoter candidates in annotated bacterial genomes by matching specific patterns for the group I sigma70 factors of Escherichia coli RNA polymerase. It detects promoter-specific motifs by consecutively matching three patterns, consisting of an UP-element, required for interaction with the alpha subunit, and then optimally-separated patterns of -35 and -10 boxes, required for interaction with the sigma70 subunit of RNA polymerase. Analysis of 43 bacterial genomes revealed that the frequency of candidate sequences depends on the A+T content of the DNA under examination. The accuracy of in silico prediction was experimentally validated for the genome of a hyperthermophilic bacterium, Thermotoga maritima, by applying a cell-free expression assay using the predicted strong promoters. In this organism, the strong promoters govern genes for translation, energy metabolism, transport, cell movement, and other as-yet unidentified functions. CONCLUSION The triad pattern algorithm developed for predicting strong bacterial promoters is well suited for analyzing bacterial genomes with an A+T content of less than 62%. This computational tool opens new prospects for investigating global gene expression, and individual strong promoters in bacteria of medical and/or economic significance.
Affiliation(s)
- Amelie Morin
  - Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
- Vehary Sakanyan
  - Laboratoire de Biotechnologie, UMR CNRS 6204, Université de Nantes, 2 rue de la Houssinière, 44322 Nantes, France
  - ProtNeteomix, 2 rue de la Houssinière, 44322 Nantes, France
|
29
|
Durga Bhavani S, Sobha Rani T, Bapi RS. Feature selection using correlation fractal dimension: Issues and applications in binary classification problems. Appl Soft Comput 2008. [DOI: 10.1016/j.asoc.2007.03.007] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.1] [Indexed: 12/01/2022]
|
30
|
Ma BG. How to describe genes: Enlightenment from the quaternary number system. Biosystems 2007; 90:20-7. [PMID: 16945479 DOI: 10.1016/j.biosystems.2006.06.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 05/13/2005] [Revised: 06/15/2006] [Accepted: 06/19/2006] [Indexed: 11/17/2022]
Abstract
As an open problem, computational gene identification has been widely studied, and many gene finders (software) are available today. However, little attention has been given to the problem of describing the common features of known genes in databanks to transform raw data into human-understandable knowledge. In this paper, we draw attention to the task of describing genes and propose a trial implementation by treating DNA sequences as quaternary numbers. Under such a treatment, the common features of genes can be represented by a "position weight function", the core concept for a number system. In principle, the "position weight function" can be any real-valued function. In this paper, by approximating the function using trigonometric functions, some characteristic parameters indicating single nucleotide periodicities were obtained for the genome of the bacterium Escherichia coli K12 and the genome of the eukaryote yeast. As a byproduct of this approach, a single-nucleotide-level measure is derived that complements codon-based indexes in describing the coding quality and expression level of an open reading frame (ORF). The ideas presented here have the potential to become a general methodology for biological sequence analysis.
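The idea of reading a DNA sequence as a quaternary number can be sketched as follows. The digit assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for illustration; the abstract does not state which mapping or weight function the paper uses. Passing a different `weight` callable stands in for the generalized "position weight function".

```python
DIGITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed digit assignment
BASES = "ACGT"

def positional_value(seq, weight=lambda k: 4 ** k):
    """Sum digit * weight(k), where k counts positions from the right end.

    With the default weight 4**k this is the ordinary base-4 value.
    """
    n = len(seq)
    return sum(DIGITS[base] * weight(n - 1 - i) for i, base in enumerate(seq))

def from_quaternary(value, length):
    """Invert the standard base-4 encoding back into a DNA string."""
    bases = []
    for _ in range(length):
        value, digit = divmod(value, 4)
        bases.append(BASES[digit])
    return "".join(reversed(bases))
```

For example, `positional_value("ACGT")` evaluates 0·64 + 1·16 + 2·4 + 3·1, and `from_quaternary` recovers the original string from that value and the sequence length.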
Collapse
Affiliation(s)
- Bin-Guang Ma
- College of Chemistry and Chemical Engineering, Suzhou University, Suzhou 215006, PR China.
|
31
|
Rani TS, Bhavani SD, Bapi RS. Analysis of E. coli promoter recognition problem in dinucleotide feature space. Bioinformatics 2007; 23:582-8. [PMID: 17237059 DOI: 10.1093/bioinformatics/btl670] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Patterns in the promoter sequences within a species are known to be conserved, but there exist many exceptions to this rule, which makes promoter recognition a complex problem. Although many complex feature extraction schemes coupled with several classifiers have been proposed for promoter recognition in the current literature, the problem is still open. RESULTS A dinucleotide global feature extraction method is proposed for the recognition of sigma-70 promoters in Escherichia coli in this article. The positive data set consists of sigma-70 promoters with known transcription starting points which are part of the RegulonDB and PromEC databases. Four different kinds of negative data sets are considered, two of them biological sets (Gordon et al., 2003) and the other two synthetic data sets. Our results reveal that a single-layer perceptron using dinucleotide features is able to achieve an accuracy of 80% against a background of biological non-promoters and 96% for random data sets. A scheme for locating the promoter regions in a given genome sequence is proposed. A deeper analysis of the data set shows that there is a bifurcation of the data set into two distinct classes, a majority class and a minority class. Our results point out that the majority class, constituting the majority promoter and the majority non-promoter signal, is linearly separable. The minority class is also linearly separable. We further show that the feature extraction and classification methods proposed in the paper are generic enough to be applied to the more complex problem of eukaryotic promoter recognition. We present Drosophila promoter recognition as a case study. AVAILABILITY http://202.41.85.117/htmfiles/faculty/tsr/tsr.html.
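A minimal sketch of what a global dinucleotide feature vector and a single-layer perceptron decision could look like follows. The normalisation (relative frequency over all overlapping pairs) and the function names are assumptions for illustration; the abstract does not specify the exact feature definition or training procedure.

```python
from itertools import product

# The 16 possible dinucleotides, in a fixed order, serve as feature names.
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_features(seq: str) -> list[float]:
    """Relative frequency of each overlapping dinucleotide in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCS, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing N or other symbols
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCS]


def perceptron_predict(features: list[float], weights: list[float], bias: float) -> int:
    """Single-layer perceptron decision: 1 (promoter) if w.x + b > 0, else 0."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


features = dinucleotide_features("TATAAT")
```

The weights and bias would have to be learned from labelled promoter and non-promoter sequences; none are given here.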
Affiliation(s)
- T Sobha Rani
- Computational Intelligence Lab, Department of Computer and Information Sciences, University of Hyderabad, Hyderabad 500046, India.
|
32
|
GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res 2006; 34:3585-98. [PMID: 16855295 PMCID: PMC1524905 DOI: 10.1093/nar/gkl372] [Citation(s) in RCA: 98] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/13/2023] Open
Abstract
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
Affiliation(s)
- Debraj GuhaThakurta
- Research Genetics Division, Rosetta Inpharmatics LLC, Merck & Co., Inc, 401 Terry Avenue North, Seattle, WA 98109, USA.
|
33
|
Abstract
The gene identification problem is the problem of interpreting nucleotide sequences by computer, in order to provide tentative annotation on the location, structure, and functional class of protein-coding genes. This problem is of self-evident importance, and is far from being fully solved, particularly for higher eukaryotes. Thus it is not surprising that the number of algorithm and software developers working in the area is rapidly increasing. The present paper is an overview of the field, with an emphasis on eukaryotes, for such developers.
Affiliation(s)
- J W Fickett
- Theoretical Biology and Biophysics Group, MS K710, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
|
34
|
Carvalho AM, Freitas AT, Oliveira AL, Sagot MF. An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:126-40. [PMID: 17048399 DOI: 10.1109/tcbb.2006.16] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the data set sequences. This type of conserved regions, called structured motifs, is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological data sets shows its ability to extract relevant consensi.
|
35
|
Extracting Gene Regulation Information from Microarray Time-Series Data Using Hidden Markov Models. ACTA ACUST UNITED AC 2006. [DOI: 10.1007/11902140_17] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
|
36
|
Liu R, Agarwal P. Computational identification of transcription factors involved in early cellular response to a stimulus. J Bioinform Comput Biol 2005; 3:949-64. [PMID: 16078369 DOI: 10.1142/s0219720005001405] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2004] [Revised: 01/14/2005] [Accepted: 01/17/2005] [Indexed: 11/18/2022]
Abstract
The response of genes to cell stimuli is often measured by microarrays. However, studying the function of these genes rarely elucidates how the stimuli activate or suppress them. To understand the mechanisms of cell stimulation, we describe a computational method for analyzing mammalian promoters of early response genes to detect the transcription factors activated by cell stimulation. We first analyzed promoters of the response genes for transcription factor binding sites conserved between human and mouse. We then applied hypergeometric statistics in conjunction with Bonferroni correction to identify the top transcription factors whose binding sites were significantly over-represented among these promoters. In five data sets with early response genes, a significantly larger than expected number of genes had binding sites in their promoters for transcription factors previously known to be involved in response to the stimulus, while data sets with measurements at longer time points (24 hours) failed to show such over-representation. Because the end points of signal transduction pathways are transcription factors, this methodology is useful for exploring signaling pathways activated by various stimuli through microarray studies.
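The over-representation test mentioned above can be sketched with a hypergeometric upper tail followed by a Bonferroni adjustment. The parameter names (background size, number of background promoters with a site, and so on) are illustrative assumptions; the abstract does not give the paper's exact setup.

```python
from math import comb


def hypergeom_upper_tail(population: int, successes: int, draws: int, observed: int) -> float:
    """P(X >= observed) when drawing `draws` items without replacement from
    `population` items, of which `successes` carry the binding site."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(population, draws)


def bonferroni(p_value: float, n_tests: int) -> float:
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical numbers: 10 promoters in the background, 3 with a site for some
# factor, 2 response genes drawn, both of which carry the site.
p = hypergeom_upper_tail(10, 3, 2, 2)
p_adj = bonferroni(p, 5)  # assuming 5 transcription factors were tested
```

A factor would then be reported as over-represented when its adjusted p-value falls below the chosen significance level.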
Affiliation(s)
- Rongxiang Liu
- Bioinformatics Division, GlaxoSmithKline, UW2230, 709 Swedeland Road, King of Prussia, Pennsylvania 19406, USA.
|
37
|
Bae K, Mallick BK, Elsik CG. Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics 2005; 21:2264-70. [PMID: 15746283 DOI: 10.1093/bioinformatics/bti363] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Our aim was to predict protein interdomain linker regions using sequence alone, without requiring known homology. Identifying linker regions will delineate domain boundaries, and can be used to computationally dissect proteins into domains prior to clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo, Gibbs sampling in particular, to simulate parameters from the posteriors. Our model recognizes sequence data to be continuous rather than categorical, and generates a probabilistic output. RESULTS We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to a simpler method that also uses linker index.
Affiliation(s)
- Kyounghwa Bae
- Department of Statistics, Texas A&M University College Station, TX 77843-3143, USA
|
38
|
Bi C, Rogan PK. Bipartite pattern discovery by entropy minimization-based multiple local alignment. Nucleic Acids Res 2004; 32:4979-91. [PMID: 15388800 PMCID: PMC521645 DOI: 10.1093/nar/gkh825] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2004] [Revised: 08/11/2004] [Accepted: 08/26/2004] [Indexed: 11/14/2022] Open
Abstract
Many multimeric transcription factors recognize DNA sequence patterns by cooperatively binding to bipartite elements composed of half sites separated by a flexible spacer. We developed a novel bipartite algorithm, bipartite pattern discovery (Bipad), which produces a mathematical model based on the information maximization (equivalently, Shannon entropy minimization) principle for discovery of bipartite sequence patterns. Bipad is a C++ program that applies greedy methods to search the bipartite alignment space and examines the upstream or downstream regions of co-regulated genes, looking for cis-regulatory bipartite patterns. An input sequence file with zero or one site per locus is required, and the left and right motif widths and a range of possible gap lengths must be specified. Bipad can run in either single-block or bipartite pattern search modes, and it is capable of comprehensively searching all four orientations of half-site patterns. Simulation studies showed that the accuracy of this motif discovery algorithm depends on sample size and motif conservation level, but results were independent of background composition. Bipad performed as well as or better than other pattern search algorithms in correctly identifying Escherichia coli cyclic AMP receptor protein and Bacillus subtilis sigma factor binding site sequences based on experimentally defined benchmarks. Finally, a new bipartite information weight matrix for vitamin D3 receptor/retinoid X receptor alpha (VDR/RXRalpha) binding sites was derived that comprehensively models the natural variability inherent in these sequence elements.
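The objective named above, maximising information (equivalently, minimising Shannon entropy) over an alignment, can be illustrated with the sketch below. It is not Bipad itself: it assumes a uniform four-letter background, applies no small-sample correction, and only scores a given alignment of two half-site blocks rather than searching for one.

```python
from collections import Counter
from math import log2


def column_information(column: str) -> float:
    """Information in bits for one alignment column: 2 - H, where H is the
    Shannon entropy of the observed base frequencies."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return 2.0 - entropy


def block_information(sites: list[str]) -> float:
    """Total information of an ungapped block of equal-length aligned sites."""
    return sum(column_information("".join(col)) for col in zip(*sites))


def bipartite_information(left_sites: list[str], right_sites: list[str]) -> float:
    """Score of a bipartite alignment: sum over the left and right half-site blocks."""
    return block_information(left_sites) + block_information(right_sites)


score = bipartite_information(["TGTGA", "TGTGA"], ["TCACA", "TCACA"])
```

A search procedure would compare this score across candidate placements and gap lengths and keep the alignment with the highest value.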
Affiliation(s)
- Chengpeng Bi
- Laboratory of Human Molecular Genetics, Children's Mercy Hospital & Clinics, 2401 Gillham Road, Kansas City, MO 64108, USA
|
39
|
Kechris KJ, van Zwet E, Bickel PJ, Eisen MB. Detecting DNA regulatory motifs by incorporating positional trends in information content. Genome Biol 2004; 5:R50. [PMID: 15239835 PMCID: PMC463320 DOI: 10.1186/gb-2004-5-7-r50] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2004] [Revised: 05/04/2004] [Accepted: 05/04/2004] [Indexed: 11/10/2022] Open
Abstract
On the basis of the observation that conserved positions in transcription factor binding sites are often clustered together, we propose a simple extension to the model-based motif discovery methods. We assign position-specific prior distributions to the frequency parameters of the model, penalizing deviations from a specified conservation profile. Examples with both simulated and real data show that this extension helps discover motifs as the data become noisier or when there is a competing false motif.
Affiliation(s)
- Katherina J Kechris
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Current address: Department of Biochemistry and Biophysics, 600 16th Street 2240, University of California, San Francisco, CA 94143, USA
- Erik van Zwet
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Current address: Mathematical Institute, University Leiden, 2300 RA Leiden, The Netherlands
- Peter J Bickel
- Department of Statistics, University of California, Berkeley, CA 94720, USA
- Michael B Eisen
- Department of Genome Sciences, Life Sciences Division, Ernest Orlando Lawrence Berkeley National Lab, Cyclotron Road, Berkeley, CA 94720, USA
- Center for Integrative Genomics, Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
|
40
|
Oakley BA, Hanna DM. A Review of Nanobioscience and Bioinformatics Initiatives in North America. IEEE Trans Nanobioscience 2004; 3:74-84. [PMID: 15382648 DOI: 10.1109/tnb.2003.820259] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Affiliation(s)
- Barbara A Oakley
- School of Engineering and Computer Science, Oakland University, Rochester, MI 48309, USA.
|
41
|
Jensen ST, Liu XS, Zhou Q, Liu JS. Computational Discovery of Gene Regulatory Binding Motifs: A Bayesian Perspective. Stat Sci 2004. [DOI: 10.1214/088342304000000107] [Citation(s) in RCA: 55] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
42
|
Rombauts S, Florquin K, Lescot M, Marchal K, Rouzé P, van de Peer Y. Computational approaches to identify promoters and cis-regulatory elements in plant genomes. Plant Physiol 2003; 132:1162-76. [PMID: 12857799 PMCID: PMC167057 DOI: 10.1104/pp.102.017715] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/14/2002] [Revised: 01/10/2003] [Accepted: 03/17/2003] [Indexed: 05/19/2023]
Abstract
The identification of promoters and their regulatory elements is one of the major challenges in bioinformatics and integrates comparative, structural, and functional genomics. Many different approaches have been developed to detect conserved motifs in a set of genes that are either coregulated or orthologous. However, although recent approaches seem promising, in general, unambiguous identification of regulatory elements is not straightforward. The delineation of promoters is even harder, due to its complex nature, and in silico promoter prediction is still in its infancy. Here, we review the different approaches that have been developed for identifying promoters and their regulatory elements. We discuss the detection of cis-acting regulatory elements using word-counting or probabilistic methods (so-called "search by signal" methods) and the delineation of promoters by considering both sequence content and structural features ("search by content" methods). As an example of search by content, we explored in greater detail the association of promoters with CpG islands. However, due to differences in sequence content, the parameters used to detect CpG islands in humans and other vertebrates cannot be used for plants. Therefore, a preliminary attempt was made to define parameters that could possibly define CpG and CpNpG islands in Arabidopsis, by exploring the compositional landscape around the transcriptional start site. To this end, a data set of more than 5,000 gene sequences was built, including the promoter region, the 5'-untranslated region, and the first introns and coding exons. Preliminary analysis shows that promoter location based on the detection of potential CpG/CpNpG islands in the Arabidopsis genome is not straightforward. 
Nevertheless, because the landscape of CpG/CpNpG islands differs considerably between promoters and introns on the one side and exons (whether coding or not) on the other, more sophisticated approaches can probably be developed for the successful detection of "putative" CpG and CpNpG islands in plants.
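For readers unfamiliar with the kind of parameters involved in CpG island detection, the sketch below computes two commonly used window statistics: GC content and the observed/expected CpG ratio. The function name and the choice of statistics are assumptions for illustration; the abstract does not list the parameters the authors explored, and no thresholds are implied here.

```python
def cpg_window_stats(window: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for one sequence window.

    The expected CpG count assumes C and G occur independently:
    expected = (#C * #G) / window length.
    """
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    cpg = w.count("CG")
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


gc, ratio = cpg_window_stats("CGCGCG")
```

In practice these statistics are computed over sliding windows, and the cut-offs have to be chosen per genome, which is the difficulty the abstract points to for plants.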
Affiliation(s)
- Stephane Rombauts
- Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology, Ghent University, B-9000 Gent, Belgium
|
43
|
Petersen L, Larsen TS, Ussery DW, On SLW, Krogh A. RpoD promoters in Campylobacter jejuni exhibit a strong periodic signal instead of a -35 box. J Mol Biol 2003; 326:1361-72. [PMID: 12595250 DOI: 10.1016/s0022-2836(03)00034-2] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
We have used a hidden Markov model (HMM) to identify the consensus sequence of the RpoD promoters in the genome of Campylobacter jejuni. The identified promoter consensus sequence is unusual compared to other bacteria, in that the region upstream of the TATA-box does not contain a conserved -35 region, but shows a very strong periodic variation in the AT-content and semi-conserved T-stretches, with a period of 10-11 nucleotides. The TATA-box is in some, but not all, cases preceded by a TGx, similar to an extended -10 promoter. We predicted a total of 764 presumed RpoD promoters in the C. jejuni genome, of which 654 were located upstream of annotated genes. A similar promoter was identified in Helicobacter pylori, a close phylogenetic relative of Campylobacter, but not in Escherichia coli, Vibrio cholerae, or six other Proteobacterial genomes, or in Staphylococcus aureus. We used upstream regions of high confidence genes as training data (n=529, for the C. jejuni genome). We found it necessary to limit the training set to genes that are preceded by an intergenic region of >100 bp or by a gene oriented in the opposite direction to be able to identify a conserved sequence motif, and ended up with a training set of 175 genes. This leads to the conclusion that the remaining genes (354) are more rarely preceded by a (RpoD) promoter, and consequently that operon structure may be more widespread in C. jejuni than has been assumed by others. Structural predictions of the regions upstream of the TATA-box indicate a region of highly curved DNA, and we assume that this facilitates the wrapping of the DNA around the RNA polymerase holoenzyme, and offsets the absence of a conserved -35 binding motif.
Affiliation(s)
- Lise Petersen
- Center for Biological Sequence Analysis, Technical University of Denmark, DK-2800 Lyngby, Denmark.
|
44
|
Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV. PlantProm: a database of plant promoter sequences. Nucleic Acids Res 2003; 31:114-7. [PMID: 12519961 PMCID: PMC165488 DOI: 10.1093/nar/gkg041] [Citation(s) in RCA: 177] [Impact Index Per Article: 8.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
PlantProm DB, a plant promoter database, is an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species. The first release (2002.01) of PlantProm DB contains 305 entries including 71, 220 and 14 promoters from monocot, dicot and other plants, respectively. It provides DNA sequence of the promoter regions (-200 : +51) with TSS on the fixed position +201, taxonomic/promoter type classification of promoters and Nucleotide Frequency Matrices (NFM) for promoter elements: TATA-box, CCAAT-box and TSS-motif (Inr). Analysis of TSS-motifs revealed that their composition is different in dicots and monocots, as well as for TATA and TATA-less promoters. The database serves as a learning set in developing plant promoter prediction programs. One such program (TSSP) based on discriminant analysis has been created by Softberry Inc., and the application of a support vector machine approach for promoter identification is under development. PlantProm DB is available at http://mendel.cs.rhul.ac.uk/ and http://www.softberry.com/.
Affiliation(s)
- Ilham A Shahmuradov
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK
|
45
|
Abstract
The DNA motif discovery problem abstracts the task of discovering short, conserved sites in genomic DNA. Pevzner and Sze recently described a precise combinatorial formulation of motif discovery that motivates the following algorithmic challenge: find twenty planted occurrences of a motif of length fifteen in roughly twelve kilobases of genomic sequence, where each occurrence of the motif differs from its consensus in four randomly chosen positions. Such "subtle" motifs, though statistically highly significant, expose a weakness in existing motif-finding algorithms, which typically fail to discover them. Pevzner and Sze introduced new algorithms to solve their (15,4)-motif challenge, but these methods do not scale efficiently to more difficult problems in the same family, such as the (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif-discovery algorithm, PROJECTION, designed to enhance the performance of existing motif finders using random projections of the input's substrings. Experiments on synthetic data demonstrate that PROJECTION remedies the weakness observed in existing algorithms, typically solving the difficult (14,4)-, (16,5)-, and (18,6)-motif problems. Our algorithm is robust to nonuniform background sequence distributions and scales to larger amounts of sequence than that specified in the original challenge. A probabilistic estimate suggests that related motif-finding problems that PROJECTION fails to solve are in all likelihood inherently intractable. We also test the performance of our algorithm on realistic biological examples, including transcription factor binding sites in eukaryotes and ribosome binding sites in prokaryotes.
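The core idea of projecting substrings can be sketched as follows: pick k of the l motif positions at random, hash every l-mer by the letters at those positions, and look for buckets that collect unusually many l-mers. This is only a simplified illustration of the bucketing step under assumed names and parameters; it omits the refinement of enriched buckets by an existing motif finder that the abstract refers to.

```python
import random
from collections import defaultdict


def project_lmers(seqs: list[str], l: int, k: int, rng: random.Random):
    """Bucket every l-mer by its letters at k randomly chosen positions.

    Returns the chosen positions and a dict mapping each projected key to the
    (sequence index, start offset) pairs of the l-mers that hash to it.
    """
    positions = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(seqs):
        for start in range(len(seq) - l + 1):
            lmer = seq[start:start + l]
            key = "".join(lmer[p] for p in positions)
            buckets[key].append((seq_index, start))
    return positions, buckets


positions, buckets = project_lmers(["ACGTACGT"], l=4, k=2, rng=random.Random(0))
largest = max(buckets.values(), key=len)  # candidate bucket for refinement
```

Occurrences of a planted motif whose mutations all fall outside the sampled positions land in the same bucket, so repeating the projection with fresh random positions raises the chance that some bucket is enriched for the motif.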
Affiliation(s)
- Jeremy Buhler
- Department of Computer Science, Box 1045, Washington University, One Brookings Drive, St. Louis, MO 63130, USA.
|
46
|
Marsan L, Sagot MF. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J Comput Biol 2001; 7:345-62. [PMID: 11108467 DOI: 10.1089/106652700750050826] [Citation(s) in RCA: 171] [Impact Index Per Article: 7.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes (that is, the motifs themselves) are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N²n, where n is the average length of the sequences and N their number. An application to the identification of promoter and regulatory consensus sequences in bacterial genomes is shown.
Affiliation(s)
- L Marsan
- Institut Gaspard Monge, Université de Marne la Vallée 5
|
47
|
Ehret GB, Reichenbach P, Schindler U, Horvath CM, Fritz S, Nabholz M, Bucher P. DNA binding specificity of different STAT proteins. Comparison of in vitro specificity with natural target sites. J Biol Chem 2001; 276:6675-88. [PMID: 11053426 DOI: 10.1074/jbc.m001748200] [Citation(s) in RCA: 301] [Impact Index Per Article: 13.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/27/2022] Open
Abstract
STAT transcription factors are expressed in many cell types and bind to similar sequences. However, different STAT gene knock-outs show very distinct phenotypes. To determine whether differences between the binding specificities of STAT proteins account for these effects, we compared the sequences bound by STAT1, STAT5A, STAT5B, and STAT6. One sequence set was selected from random oligonucleotides by recombinant STAT1, STAT5A, or STAT6. For another set including many weak binding sites, we quantified the relative affinities to STAT1, STAT5A, STAT5B, and STAT6. We compared the results to the binding sites in natural STAT target genes identified by others. The experiments confirmed the similar specificity of different STAT proteins. Detailed analysis indicated that STAT5A specificity is more similar to that of STAT6 than that of STAT1, as expected from the evolutionary relationships. The preference of STAT6 for sites in which the half-palindromes (TTC) are separated by four nucleotides (N(4)) was confirmed, but analysis of weak binding sites showed that STAT6 binds fairly well to N(3) sites. As previously reported, STAT1 and STAT5 prefer N(3) sites; however, STAT5A, but not STAT1, weakly binds N(4) sites. None of the STATs bound to half-palindromes. There were no specificity differences between STAT5A and STAT5B.
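The N(3) and N(4) site classes discussed above can be located in a sequence with a simple pattern search. The sketch below assumes the palindromic form TTC, a spacer of the given length, then GAA, and scans one strand only; the function name is hypothetical and the search says nothing about binding affinity.

```python
import re


def find_stat_sites(seq: str, spacer: int) -> list[tuple[int, str]]:
    """Find TTC-N(spacer)-GAA sites, including overlapping ones.

    Returns (start offset, matched site) pairs on the given strand.
    """
    # A lookahead is used so that overlapping sites are all reported.
    pattern = re.compile(rf"(?=(TTC[ACGT]{{{spacer}}}GAA))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]


n3_sites = find_stat_sites("AATTCCTGGAATT", spacer=3)
n4_sites = find_stat_sites("AATTCCTGGAATT", spacer=4)
```

Because the full site is palindromic in its half-sites, scanning one strand finds the same TTC...GAA positions that a reverse-complement scan would.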
Affiliation(s)
- G B Ehret
- Swiss Institute for Experimental Cancer Research (ISREC) 1066 Epalinges, Switzerland.
|
48
|
McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 2001; 29:774-82. [PMID: 11160901 PMCID: PMC30389 DOI: 10.1093/nar/29.3.774] [Citation(s) in RCA: 198] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
Toward the goal of identifying complete sets of transcription factor (TF)-binding sites in the genomes of several gamma proteobacteria, and hence describing their transcription regulatory networks, we present a phylogenetic footprinting method for identifying these sites. Probable transcription regulatory sites upstream of Escherichia coli genes were identified by cross-species comparison using an extended Gibbs sampling algorithm. Close examination of a study set of 184 genes with documented transcription regulatory sites revealed that when orthologous data were available from at least two other gamma proteobacterial species, 81% of our predictions corresponded with the documented sites, and 67% corresponded when data from only one other species were available. That the remaining predictions included bona fide TF-binding sites was proven by affinity purification of a putative transcription factor (YijC) bound to such a site upstream of the fabA gene. Predicted regulatory sites for 2097 E. coli genes are available at http://www.wadsworth.org/resnres/bioinfo/.
Affiliation(s)
- L McCue
- The Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany, NY 12201, USA
|
49
|
Qicheng Ma, Wang J, Shasha D, Wu C. DNA sequence classification via an expectation maximization algorithm and neural networks: a case study. ACTA ACUST UNITED AC 2001. [DOI: 10.1109/5326.983930] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
50
|
van Helden J, Rios AF, Collado-Vides J. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res 2000; 28:1808-18. [PMID: 10734201 PMCID: PMC102821 DOI: 10.1093/nar/28.8.1808] [Citation(s) in RCA: 214] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The application of microarray and related technologies is currently generating a systematic catalog of the transcriptional response of any single gene to a multiplicity of experimental conditions. Clustering genes according to the similarity of their transcriptional response provides a direct hint to the regulons of the different transcription factors, many of which have still not been characterized. We have developed a new method for deciphering the mechanism underlying the common transcriptional response of a set of genes, i.e. discovering cis-acting regulatory elements from a set of unaligned upstream sequences. This method, called dyad analysis, is based on the observation that many regulatory sites consist of a pair of highly conserved trinucleotides, spaced by a non-conserved region of fixed width. The approach is to count the number of occurrences of each possible spaced pair of trinucleotides, and to assess its statistical significance. The method is highly efficient in the detection of sites bound by C(6)Zn(2) binuclear cluster proteins, as well as other transcription factors. In addition, we show that the dyad and single-word analyses are efficient for the detection of regulatory patterns in gene clusters from DNA chip experiments. In combination, these programs should provide a fast and efficient way to discover new regulatory sites for as yet unknown transcription factors.
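The counting step of dyad analysis described above can be sketched in a few lines: for a fixed spacing, slide a window along each sequence and tally every (left trinucleotide, right trinucleotide) pair. The function name is hypothetical, only one strand is counted, and the significance assessment against a background model, which the method also requires, is not shown.

```python
from collections import Counter


def count_dyads(seqs: list[str], spacing: int) -> Counter:
    """Count spaced trinucleotide pairs (dyads) with a fixed spacer width.

    A dyad is a pair (left, right) of 3-mers separated by exactly `spacing`
    unconstrained positions.
    """
    counts: Counter = Counter()
    width = 3 + spacing + 3
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - width + 1):
            left = seq[i:i + 3]
            right = seq[i + 3 + spacing:i + width]
            counts[(left, right)] += 1
    return counts


dyads = count_dyads(["CGGAAAAACCG", "TTCGGTTTTTCCGAA"], spacing=5)
top_dyad, top_count = dyads.most_common(1)[0]
```

Running the count over a range of spacings and comparing each observed count with its expected count under a background model would complete the analysis outlined in the abstract.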
Affiliation(s)
- J van Helden
- Unité de Conformation des Macromolécules Biologiques, Université Libre de Bruxelles, CP 160/16, 50 av. F. D. Roosevelt, B-1050 Bruxelles, Belgium.
|