1
Qiu S, Wan X, Liang Y, Lamoureux CR, Akbari A, Palsson BO, Zielinski DC. Inferred regulons are consistent with regulator binding sequences in E. coli. PLoS Comput Biol 2024; 20:e1011824. [PMID: 38252668 PMCID: PMC10833566 DOI: 10.1371/journal.pcbi.1011824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Revised: 02/01/2024] [Accepted: 01/12/2024] [Indexed: 01/24/2024]
Abstract
The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Regulons are typically determined either from resource-intensive experimental measurement of functional binding sites, or inferred from analysis of high-throughput gene expression datasets. Recently, independent component analysis (ICA) of RNA-seq compendia has been shown to be a powerful method for inferring bacterial regulons. However, it remains unclear to what extent regulons predicted by ICA structure have a biochemical basis in promoter sequences. Here, we address this question by developing machine learning models that predict inferred regulon structures in E. coli based on promoter sequence features. Models were constructed successfully (cross-validation AUROC ≥ 0.8) for 85% (40/47) of ICA-inferred E. coli regulons. We found that: 1) the presence of a high-scoring regulator motif in the promoter region was sufficient to specify regulatory activity in 40% (19/47) of the regulons; 2) additional features, such as DNA shape and extended motifs that can account for regulator multimeric binding, helped to specify regulon structure for the remaining 60% of regulons (28/47); 3) investigating regulons where initial machine learning models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, we found that strong regulatory binding sequences underlie both the genes shared between ICA-inferred and experimental regulons and the genes in the E. coli core pan-regulon of Fur. This work demonstrates that the structure of ICA-inferred regulons can largely be understood through the strength of regulator binding sites in promoter regions, reinforcing the utility of top-down inference for regulon discovery.
Affiliation(s)
- Sizhe Qiu
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
- Xinlong Wan
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
- Yueshan Liang
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
- Cameron R. Lamoureux
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
- Amir Akbari
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
- Bernhard O. Palsson
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Lyngby, Denmark
- Daniel C. Zielinski
  - Department of Bioengineering, University of California San Diego, La Jolla, CA, United States of America
2
Li F, Leier A, Liu Q, Wang Y, Xiang D, Akutsu T, Webb GI, Smith AI, Marquez-Lago T, Li J, Song J. Procleave: Predicting Protease-specific Substrate Cleavage Sites by Combining Sequence and Structural Information. Genomics, Proteomics & Bioinformatics 2020; 18:52-64. [PMID: 32413515 PMCID: PMC7393547 DOI: 10.1016/j.gpb.2019.08.002] [Citation(s) in RCA: 64] [Impact Index Per Article: 12.8] [Received: 04/30/2019] [Revised: 08/08/2019] [Accepted: 10/23/2019] [Indexed: 10/29/2022]
Abstract
Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
Affiliation(s)
- Fuyi Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Andre Leier
  - School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35233, USA
- Quanzhong Liu
  - College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Yanan Wang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- Dongxu Xiang
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan
- Geoffrey I Webb
  - Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- A Ian Smith
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
- Tatiana Marquez-Lago
  - School of Medicine, University of Alabama at Birmingham, Birmingham, AL 35233, USA
- Jian Li
  - Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC 3800, Australia
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia; Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia; ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
3
In silico based screening of WRKY genes for identifying functional genes regulated by WRKY under salt stress. Comput Biol Chem 2019; 83:107131. [DOI: 10.1016/j.compbiolchem.2019.107131] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/22/2019] [Revised: 08/18/2019] [Accepted: 09/18/2019] [Indexed: 11/21/2022]
4
Identification of Intrinsically Disordered Proteins and Regions by Length-Dependent Predictors Based on Conditional Random Fields. Molecular Therapy-Nucleic Acids 2019; 17:396-404. [PMID: 31307006 PMCID: PMC6626971 DOI: 10.1016/j.omtn.2019.06.004] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 02/25/2019] [Revised: 06/06/2019] [Accepted: 06/07/2019] [Indexed: 01/24/2023]
Abstract
Accurate identification of intrinsically disordered proteins/regions (IDPs/IDRs) is critical for predicting protein structure and function. Previous studies have shown that IDRs of different lengths have different characteristics, and several classification-based predictors have been proposed for predicting different types of IDRs. Compared with these classification-based predictors, the previously proposed predictor IDP-CRF, a sequence labeling model based on conditional random fields (CRFs), exhibits state-of-the-art performance for predicting IDPs/IDRs. Motivated by these methods, we propose a predictor called IDP-FSP, which is an ensemble of three CRF-based predictors called IDP-FSP-L, IDP-FSP-S, and IDP-FSP-G. These three predictors are specially designed to predict long, short, and generic disordered regions, respectively, and they are constructed based on different features. To the best of our knowledge, IDP-FSP is the first predictor that combines a sequence labeling algorithm with IDRs of different lengths. Experimental results using two independent test datasets show that IDP-FSP achieves predictive performance better than or at least comparable to that of 26 existing state-of-the-art methods in this field, demonstrating the effectiveness of IDP-FSP.
5
Yella VR, Bhimsaria D, Ghoshdastidar D, Rodríguez-Martínez J, Ansari AZ, Bansal M. Flexibility and structure of flanking DNA impact transcription factor affinity for its core motif. Nucleic Acids Res 2018; 46:11883-11897. [PMID: 30395339 PMCID: PMC6294565 DOI: 10.1093/nar/gky1057] [Citation(s) in RCA: 52] [Impact Index Per Article: 7.4] [Received: 04/12/2018] [Revised: 10/11/2018] [Accepted: 10/17/2018] [Indexed: 01/13/2023]
Abstract
Spatial and temporal expression of genes is essential for maintaining phenotype integrity. Transcription factors (TFs) modulate expression patterns by binding to specific DNA sequences in the genome. Along with the core binding motif, the flanking sequence context can play a role in DNA-TF recognition. Here, we employ high-throughput in vitro and in silico analyses to understand the influence of sequences flanking the cognate sites on binding by the three most prevalent eukaryotic TF families (zinc finger, homeodomain and bZIP). In vitro binding preferences of each TF toward the entire DNA sequence space were correlated with a wide range of DNA structural parameters, including DNA flexibility. Results demonstrate that conformational plasticity of flanking regions modulates binding affinity of certain TF families. DNA duplex stability and minor groove width also play an important role in DNA-TF recognition but differ in how exactly they influence the binding in each specific case. Our analyses further reveal that the structural features of preferred flanking sequences are not universal, as similar DNA-binding folds can employ distinct DNA recognition modes.
Affiliation(s)
- Venkata Rajesh Yella
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
  - Department of Biotechnology, Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur, Andhra Pradesh 522502, India
- Devesh Bhimsaria
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- José A Rodríguez-Martínez
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Biology, University of Puerto Rico-Rio Piedras, San Juan, PR 00925, USA
- Aseem Z Ansari
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - The Genome Center of Wisconsin, Madison, WI 53706, USA
- Manju Bansal
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore 560012, India
6
Liu Y, Wang X, Liu B. IDP-CRF: Intrinsically Disordered Protein/Region Identification Based on Conditional Random Fields. Int J Mol Sci 2018; 19:E2483. [PMID: 30135358 PMCID: PMC6164615 DOI: 10.3390/ijms19092483] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 07/21/2018] [Revised: 08/14/2018] [Accepted: 08/18/2018] [Indexed: 12/16/2022]
Abstract
Accurate prediction of intrinsically disordered proteins/regions is one of the most important tasks in bioinformatics, and some computational predictors have been proposed to solve this problem. How to efficiently incorporate the sequence-order effect is critical for constructing an accurate predictor because disordered region distributions show global sequence patterns. In order to capture these sequence patterns, several sequence labelling models have been applied to this field, such as conditional random fields (CRFs). However, these methods suffer from certain disadvantages. In this study, we proposed a new computational predictor called IDP-CRF, which is trained on an updated benchmark dataset based on the MobiDB database and the DisProt database, and incorporates more comprehensive sequence-based features, including PSSMs (position-specific scoring matrices), kmer, predicted secondary structures, and relative solvent accessibilities. Experimental results on the benchmark dataset and two independent datasets show that IDP-CRF outperforms 25 existing state-of-the-art methods in this field, demonstrating that IDP-CRF is a very useful tool for identifying IDPs/IDRs (intrinsically disordered proteins/regions). We anticipate that IDP-CRF will facilitate the development of protein sequence analysis.
Affiliation(s)
- Yumeng Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China
- Xiaolong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China
- Bin Liu
  - School of Computer Science and Technology, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen 518055, Guangdong, China
7
Khamis AM, Motwalli O, Oliva R, Jankovic BR, Medvedeva YA, Ashoor H, Essack M, Gao X, Bajic VB. A novel method for improved accuracy of transcription factor binding site prediction. Nucleic Acids Res 2018; 46:e72. [PMID: 29617876 PMCID: PMC6037060 DOI: 10.1093/nar/gky237] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Received: 06/19/2017] [Revised: 03/01/2018] [Accepted: 03/20/2018] [Indexed: 12/12/2022]
Abstract
Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to the large number of TFs involved. To overcome these problems we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs, while at the same time significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and in stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.
Affiliation(s)
- Abdullah M Khamis
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Olaa Motwalli
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Romina Oliva
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
  - Department of Sciences and Technologies, University ‘Parthenope’ of Naples, Centro Direzionale Isola C4 80143, Naples, Italy
- Boris R Jankovic
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Yulia A Medvedeva
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
  - Institute of Bioengineering, Research Centre of Biotechnology, Russian Academy of Science, 117312 Moscow, Russia
  - Department of Computational Biology, Vavilov Institute of General Genetics, Russian Academy of Science, 119991 Moscow, Russia
  - Department of Biological and Medical Physics, Moscow Institute of Physics and Technology, 141701, Dolgoprudny, Moscow Region, Russia
- Haitham Ashoor
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Magbubah Essack
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Xin Gao
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
- Vladimir B Bajic
  - King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955–6900, Saudi Arabia
8
Käppel S, Melzer R, Rümpler F, Gafert C, Theißen G. The floral homeotic protein SEPALLATA3 recognizes target DNA sequences by shape readout involving a conserved arginine residue in the MADS-domain. The Plant Journal: for Cell and Molecular Biology 2018; 95:341-357. [PMID: 29744943 DOI: 10.1111/tpj.13954] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 02/05/2018] [Revised: 04/17/2018] [Accepted: 04/23/2018] [Indexed: 05/05/2023]
Abstract
SEPALLATA3 of Arabidopsis thaliana is a MADS-domain transcription factor (TF) and a key regulator of flower development. MADS-domain proteins bind to sequences termed 'CArG-boxes' [consensus 5'-CC(A/T)₆GG-3']. Because only a fraction of the CArG-boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG-boxes into genuine recognition sites. Here, we investigate to what extent the shape of the DNA is involved in a 'shape readout' that contributes to the binding of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to DNA probes that all contain the CArG-box motif, but differ in their predicted DNA shape. We found that binding affinity correlates well with a narrow minor groove of the DNA. Substitution of canonical bases with non-standard bases supports the hypothesis of minor groove shape readout by SEPALLATA3. Analysis of mutant SEPALLATA3 proteins further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis-regulatory elements by a plant MADS-domain TF, and by inference probably also of other TFs of this type, heavily depends on shape readout mechanisms.
Affiliation(s)
- Sandra Käppel
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Rainer Melzer
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
  - School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Florian Rümpler
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Christian Gafert
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Günter Theißen
  - Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
9
Ryasik A, Orlov M, Zykova E, Ermak T, Sorokin A. Bacterial promoter prediction: Selection of dynamic and static physical properties of DNA for reliable sequence classification. J Bioinform Comput Biol 2018; 16:1840003. [DOI: 10.1142/s0219720018400036] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
Abstract
Predicting the promoter activity of a DNA fragment is an important task in computational biology. Approaches using physical properties of DNA to predict bacterial promoters have recently gained a lot of attention. To select an adequate set of physical properties for training a classifier, various characteristics of the DNA molecule should be taken into consideration. Here, we present a systematic approach that allows us to select less correlated properties for classification by means of both correlation and cophenetic coefficients as well as concordance matrices. To prove this concept, we have developed the first classifier that uses not only sequence and static physical properties of a DNA fragment, but also dynamic properties of DNA open states. As a result, the best performing models, with accuracy values up to 90% for all types of sequences, were obtained. Furthermore, we have demonstrated that the classifier can serve as a reliable tool enabling promoter DNA fragments to be distinguished from promoter islands despite the similarity of their nucleotide sequences.
Affiliation(s)
- Artem Ryasik
  - Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
- Mikhail Orlov
  - Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
- Evgenia Zykova
  - Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
  - Department of Applied Research Informatization, State Institute of Information Technologies and Telecommunications (SIIT&T Informika), per. Brusov 21 st.2, Moscow, 125009, Russia
- Timofei Ermak
  - Laboratory of Molecular Genetics Systems, Institute of Cytology and Genetics, pr. Akademika Lavrentyeva 10, Novosibirsk 630090, Russia
- Anatoly Sorokin
  - Mechanism of Cell Genome Functioning Laboratory, Institute of Cell Biophysics, ul. Institutskaya 3, Pushchino 142290, Russia
10
Sun S, Zhang X, Peng Q. A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli. Artif Intell Med 2017; 75:16-23. [PMID: 28363453 DOI: 10.1016/j.artmed.2016.11.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/07/2016] [Accepted: 11/23/2016] [Indexed: 11/29/2022]
Abstract
BACKGROUND: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding of transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-sample ratio is large (i.e., number of dimensions/number of samples), concatenating different physicochemical properties into a one-dimensional vector not only is likely to lose some structural information, but also poses significant challenges to recognition methods. MATERIALS AND METHOD: In this paper, we introduce a purely geometric representation method, tensor (also called multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from the RegulonDB database. RESULTS: In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly a 33% improvement over the best result from existing methods. CONCLUSION: The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on real data through a series of experiments. This method can yield further insights into the mechanism of TF binding and be of great use for metabolic engineering applications.
Affiliation(s)
- Shiquan Sun
  - Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China; Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA
- Xiongpan Zhang
  - Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China
- Qinke Peng
  - Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China
11
Passagem-Santos D, Bonnet M, Sobral D, Trancoso I, Silva JG, Barreto VM, Athanasiadis A, Demengeot J, Pereira-Leal JB. RAG Recombinase as a Selective Pressure for Genome Evolution. Genome Biol Evol 2016; 8:3364-3376. [PMID: 27979968 PMCID: PMC5203794 DOI: 10.1093/gbe/evw261] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/15/2022]
Abstract
The RAG recombinase is a domesticated transposable element co-opted in jawed vertebrates to drive the process of so-called V(D)J recombination, which is the hallmark of the adaptive immune system and produces antigen receptors. RAG targets, namely the Recombination Signal Sequences (RSS), are rather long and degenerate sequences, which highlights the ability of the recombinase to interact with a wide range of target sequences, including those outside of antigen receptor loci. The recognition of such cryptic targets by the recombinase threatens genome integrity by promoting aberrant DNA recombination, as observed in lymphoid malignancies. Genome evolution resulting from RAG acquisition is a subject of ongoing discussion, in particular regarding the counter-selection of sequences resembling the RSS and the modifications of epigenetic regulation at these potential cryptic sites. Here, we describe a new bioinformatics tool to map potential RAG targets in all jawed vertebrates. We show that our REcombination Classifier (REC) outperforms the currently available tool and is suitable for full-genome scans of species other than human and mouse. Using the REC, we document a reduction in the density of potential RAG targets at the transcription start sites of genes co-expressed with the rag genes and marked with high levels of trimethylation of lysine 4 of histone 3 (H3K4me3), which correlates with the retention of functional RAG activity after the horizontal transfer.
Affiliation(s)
- M Bonnet
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
- D Sobral
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
- I Trancoso
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
- J G Silva
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
- V M Barreto
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
- J Demengeot
  - Instituto Gulbenkian de Ciência, Oeiras, Portugal
12
In silico identification of enhancers on the basis of a combination of transcription factor binding motif occurrences. Sci Rep 2016; 6:32476. [PMID: 27582178 PMCID: PMC5007594 DOI: 10.1038/srep32476] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 03/31/2016] [Accepted: 08/08/2016] [Indexed: 01/06/2023]
Abstract
Enhancers interact with gene promoters and form chromatin looping structures that serve important functions in various biological processes, such as the regulation of gene transcription and cell differentiation. However, enhancers are difficult to identify because they generally do not have fixed positions or consensus sequence features, and biological experiments for enhancer identification are costly in terms of labor and expense. In this work, several models were built by using various sequence-based feature sets and their combinations for enhancer prediction. The selected features derived from a recursive feature elimination method showed that the model using a combination of 141 transcription factor binding motif occurrences from 1,422 transcription factor position weight matrices achieved a favorably high prediction accuracy superior to that of other reported methods. The models demonstrated good prediction accuracy for different enhancer datasets obtained from different cell lines/tissues. In addition, prediction accuracy was further improved by integration of chromatin state features. Our method is complementary to wet-lab experimental methods and provides an additional method to identify enhancers.
|
13
|
Qin W, Zhao G, Carson M, Jia C, Lu H. Knowledge-based three-body potential for transcription factor binding site prediction. IET Syst Biol 2016; 10:23-9. [PMID: 26816396 DOI: 10.1049/iet-syb.2014.0066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/19/2022] Open
Abstract
A structure-based statistical potential is developed for transcription factor binding site (TFBS) prediction. Besides the direct contact between amino acids from TFs and DNA bases, the authors also considered the influence of the neighbouring base. This three-body potential showed better discriminative power than the two-body potential. They validate the performance of the potential in TFBS identification, binding energy prediction and binding mutation prediction.
Affiliation(s)
- Wenyi Qin
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
| | - Guijun Zhao
- Key Laboratory of Molecular Embryology, Ministry of Health & Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, People's Republic of China
| | - Matthew Carson
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
| | - Caiyan Jia
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, People's Republic of China
| | - Hui Lu
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA.
|
14
|
Zrimec J, Lapanje A. Fast Prediction of DNA Melting Bubbles Using DNA Thermodynamic Stability. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:1137-1145. [PMID: 26451825 DOI: 10.1109/tcbb.2015.2396057] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 06/05/2023]
Abstract
DNA melting bubbles are the basis of many DNA-protein interactions, such as those in regulatory DNA regions driving gene expression, DNA replication and bacterial horizontal gene transfer. Bubble formation is affected by DNA duplex stability and thermally induced duplex destabilization (TIDD). Although prediction of duplex stability with the nearest neighbor (NN) method is much faster than prediction of TIDD with the Peyrard-Bishop-Dauxois (PBD) model, PBD predicted TIDD defines regulatory DNA regions with higher accuracy and detail. Here, we considered that PBD predicted TIDD is inherently related to the intrinsic duplex stabilities of destabilization regions. We show by regression modeling that NN duplex stabilities can be used to predict TIDD almost as accurately as is predicted with PBD. Predicted TIDD is in fact ascribed to non-linear transformation of NN duplex stabilities in destabilization regions as well as effects of neighboring regions relative to destabilization size. Since the prediction time of our models is over six orders of magnitude shorter than that of PBD, the models present an accessible tool for researchers. TIDD can be predicted on our webserver at http://tidd.immt.eu.
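As a rough illustration of the nearest neighbor (NN) duplex stability that this abstract builds on, the sketch below sums dinucleotide stacking free energies over a sliding window and reports the least stable window. It is not the authors' regression model for TIDD. The parameter table is assumed to follow the commonly cited unified NN set (ΔG at 37 °C, kcal/mol), initiation and salt corrections are omitted, and the function names and window size are hypothetical.

```python
# Stacking free energies (kcal/mol, 37 degrees C), keyed by the 5'->3'
# dinucleotide on one strand. Assumed to follow the commonly cited unified
# nearest-neighbor set; verify against the primary source before real use.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}

def stacking_energy(seq):
    """Sum of NN stacking energies along seq (no initiation or salt terms)."""
    seq = seq.upper()
    return sum(NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1))

def stability_profile(seq, window=10):
    """Stacking energy of every window of the given length along seq."""
    seq = seq.upper()
    return [
        stacking_energy(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]

def least_stable_window(seq, window=10):
    """Start index and energy of the window with the least negative energy."""
    profile = stability_profile(seq, window)
    start = max(range(len(profile)), key=profile.__getitem__)
    return start, profile[start]

# Toy sequence (hypothetical): an AT-rich stretch flanked by GC-rich DNA.
start, energy = least_stable_window("GCGCGCTATATAGCGCGC", window=6)
```

In this toy sequence the least stable window coincides with the AT-rich core, which is the kind of region where a melting bubble would be expected to open first.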
|
15
|
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 2015; 11:e1004418. [PMID: 26291518 PMCID: PMC4546298 DOI: 10.1371/journal.pcbi.1004418] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 11/19/2014] [Accepted: 06/29/2015] [Indexed: 11/19/2022] Open
Abstract
Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF agnostic and likely describes general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF agnostic model for identifying functional regulatory regions in potentially any sequenced genome. 
Identification of transcription factor binding sites based on sequence motifs is typically accompanied by a high false positive rate. Increasing evidence suggests that there are many other factors besides DNA sequence that may affect the binding and interaction of TFs with DNA. Through the integration of sequence motif, chromatin state, and DNA structure properties, we show that TF binding can be better predicted. Moreover, considering chromatin state and DNA structure properties simultaneously yields a significant improvement. While the binding of some TFs can be readily predicted using either chromatin state information or DNA structure, other TFs need both. Thus, our findings provide insights on how different histone modifications and DNA structure properties may influence the binding of a particular TF and thus how TFs regulate gene expression. These features are referred to as sequence “intrinsic properties” because they can be predicted from sequences alone. These intrinsic properties can be used to build a TF binding prediction model that has a similar performance to considering all features. Moreover, the intrinsic property model allows TFBS predictions not only across TFs, but also across DNA-binding domain families that are present in most eukaryotes, suggesting that the model likely can be used across species.
|
16
|
Abe N, Dror I, Yang L, Slattery M, Zhou T, Bussemaker HJ, Rohs R, Mann RS. Deconvolving the recognition of DNA shape from sequence. Cell 2015; 161:307-18. [PMID: 25843630 DOI: 10.1016/j.cell.2015.02.008] [Citation(s) in RCA: 138] [Impact Index Per Article: 13.8] [Received: 08/05/2014] [Revised: 12/08/2014] [Accepted: 01/26/2015] [Indexed: 01/25/2023]
Abstract
Protein-DNA binding is mediated by the recognition of the chemical signatures of the DNA bases and the 3D shape of the DNA molecule. Because DNA shape is a consequence of sequence, it is difficult to dissociate these modes of recognition. Here, we tease them apart in the context of Hox-DNA binding by mutating residues that, in a co-crystal structure, only recognize DNA shape. Complexes made with these mutants lose the preference to bind sequences with specific DNA shape features. Introducing shape-recognizing residues from one Hox protein to another swapped binding specificities in vitro and gene regulation in vivo. Statistical machine learning revealed that the accuracy of binding specificity predictions improves by adding shape features to a model that only depends on sequence, and feature selection identified shape features important for recognition. Thus, shape readout is a direct and independent component of binding site selection by Hox proteins.
Affiliation(s)
- Namiko Abe
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA
| | - Iris Dror
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Biology, Technion - Israel Institute of Technology, Haifa 32000, Israel
| | - Lin Yang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY 10032, USA
| | - Remo Rohs
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA; Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| | - Richard S Mann
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA.
|
17
|
Dai Z, Guo D, Dai X, Xiong Y. Genome-wide analysis of transcription factor binding sites and their characteristic DNA structures. BMC Genomics 2015; 16 Suppl 3:S8. [PMID: 25708259 PMCID: PMC4331811 DOI: 10.1186/1471-2164-16-s3-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Background: Transcription factors (TF) regulate gene expression by binding DNA regulatory regions. Transcription factor binding sites (TFBSs) are conserved not only in primary DNA sequences but also in DNA structures. However, the global relationship between TFs and their preferred DNA structures remains to be elucidated. Results: In this paper, we have developed a computational method to generate a genome-wide landscape of TFs and their characteristic binding DNA structures in Saccharomyces cerevisiae. We revealed DNA structural features for different TFs. The structural conservation shows positional preference in TFBSs. Structural levels of DNA sequences are correlated with TF-DNA binding affinities. Conclusions: We provided the genome-wide correspondences of TFs to DNA structures. Our findings will have implications in understanding TF regulatory mechanisms.
|
18
|
Varicella-zoster virus-derived major histocompatibility complex class I-restricted peptide affinity is a determining factor in the HLA risk profile for the development of postherpetic neuralgia. J Virol 2014; 89:962-9. [PMID: 25355886 DOI: 10.1128/jvi.02500-14] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 12/31/2022] Open
Abstract
Postherpetic neuralgia (PHN) is the most common complication of herpes zoster and is typified by a lingering pain that can last months or years after the characteristic herpes zoster rash disappears. It is well known that there are risk factors for the development of PHN, such as its association with certain HLA alleles. In this study, previous HLA genotyping results were collected and subjected to a meta-analysis with increased statistical power. This work shows that the alleles HLA-A*33 and HLA-B*44 are significantly enriched in PHN patients, while HLA-A*02 and HLA-B*40 are significantly depleted. Prediction of the varicella-zoster virus (VZV) peptide affinity for these four HLA variants by using one in-house-developed and two existing state-of-the-art major histocompatibility complex (MHC) class I ligand prediction methods reveals that there is a great difference in their absolute and relative peptide binding repertoires. It was observed that HLA-A*02 displays a high affinity for an ∼7-fold-higher number of VZV peptides than HLA-B*44. Furthermore, after correction for HLA allele-specific limitations, the relative affinity of HLA-A*33 and HLA-B*44 for VZV peptides was found to be significantly lower than those of HLA-A*02 and HLA-B*40. In addition, HLA peptide affinity calculations indicate strong trends for VZV to avoid high-affinity peptides in some of its proteins, independent of the studied HLA allele.
IMPORTANCE: Varicella-zoster virus can cause two distinct diseases: chickenpox (varicella) and shingles (herpes zoster). Varicella is a common disease in young children, while herpes zoster is more frequent in older individuals. A common complication of herpes zoster is postherpetic neuralgia, a persistent and debilitating pain that can remain months up to years after the resolution of the rash. In this study, we show that the relative affinity of HLA variants associated with higher postherpetic neuralgia risk for varicella-zoster virus peptides is lower than that of variants with a lower risk. These results provide new insight into the development of postherpetic neuralgia and strongly support the hypothesis that one of its possible underlying causes is a suboptimal anti-VZV immune response due to weak HLA binding peptide affinity.
|
19
|
Chiu TP, Yang L, Zhou T, Main BJ, Parker SCJ, Nuzhdin SV, Tullius TD, Rohs R. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 2014; 43:D103-9. [PMID: 25326329 PMCID: PMC4384032 DOI: 10.1093/nar/gku977] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 01/11/2023] Open
Abstract
Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.
Affiliation(s)
- Tsu-Pei Chiu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Bradley J Main
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Stephen C J Parker
- Departments of Computational Medicine and Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Sergey V Nuzhdin
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Thomas D Tullius
- Department of Chemistry and Program in Bioinformatics, Boston University, Boston, MA 02215, USA
| | - Remo Rohs
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA Departments of Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
|
20
|
An improved systematic approach to predicting transcription factor target genes using support vector machine. PLoS One 2014; 9:e94519. [PMID: 24743548 PMCID: PMC3990533 DOI: 10.1371/journal.pone.0094519] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 10/11/2012] [Accepted: 03/17/2014] [Indexed: 11/21/2022] Open
Abstract
Biological prediction of transcription factor binding sites and their corresponding transcription factor target genes (TFTGs) makes a great contribution to understanding gene regulatory networks. However, these approaches are based on laborious and time-consuming biological experiments. Numerous computational approaches have shown great potential to circumvent laborious biological methods. However, the majority of these algorithms provide limited performance and fail to consider the structural properties of the datasets. We proposed a refined systematic computational approach for predicting TFTGs. Based on previous work on identifying auxin response factor target genes from Arabidopsis thaliana co-expression data, we adopted a novel reverse-complementary distance-sensitive n-gram profile algorithm. This algorithm converts each upstream sub-sequence into a high-dimensional vector data point and transforms the prediction task into a classification problem using a support vector machine-based classifier. Our approach showed significant improvement compared to other computational methods, based on the area under the receiver operating characteristic curve using 10-fold cross-validation. In addition, in light of the highly skewed structure of the dataset, we also evaluated other metrics and their associated curves, such as precision-recall curves and cost curves, which provided highly satisfactory results.
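The idea of turning an upstream sequence into a vector that treats a site and its reverse complement alike can be sketched as follows. This is a simplified n-gram (k-mer) count profile, not the distance-sensitive algorithm the abstract refers to, whose details are not given here; the function names and the default k are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    return min(kmer, reverse_complement(kmer))

def rc_kmer_profile(seq, k=3):
    """Count k-mers in seq, merging each k-mer with its reverse complement."""
    seq = seq.upper()
    counts = Counter()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if set(kmer) <= set("ACGT"):  # ignore windows with ambiguity codes
            counts[canonical(kmer)] += 1
    return counts

def profile_vector(seq, vocabulary, k=3):
    """Fixed-order count vector over a chosen list of canonical k-mers."""
    counts = rc_kmer_profile(seq, k)
    return [counts.get(kmer, 0) for kmer in vocabulary]

profile = rc_kmer_profile("AAATTT", k=3)
```

Because of the merging step, a sequence and its reverse complement map to the same profile, so a downstream classifier does not need to learn strand orientation separately.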
|
21
|
Meysman P, Collado-Vides J, Morett E, Viola R, Engelen K, Laukens K. Structural properties of prokaryotic promoter regions correlate with functional features. PLoS One 2014; 9:e88717. [PMID: 24516674 PMCID: PMC3918002 DOI: 10.1371/journal.pone.0088717] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Received: 10/11/2013] [Accepted: 01/10/2014] [Indexed: 12/31/2022] Open
Abstract
The structural properties of the DNA molecule are known to play a critical role in transcription. In this paper, the structural profiles of promoter regions were studied within the context of their diversity and their function for eleven prokaryotic species: Escherichia coli, Klebsiella pneumoniae, Salmonella Typhimurium, Pseudomonas aeruginosa, Geobacter sulfurreducens, Helicobacter pylori, Chlamydophila pneumoniae, Synechocystis sp., Synechococcus elongatus, Bacillus anthracis, and the archaeon Sulfolobus solfataricus. The main anchor points for these promoter regions were transcription start sites identified through high-throughput experiments or collected within large curated databases. Prokaryotic promoter regions were found to be less stable and less flexible than the genomic mean across all studied species. However, direct comparison between species revealed differences in their structural profiles that cannot solely be explained by the difference in genomic GC content. In addition, comparison with functional data revealed that there are patterns in the promoter structural profiles that can be linked to specific functional loci, such as sigma factor regulation or transcription factor binding. Interestingly, a novel structural element clearly visible near the transcription start site was found in genes associated with essential cellular functions and growth in several species. Our analyses reveal the great diversity in promoter structural profiles both between and within prokaryotic species. We observed relationships between structural diversity and functional features that are interesting prospects for further research into as yet uncharacterized functional loci defined by DNA structural properties.
Affiliation(s)
- Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
| | - Julio Collado-Vides
- Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
| | - Enrique Morett
- Instituto de Biotecnología, Universidad Nacional Autónoma de México, Cuernavaca, Morelos, Mexico
- Instituto Nacional de Medicina Genómica, Mexico City, Mexico
| | - Roberto Viola
- Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
| | - Kristof Engelen
- Department of Computational Biology, Fondazione Edmund Mach, San Michele all’Adige, Trento, Italy
- * E-mail: (KE); (KL)
| | - Kris Laukens
- Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
- Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, Edegem, Belgium
- * E-mail: (KE); (KL)
|
22
|
Computational prediction of transcription factor binding sites based on an integrative approach incorporating genomic and epigenomic features. Genes Genomics 2014. [DOI: 10.1007/s13258-013-0136-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
23
|
Meysman P, Sonego P, Bianco L, Fu Q, Ledezma-Tejeida D, Gama-Castro S, Liebens V, Michiels J, Laukens K, Marchal K, Collado-Vides J, Engelen K. COLOMBOS v2.0: an ever expanding collection of bacterial expression compendia. Nucleic Acids Res 2013; 42:D649-53. [PMID: 24214998 PMCID: PMC3965013 DOI: 10.1093/nar/gkt1086] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.8] [Indexed: 11/25/2022] Open
Abstract
The COLOMBOS database (http://www.colombos.net) features comprehensive organism-specific cross-platform gene expression compendia of several bacterial model organisms and is supported by a fully interactive web portal and an extensive web API. COLOMBOS was originally published in PLoS One, and COLOMBOS v2.0 includes both an update of the expression data, by expanding the previously available compendia and by adding compendia for several new species, and an update of the surrounding functionality, with improved search and visualization options and novel tools for programmatic access to the database. The scope of the database has also been extended to incorporate RNA-seq data in our compendia by a dedicated analysis pipeline. We demonstrate the validity and robustness of this approach by comparing the same RNA samples measured in parallel using both microarrays and RNA-seq. As far as we know, COLOMBOS currently hosts the largest homogenized gene expression compendia available for seven bacterial model organisms.
Affiliation(s)
- Pieter Meysman
- Department of Mathematics and Computer Science, University of Antwerp, B-2020 Antwerp, Belgium, Biomedical Informatics Research Center Antwerp (biomina), University of Antwerp/Antwerp University Hospital, B-2650 Edegem, Belgium, Department of Computational Biology, Research and Innovation Center, Fondazione Edmund Mach, San Michele all'Adige, Trento (TN) 38010, Italy, Department of Microbial and Molecular Sciences, KU Leuven, Leuven B-3001, Belgium, Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Morelos 62210, Mexico, Department of Plant Biotechnology and Bioinformatics, Ghent University, Gent 9052, Belgium and Department of Information Technology, IMinds, Ghent University, Gent 9052, Belgium
|
24
|
Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, Rohs R. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2013; 42:D148-55. [PMID: 24214955 PMCID: PMC3964943 DOI: 10.1093/nar/gkt1087] [Citation(s) in RCA: 91] [Impact Index Per Article: 7.6] [Indexed: 12/26/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
Affiliation(s)
- Lin Yang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA, Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel, Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada and Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
|
25
|
Band smearing of PCR amplified bacterial 16S rRNA genes: dependence on initial PCR target diversity. J Microbiol Methods 2013; 95:186-94. [PMID: 23954706 DOI: 10.1016/j.mimet.2013.08.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/07/2013] [Revised: 08/05/2013] [Accepted: 08/05/2013] [Indexed: 11/23/2022]
Abstract
Band smearing in agarose gels of PCR amplified bacterial 16S rRNA genes is understood to comprise amplicons of varying sizes arising from PCR errors, and requires elimination. We consider that with amplified heterogeneous DNA, delayed electro-migration is caused not by PCR errors but by dsDNA structures that arise from imperfect strand pairing. The extent of band smearing was found to be proportional to the sequence heterogeneity in 16S rRNA variable regions. Denaturing alkaline gels showed that all amplified DNA was of the correct size. A novel bioinformatic approach was used to reveal that band smearing occurred due to imperfectly paired strands of the amplified DNA. Since the smear is a structural fraction of the correct size PCR product, it carries important information on richness and diversity of the target DNA. For accurate analysis, the origin of the smear must first be identified before it is eliminated by examining the amplified DNA in denaturing alkaline gels.
|
26
|
Meysman P, Sánchez-Rodríguez A, Fu Q, Marchal K, Engelen K. Expression divergence between Escherichia coli and Salmonella enterica serovar Typhimurium reflects their lifestyles. Mol Biol Evol 2013; 30:1302-14. [PMID: 23427276 PMCID: PMC3649669 DOI: 10.1093/molbev/mst029] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.8] [Indexed: 01/08/2023] Open
Abstract
Escherichia coli K12 is a commensal bacterium and one of the best-studied model organisms. Salmonella enterica serovar Typhimurium, on the other hand, is a facultative intracellular pathogen. These two prokaryotic species can be considered related phylogenetically, and they share a large amount of their genetic material, which is commonly termed the "core genome." Despite their shared core genome, the two species display very different lifestyles, and it is unclear to what extent the core genome, apart from the species-specific genes, plays a role in this lifestyle divergence. In this study, we focus on the differences in expression domains for the orthologous genes in E. coli and S. Typhimurium. The iterative comparison of coexpression methodology was used on large expression compendia of both species to uncover the conservation and divergence of gene expression. We found that gene expression conservation occurs mostly independently from amino acid similarity. According to our estimates, more than one quarter of the orthologous genes have a different expression domain in E. coli than in S. Typhimurium. Genes involved with key cellular processes are most likely to have conserved their expression domains, whereas genes showing diverged expression are associated with metabolic processes that, although present in both species, are regulated differently. The expression domains of the shared "core" genome of E. coli and S. Typhimurium, consisting of highly conserved orthologs, have been tuned to help accommodate the differences in lifestyle and the pathogenic potential of Salmonella.
Affiliation(s)
- Pieter Meysman
- Department of Microbial and Molecular Systems, KU Leuven, Leuven, Belgium
|
27
|
Nowak-Lovato K, Alexandrov LB, Banisadr A, Bauer AL, Bishop AR, Usheva A, Mu F, Hong-Geller E, Rasmussen KØ, Hlavacek WS, Alexandrov BS. Binding of nucleoid-associated protein fis to DNA is regulated by DNA breathing dynamics. PLoS Comput Biol 2013; 9:e1002881. [PMID: 23341768 PMCID: PMC3547798 DOI: 10.1371/journal.pcbi.1002881] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Received: 09/04/2012] [Accepted: 11/29/2012] [Indexed: 12/23/2022] Open
Abstract
Physicochemical properties of DNA, such as shape, affect protein-DNA recognition. However, the properties of DNA that are most relevant for predicting the binding sites of particular transcription factors (TFs) or classes of TFs have yet to be fully understood. Here, using a model that accurately captures the melting behavior and breathing dynamics (spontaneous local openings of the double helix) of double-stranded DNA, we simulated the dynamics of known binding sites of the TF and nucleoid-associated protein Fis in Escherichia coli. Our study involves simulations of breathing dynamics, analysis of large published in vitro and genomic datasets, and targeted experimental tests of our predictions. Our simulation results and available in vitro binding data indicate a strong correlation between DNA breathing dynamics and Fis binding. Indeed, we can define an average DNA breathing profile that is characteristic of Fis binding sites. This profile is significantly enriched among the identified in vivo E. coli Fis binding sites. To test our understanding of how Fis binding is influenced by DNA breathing dynamics, we designed base-pair substitutions, mismatch, and methylation modifications of DNA regions that are known to interact (or not interact) with Fis. The goal in each case was to make the local DNA breathing dynamics either closer to or farther from the breathing profile characteristic of a strong Fis binding site. For the modified DNA segments, we found that Fis-DNA binding, as assessed by gel-shift assay, changed in accordance with our expectations. We conclude that Fis binding is associated with DNA breathing dynamics, which in turn may be regulated by various nucleotide modifications.
Author Summary
Cellular transcription factors (TFs) are proteins that regulate gene expression, and thereby cellular activity and fate, by binding to specific DNA segments. The physicochemical determinants of protein-DNA binding specificity are not completely understood. Here, we report that the propensity of transient opening and re-closing of the double helix, resulting from thermal fluctuations, also known as “DNA breathing” or “DNA bubbles,” can be associated with binding affinity in the case of Fis, a well-studied nucleoid-associated protein in Escherichia coli. We found that a particular breathing profile is characteristic of high-affinity Fis binding sites and that DNA fragments known to bind Fis in vivo are statistically enriched for this profile. Furthermore, we used simulations of DNA breathing dynamics to guide design of gel-shift experiments aimed at testing the idea that local breathing influences Fis binding. As a result, we show that via nucleotide modifications, but without modifying nucleotides that directly contact Fis, we were able to transform a low-affinity Fis binding site into a high-affinity site and vice versa. The nucleotide modifications were designed based only on DNA breathing simulations. Our study suggests that strong Fis-DNA binding depends on DNA breathing, a novel physicochemical characteristic that could be used for prediction and rational design of TF binding sites.
Affiliation(s)
- Kristy Nowak-Lovato
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Ludmil B. Alexandrov
- Cancer Genome Project, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Afsheen Banisadr
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Amy L. Bauer
- X-Theoretical Design Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Alan R. Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Anny Usheva
- Harvard Medical School, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Fangping Mu
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Elizabeth Hong-Geller
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Kim Ø. Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- William S. Hlavacek
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
- Boian S. Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
28
Maienschein-Cline M, Dinner AR, Hlavacek WS, Mu F. Improved predictions of transcription factor binding sites using physicochemical features of DNA. Nucleic Acids Res 2012; 40:e175. [PMID: 22923524 PMCID: PMC3526315 DOI: 10.1093/nar/gks771] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Indexed: 01/07/2023] Open
Abstract
Typical approaches for predicting transcription factor binding sites (TFBSs) involve use of a position-specific weight matrix (PWM) to statistically characterize the sequences of the known sites. Recently, an alternative physicochemical approach, called SiteSleuth, was proposed. In this approach, a linear support vector machine (SVM) classifier is trained to distinguish TFBSs from background sequences based on local chemical and structural features of DNA. SiteSleuth appears to generally perform better than PWM-based methods. Here, we improve the SiteSleuth approach by considering both new physicochemical features and algorithmic modifications. New features are derived from Gibbs energies of amino acid-DNA interactions and hydroxyl radical cleavage profiles of DNA. Algorithmic modifications consist of inclusion of a feature selection step, use of a nonlinear kernel in the SVM classifier, and use of a consensus-based post-processing step for predictions. We also considered SVM classification based on letter features alone to distinguish performance gains from use of SVM-based models versus use of physicochemical features. The accuracy of each of the variant methods considered was assessed by cross validation using data available in the RegulonDB database for 54 Escherichia coli TFs, as well as by experimental validation using published ChIP-chip data available for Fis and Lrp.
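As a rough illustration of the PWM baseline that the abstract above contrasts with SVM-based methods, the following minimal sketch builds a log-odds matrix from aligned known sites and scans a sequence for its best-scoring window. It is not the SiteSleuth method or any published implementation; the function names (`build_pwm`, `score_window`, `best_hit`), the uniform 0.25 background, the pseudocount of 1, and the toy sites are all assumptions made for the example.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites.

    Assumes a uniform background and an add-one style pseudocount
    (both are illustrative choices, not taken from the cited paper).
    """
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window of DNA."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pwm)
    hits = [
        (score_window(pwm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(hits)

# Toy aligned sites (hypothetical, for illustration only).
known_sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]
pwm = build_pwm(known_sites)
top_score, top_start = best_hit(pwm, "CCCTTGACACCC")
print(top_start, round(top_score, 2))
```

A classifier-based approach such as the one described in the abstract would replace `score_window` with a trained model over richer features; the scanning loop would stay essentially the same.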
29
Meysman P, Marchal K, Engelen K. DNA structural properties in the classification of genomic transcription regulation elements. Bioinform Biol Insights 2012; 6:155-68. [PMID: 22837642 PMCID: PMC3399529 DOI: 10.4137/bbi.s9426] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 12/27/2022] Open
Abstract
It has long been known that DNA molecules encode information at various levels. The most basic level comprises the base sequence itself and is primarily important for the encoding of proteins and direct base recognition by DNA-binding proteins. A more elusive level consists of the local structural properties of the DNA molecule, wherein the DNA sequence only plays an indirect supportive role. These properties are nevertheless an important factor in a large number of biomolecular processes and can be considered as informative signals for the presence of a variety of genomic features. Several recent studies have unequivocally shown the benefit of relying on such DNA properties for modeling and predicting genomic features as diverse as transcription start sites, transcription factor binding sites, or nucleosome occupancy. This review is meant to provide an overview of the key aspects of these DNA conformational and physicochemical properties. To illustrate their potential added value compared to relying solely on the nucleotide sequence in genomics studies, we discuss their application in research on transcription regulation mechanisms as representative cases.
Affiliation(s)
- Pieter Meysman
- Department of Molecular and Microbial Systems, KULeuven, Kasteelpark Arenberg 20, 3001 Leuven, Belgium
30
Hooghe B, Broos S, van Roy F, De Bleser P. A flexible integrative approach based on random forest improves prediction of transcription factor binding sites. Nucleic Acids Res 2012; 40:e106. [PMID: 22492513 PMCID: PMC3413102 DOI: 10.1093/nar/gks283] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.1] [Indexed: 01/13/2023] Open
Abstract
Transcription factor binding sites (TFBSs) are DNA sequences of 6–15 base pairs. Interaction of these TFBSs with transcription factors (TFs) is largely responsible for most spatiotemporal gene expression patterns. Here, we evaluate to what extent sequence-based prediction of TFBSs can be improved by taking into account the positional dependencies of nucleotides (NPDs) and the nucleotide sequence-dependent structure of DNA. We make use of the random forest algorithm to flexibly exploit both types of information. Results in this study show that both the structural method and the NPD method can be valuable for the prediction of TFBSs. Moreover, their predictive values seem to be complementary, even to the widely used position weight matrix (PWM) method. This led us to combine all three methods. Results obtained for five eukaryotic TFs with different DNA-binding domains show that our method improves classification accuracy for all five eukaryotic TFs compared with other approaches. Additionally, we contrast the results of seven smaller prokaryotic sets with high-quality data and show that with the use of high-quality data we can significantly improve prediction performance. Models developed in this study can be of great use for gaining insight into the mechanisms of TF binding.
Affiliation(s)
- Bart Hooghe
- Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
31
Wang D, Do HT. Computational localization of transcription factor binding sites using extreme learning machines. Soft Comput 2012. [DOI: 10.1007/s00500-012-0820-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/27/2023]
32
Engelen K, Fu Q, Meysman P, Sánchez-Rodríguez A, De Smet R, Lemmens K, Fierro AC, Marchal K. COLOMBOS: access port for cross-platform bacterial expression compendia. PLoS One 2011; 6:e20938. [PMID: 21779320 PMCID: PMC3136457 DOI: 10.1371/journal.pone.0020938] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 02/25/2011] [Accepted: 05/13/2011] [Indexed: 12/26/2022] Open
Abstract
Background: Microarrays are the main technology for large-scale transcriptional gene expression profiling, but the large bodies of data available in public databases are not useful due to the large heterogeneity. There are several initiatives that attempt to bundle these data into expression compendia, but such resources for bacterial organisms are scarce and limited to integration of experiments from the same platform or to indirect integration of per experiment analysis results.
Methodology/Principal Findings: We have constructed comprehensive organism-specific cross-platform expression compendia for three bacterial model organisms (Escherichia coli, Bacillus subtilis, and Salmonella enterica serovar Typhimurium) together with an access portal, dubbed COLOMBOS, that not only provides easy access to the compendia, but also includes a suite of tools for exploring, analyzing, and visualizing the data within these compendia. It is freely available at http://bioi.biw.kuleuven.be/colombos. The compendia are unique in directly combining expression information from different microarray platforms and experiments, and we illustrate the potential benefits of this direct integration with a case study: extending the known regulon of the Fur transcription factor of E. coli. The compendia also incorporate extensive annotations for both genes and experimental conditions; these heterogeneous data are functionally integrated in the COLOMBOS analysis tools to interactively browse and query the compendia not only for specific genes or experiments, but also metabolic pathways, transcriptional regulation mechanisms, experimental conditions, biological processes, etc.
Conclusions/Significance: We have created cross-platform expression compendia for several bacterial organisms and developed a complementary access port COLOMBOS, that also serves as a convenient expression analysis tool to extract useful biological information. This work is relevant to a large community of microbiologists by facilitating the use of publicly available microarray experiments to support their research.
Affiliation(s)
- Kristof Engelen
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- * E-mail: (KE); (KM)
- Qiang Fu
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Pieter Meysman
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Aminael Sánchez-Rodríguez
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Riet De Smet
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Karen Lemmens
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Ana Carolina Fierro
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- Kathleen Marchal
- Department of Microbial and Molecular Systems, Katholieke Universiteit Leuven, Heverlee-Leuven, Belgium
- * E-mail: (KE); (KM)