1
Trerotola M, Antolini L, Beni L, Guerra E, Spadaccini M, Verzulli D, Moschella A, Alberti S. A deterministic code for transcription factor-DNA recognition through computation of binding interfaces. NAR Genom Bioinform 2022; 4:lqac008. [PMID: 35261972 PMCID: PMC8896162 DOI: 10.1093/nargab/lqac008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2021] [Revised: 12/05/2021] [Accepted: 02/28/2022] [Indexed: 11/14/2022] Open
Abstract
The recognition code between transcription factor (TF) amino acids and DNA bases remains poorly understood. Here, the determinants of TF amino acid-DNA base binding selectivity were identified through the analysis of crystals of TF-DNA complexes. Selective, high-frequency interactions were identified for the vast majority of amino acid side chains (‘structural code’). DNA binding specificities were then independently assessed by meta-analysis of random-mutagenesis studies of Zn finger-target DNA sequences. Selective, high-frequency interactions were identified for the majority of mutagenized residues (‘mutagenesis code’). The structural code and the mutagenesis code were shown to match to a striking level of accuracy (P = 3.1 × 10⁻³³), suggesting the identification of fundamental rules of TF binding to DNA bases. Additional insight was gained by showing a geometry-dictated choice among DNA-binding TF residues with overlapping specificity. These findings indicate the existence of a DNA recognition mode whereby the physical-chemical characteristics of the interacting residues play a deterministic role. The discovery of this DNA recognition code advances our knowledge on fundamental features of regulation of gene expression and is expected to pave the way for integration with higher-order complexity approaches.
Affiliation(s)
- Marco Trerotola
- Laboratory of Cancer Pathology, Center for Advanced Studies and Technology (CAST), University “G. d’Annunzio”, Via L. Polacchi 11, 66100 Chieti, Italy
- Department of Medical, Oral and Biotechnological Sciences, University “G. d’Annunzio”, 66100 Chieti, Italy
- Laura Antolini
- Center for Biostatistics, Department of Clinical Medicine, Prevention and Biotechnology, University of Milano-Bicocca, 20052 Monza, Italy
- Laura Beni
- Laboratory of Cancer Pathology, Center for Advanced Studies and Technology (CAST), University “G. d’Annunzio”, Via L. Polacchi 11, 66100 Chieti, Italy
- Emanuela Guerra
- Laboratory of Cancer Pathology, Center for Advanced Studies and Technology (CAST), University “G. d’Annunzio”, Via L. Polacchi 11, 66100 Chieti, Italy
- Department of Medical, Oral and Biotechnological Sciences, University “G. d’Annunzio”, 66100 Chieti, Italy
- Damiano Verzulli
- Unit of Informatics, University “G. d’Annunzio”, 66100 Chieti, Italy
- Antonino Moschella
- Unit of Medical Genetics, Department of Biomedical Sciences - BIOMORF, University of Messina, via Consolare Valeria, 98125 Messina, Italy
- Saverio Alberti
- Laboratory of Cancer Pathology, Center for Advanced Studies and Technology (CAST), University “G. d’Annunzio”, Via L. Polacchi 11, 66100 Chieti, Italy
- Unit of Medical Genetics, Department of Biomedical Sciences - BIOMORF, University of Messina, via Consolare Valeria, 98125 Messina, Italy
2
Cui F, Zhang Z, Zou Q. Sequence representation approaches for sequence-based protein prediction tasks that use deep learning. Brief Funct Genomics 2021; 20:61-73. [PMID: 33527980 DOI: 10.1093/bfgp/elaa030] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 12/01/2020] [Revised: 12/16/2020] [Accepted: 12/18/2020] [Indexed: 11/12/2022] Open
Abstract
Deep learning has been increasingly used in bioinformatics, especially in sequence-based protein prediction tasks, as large amounts of biological data are available and deep learning techniques have been developed rapidly in recent years. For sequence-based protein prediction tasks, the selection of a suitable model architecture is essential, whereas sequence data representation is a major factor in controlling model performance. Here, we summarized all the main approaches that are used to represent protein sequence data (amino acid sequence encoding or embedding), which include end-to-end embedding methods, non-contextual embedding methods and embedding methods that use transfer learning and others that are applied for some specific tasks (such as protein sequence embedding based on extracted features for protein structure predictions and graph convolutional network-based embedding for drug discovery tasks). We have also reviewed the architectures of various types of embedding models theoretically and the development of these types of sequence embedding approaches to facilitate researchers and users in selecting the model that best suits their requirements.
Affiliation(s)
- Feifei Cui
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Zilong Zhang
- University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
3
Shi L, Liu L, Lv X, Ma Z, Li C, Li Y, Zhao F, Sun D, Han B. Identification of genetic effects and potential causal polymorphisms of CPM gene impacting milk fatty acid traits in Chinese Holstein. Anim Genet 2020; 51:491-501. [PMID: 32301146 DOI: 10.1111/age.12936] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/26/2019] [Revised: 02/03/2020] [Accepted: 03/15/2020] [Indexed: 11/27/2022]
Abstract
Our previous GWAS revealed 83 significant SNPs and 20 promising candidate genes associated with milk fatty acid traits in dairy cattle. Among them, the carboxypeptidase M (CPM) gene contains a genome-wide significant SNP, Hapmap49848-BTA-106779, which is strongly associated with myristic acid (C14:0; P = 0.0064). Herein, we aimed to confirm the genetic effects of CPM on milk fatty acids in Chinese Holstein. Seven SNPs were detected by re-sequencing the entire exons and 3000 bp of the up-/downstream flanking regions of the CPM gene, of which three were in the 5' flanking region, one in the 3' UTR and three in the 3' flanking region. Using Haploview 4.1, we estimated the LD among the identified SNPs and found two haplotype blocks. With the animal model, we performed SNP- and haplotype-based association analyses, and observed that these SNPs and haplotype blocks mainly had strong genetic associations with medium-chain saturated fatty acids (caproic acid, C6:0; caprylic acid, C8:0; capric acid, C10:0; and lauric acid, C12:0) (P < 0.0001-0.0257). In addition, using the Genomatix software, we predicted that three SNPs in the 5' flanking region of CPM (g.45079507A>G, g.45080228C>A and g.45080335C>G) changed the transcription factor binding sites for PREF (progesterone receptor binding site), ZBRK1 (transcription factor with eight central zinc fingers and an N-terminal KRAB domain), SOX9 (sex-determining region Y-box 9, dimeric binding sites), SOX6 (sex-determining region Y-box 6) and FOXP1-ES (alternative splicing variant of FOXP1, activated in ESCs). Further, the dual-luciferase reporter assay showed that these three SNPs altered the transcriptional activity of the CPM gene (P ≤ 0.0006). In summary, using the post-GWAS strategy, we first confirmed the significant genetic effects of CPM on milk fatty acids in dairy cattle, and identified three potential causal mutations.
Affiliation(s)
- L Shi
- Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, 100193, China; Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- L Liu
- Beijing Dairy Cattle Center, Beijing, 100192, China
- X Lv
- Beijing Dairy Cattle Center, Beijing, 100192, China
- Z Ma
- Beijing Dairy Cattle Center, Beijing, 100192, China
- C Li
- Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, 100193, China
- Y Li
- Beijing Dairy Cattle Center, Beijing, 100192, China
- F Zhao
- Beijing Dairy Cattle Center, Beijing, 100192, China
- D Sun
- Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, 100193, China
- B Han
- Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Key Laboratory of Animal Genetics, Breeding and Reproduction of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory for Animal Breeding, China Agricultural University, Beijing, 100193, China
4
Käppel S, Melzer R, Rümpler F, Gafert C, Theißen G. The floral homeotic protein SEPALLATA3 recognizes target DNA sequences by shape readout involving a conserved arginine residue in the MADS-domain. The Plant Journal: for Cell and Molecular Biology 2018; 95:341-357. [PMID: 29744943 DOI: 10.1111/tpj.13954] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 02/05/2018] [Revised: 04/17/2018] [Accepted: 04/23/2018] [Indexed: 05/05/2023]
Abstract
SEPALLATA3 of Arabidopsis thaliana is a MADS-domain transcription factor (TF) and a key regulator of flower development. MADS-domain proteins bind to sequences termed 'CArG-boxes' [consensus 5'-CC(A/T)₆GG-3']. Because only a fraction of the CArG-boxes in the Arabidopsis genome are bound by SEPALLATA3, more elaborate principles have to be discovered to better understand which features turn CArG-boxes into genuine recognition sites. Here, we investigate to what extent the shape of the DNA is involved in a 'shape readout' that contributes to the binding of SEPALLATA3. We determined in vitro binding affinities of SEPALLATA3 to DNA probes that all contain the CArG-box motif, but differ in their predicted DNA shape. We found that binding affinity correlates well with a narrow minor groove of the DNA. Substitution of canonical bases with non-standard bases supports the hypothesis of minor groove shape readout by SEPALLATA3. Analysis of mutant SEPALLATA3 proteins further revealed that a highly conserved arginine residue, which is expected to contact the DNA minor groove, contributes significantly to the shape readout. Our studies show that the specific recognition of cis-regulatory elements by a plant MADS-domain TF, and by inference probably also of other TFs of this type, heavily depends on shape readout mechanisms.
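As an illustration of the consensus quoted in this abstract, the following minimal Python sketch scans a DNA string for CArG-box candidates. The function name `find_carg_boxes` and the probe sequence are hypothetical and do not come from the paper. The regular expression only encodes the sequence consensus (CC, six A/T positions, GG); it says nothing about DNA shape or binding affinity, which is the actual subject of the study.

```python
import re

# Sequence consensus from the abstract: CC, then six A/T positions, then GG.
CARG_BOX = re.compile(r"CC[AT]{6}GG")


def find_carg_boxes(seq):
    """Return (0-based start, matched sequence) for each CArG-box candidate."""
    return [(m.start(), m.group()) for m in CARG_BOX.finditer(seq.upper())]


# Hypothetical probe sequence, for illustration only.
probe = "GATTCCAAATATGGCATG"
print(find_carg_boxes(probe))
```

The reverse complement of any sequence matching this pattern also matches it, so scanning a single strand should be enough to find candidates on both strands.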
Affiliation(s)
- Sandra Käppel
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Rainer Melzer
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- School of Biology and Environmental Science, University College Dublin, Belfield, Dublin 4, Ireland
- Florian Rümpler
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Christian Gafert
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
- Günter Theißen
- Department of Genetics, Friedrich Schiller University Jena, Philosophenweg 12, D-07743, Jena, Germany
5
Lee W, Park B, Han K. Sequence-based prediction of putative transcription factor binding sites in DNA sequences of any length. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 15:1461-1469. [PMID: 29990126 DOI: 10.1109/tcbb.2017.2773075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
A transcription factor (TF) is a protein that regulates gene expression by binding to specific DNA sequences. Despite the recent advances in experimental techniques for identifying transcription factor binding sites (TFBS) in DNA sequences, a large number of TFBS are to be unveiled in many species. Several computational methods developed for predicting TFBS in DNA are tissue- or species-specific methods, so cannot be used without prior knowledge of tissue or species. Some computational methods are applicable to finding TFBS in short DNA sequences only. In this paper we propose a new learning method for predicting TFBS in DNA of any length using the composition, transition and distribution of nucleotides and amino acids in DNA and TF sequences. In independent testing of the method on datasets that were not used in training the method, its accuracy and MCC were as high as 81.84% and 0.634, respectively. The proposed method can be a useful aid for selecting potential TFBS in a large amount of DNA sequences before conducting biochemical experiments to empirically determine TFBS. The program and data sets are available at http://bclab.inha.ac.kr/TFbinding.
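The two figures of merit quoted in this abstract, accuracy and the Matthews correlation coefficient (MCC), are both functions of a binary confusion matrix. The sketch below shows the standard definitions in Python. The function names and the counts in the usage line are hypothetical; the abstract does not report the underlying confusion matrix, so the example does not attempt to reproduce the published values.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1.

    Returns 0.0 when any marginal total is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
print(accuracy(40, 40, 10, 10), mcc(40, 40, 10, 10))
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when binding and non-binding examples are unevenly represented.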
6
Guo C, McDowell IC, Nodzenski M, Scholtens DM, Allen AS, Lowe WL, Reddy TE. Transversions have larger regulatory effects than transitions. BMC Genomics 2017; 18:394. [PMID: 28525990 PMCID: PMC5438547 DOI: 10.1186/s12864-017-3785-4] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 08/26/2016] [Accepted: 05/10/2017] [Indexed: 12/30/2022] Open
Abstract
Background: Transversions (Tv’s) are more likely to alter the amino acid sequence of proteins than transitions (Ts’s), and local deviations in the Ts:Tv ratio are indicative of evolutionary selection on genes. Whether the two different types of mutations have different effects in non-protein-coding sequences remains unknown. Genetic variants primarily impact gene expression by disrupting the binding of transcription factors (TFs) and other DNA-binding proteins. Because Tv’s cause larger changes in the shape of a DNA backbone, we hypothesized that Tv’s would have larger impacts on TF binding and gene expression.
Results: Here, we provide multiple lines of evidence demonstrating that Tv’s have larger impacts on regulatory DNA including analyses of TF binding motifs and allele-specific TF binding. In these analyses, we observed a depletion of Tv’s within TF binding motifs and TF binding sites. Using massively parallel population-scale reporter assays, we also provided empirical evidence that Tv’s have larger effects than Ts’s on the activity of human gene regulatory elements.
Conclusions: Tv’s are more likely to disrupt TF binding, resulting in larger changes in gene expression. Although the observed differences are small, these findings represent a novel, fundamental property of regulatory variation. Understanding the features of functional non-coding variation could be valuable for revealing the genetic underpinnings of complex traits and diseases in future studies.
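The distinction this abstract relies on can be stated compactly: a transition swaps a purine for a purine (A/G) or a pyrimidine for a pyrimidine (C/T), while a transversion swaps one class for the other. The following Python sketch classifies single-nucleotide substitutions and computes a Ts:Tv ratio. The function names and the variant list are hypothetical illustrations and are not taken from the paper's data or code.

```python
from collections import Counter

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'Ts' for a transition and 'Tv' for a transversion."""
    ref, alt = ref.upper(), alt.upper()
    pair = {ref, alt}
    if ref == alt or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-nucleotide substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "Ts" if same_class else "Tv"


def ts_tv_ratio(variants):
    """Ts:Tv ratio over (ref, alt) pairs; infinite if there are no Tv's."""
    counts = Counter(classify_substitution(r, a) for r, a in variants)
    if counts["Tv"] == 0:
        return float("inf")
    return counts["Ts"] / counts["Tv"]


# Hypothetical variants, for illustration only.
print(ts_tv_ratio([("A", "G"), ("C", "T"), ("A", "T")]))
```

Of the twelve possible single-base substitutions, four are transitions and eight are transversions, so a Ts:Tv ratio above 0.5 already indicates an excess of transitions relative to uniform expectation.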
Affiliation(s)
- Cong Guo
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; University Program in Genetics and Genomics, Duke University, Durham, NC, 27710, USA
- Ian C McDowell
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, NC, 27710, USA
- Michael Nodzenski
- Department of Preventive Medicine, Division of Biostatistics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Denise M Scholtens
- Department of Preventive Medicine, Division of Biostatistics, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Andrew S Allen
- Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina, 27710, USA; Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, 27710, USA
- William L Lowe
- Division of Endocrinology, Metabolism and Molecular Medicine, Department of Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, 60611, USA
- Timothy E Reddy
- Center for Genomic and Computational Biology, Duke University Medical School, Durham, NC, 27710, USA; Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, NC, 27710, USA; Present Address: Biostatistics & Bioinformatics, 101 Science Dr., 2347 CIEMAS, Durham, NC, 27708, USA
7
Chen ZY, Guo XJ, Chen ZX, Chen WY, Wang JR. Identification and positional distribution analysis of transcription factor binding sites for genes from the wheat fl-cDNA sequences. Biosci Biotechnol Biochem 2017; 81:1125-1135. [PMID: 28485207 DOI: 10.1080/09168451.2017.1295803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Abstract
The binding sites of transcription factors (TFs) in upstream DNA regions are called transcription factor binding sites (TFBSs). TFBSs are important elements for regulating gene expression. To date, there have been few studies on the profiles of TFBSs in plants. In total, 4873 sequences with 5' upstream regions from 8530 wheat fl-cDNA sequences were used to predict TFBSs. We found 4572 TFBSs for the MADS TF family, which was twice as many as for bHLH (1951), B3 (1951), HB superfamily (1914), ERF (1820), and AP2/ERF (1725) TFs, and was approximately four times higher than the remaining TFBS types. The percentage of TFBSs and TF members showed a distinct distribution in different tissues. Overall, the distribution of TFBSs in the upstream regions of wheat fl-cDNA sequences differed significantly. Meanwhile, high frequencies of some types of TFBSs were found in specific regions in the upstream sequences. Both TFs and fl-cDNA with TFBSs predicted in the same tissues exhibited specific distribution preferences for regulating gene expression. The tissue-specific analysis of TFs and fl-cDNA with TFBSs provides useful information for functional research, and can be used to identify relationships between tissue-specific TFs and fl-cDNA with TFBSs. Moreover, the positional distribution of TFBSs indicates that some types of wheat TFBS have different positional distribution preferences in the upstream regions of genes.
Affiliation(s)
- Zhen-Yong Chen
- Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China; College of Life Science, China West Normal University, Nanchong, China
- Xiao-Jiang Guo
- Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Zhong-Xu Chen
- Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
- Wei-Ying Chen
- College of Life Science, China West Normal University, Nanchong, China
- Ji-Rui Wang
- Triticeae Research Institute, Sichuan Agricultural University, Chengdu, China
8
Sun S, Zhang X, Peng Q. A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli. Artif Intell Med 2017; 75:16-23. [PMID: 28363453 DOI: 10.1016/j.artmed.2016.11.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/07/2016] [Accepted: 11/23/2016] [Indexed: 11/29/2022]
Abstract
Background: Identifying transcription factor binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding for transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-to-sample ratio is large (i.e., number of dimensions/number of samples), concatenating different physicochemical properties into a one-dimensional vector not only is likely to lose some structural information, but also poses significant challenges to recognition methods.
Materials and methods: In this paper, we introduce a purely geometric representation method, the tensor (also called a multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, the tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from the RegulonDB database.
Results: In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly a 33% improvement over the best result from existing methods.
Conclusion: The representation method for TFs is an important step in TFBS recognition. We illustrate the benefits of this representation on a real data application via a series of experiments. This method can provide further insight into the mechanism of TF binding and be of great use for metabolic engineering applications.
Affiliation(s)
- Shiquan Sun
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China; Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA.
- Xiongpan Zhang
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China.
- Qinke Peng
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China.
9
Kumar A, Bansal M. Unveiling DNA structural features of promoters associated with various types of TSSs in prokaryotic transcriptomes and their role in gene expression. DNA Res 2017; 24:25-35. [PMID: 27803028 PMCID: PMC5381344 DOI: 10.1093/dnares/dsw045] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 05/12/2016] [Accepted: 09/23/2016] [Indexed: 01/28/2023] Open
Abstract
Next-generation sequencing studies have revealed that a variety of transcripts are present in the prokaryotic transcriptome and a significant fraction of them are functional, being involved in various regulatory activities apart from coding for proteins. Identification of promoters associated with different transcripts is necessary for characterization of the transcriptome. Promoter regions have been shown to have unique structural features as compared with their flanking region, in organisms covering all domains of life. Here we report an in silico analysis of DNA sequence dependent structural properties like stability, bendability and curvature in the promoter region of six different prokaryotic transcriptomes. Using these structural features, we predicted promoters associated with different categories of transcripts (mRNA, internal, antisense and non-coding), which constitute the transcriptome. Promoter annotation using structural features is fairly accurate and reliable with about 50% of the primary promoters being characterized by all three structural properties while at least one property identifies 95%. We also studied the relative differences of these structural features in terms of gene expression and found that the features, viz. lower stability, lesser bendability and higher curvature are more prominent in the promoter regions which are associated with high gene expression as compared with low expression genes. Hence, promoters, which are associated with higher gene expression, get annotated well using DNA structural features as compared with those, which are linked to lower gene expression.
Affiliation(s)
- Manju Bansal
- Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560012 Karnataka, India
10
Deplancke B, Alpern D, Gardeux V. The Genetics of Transcription Factor DNA Binding Variation. Cell 2016; 166:538-554. [PMID: 27471964 DOI: 10.1016/j.cell.2016.07.012] [Citation(s) in RCA: 244] [Impact Index Per Article: 30.5] [Received: 02/05/2016] [Indexed: 12/23/2022]
Abstract
Most complex trait-associated variants are located in non-coding regulatory regions of the genome, where they have been shown to disrupt transcription factor (TF)-DNA binding motifs. Variable TF-DNA interactions are therefore increasingly considered as key drivers of phenotypic variation. However, recent genome-wide studies revealed that the majority of variable TF-DNA binding events are not driven by sequence alterations in the motif of the studied TF. This observation implies that the molecular mechanisms underlying TF-DNA binding variation and, by extrapolation, inter-individual phenotypic variation are more complex than originally anticipated. Here, we summarize the findings that led to this important paradigm shift and review proposed mechanisms for local, proximal, or distal genetic variation-driven variable TF-DNA binding. In addition, we discuss the biomedical implications of these findings for our ability to dissect the molecular role(s) of non-coding genetic variants in complex traits, including disease susceptibility.
Affiliation(s)
- Bart Deplancke
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland.
- Daniel Alpern
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Vincent Gardeux
- Laboratory of Systems Biology and Genetics, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne and Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
11
Peng PC, Sinha S. Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res 2016; 44:e120. [PMID: 27257066 PMCID: PMC5291265 DOI: 10.1093/nar/gkw446] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/02/2015] [Revised: 05/06/2016] [Accepted: 05/09/2016] [Indexed: 12/11/2022] Open
Abstract
Prediction of gene expression levels driven by regulatory sequences is pivotal in genomic biology. A major focus in transcriptional regulation is sequence-to-expression modeling, which interprets the enhancer sequence based on transcription factor concentrations and DNA binding specificities and predicts precise gene expression levels in varying cellular contexts. Such models largely rely on the position weight matrix (PWM) model for DNA binding, and the effect of alternative models based on DNA shape remains unexplored. Here, we propose a statistical thermodynamics model of gene expression using DNA shape features of binding sites. We used rigorous methods to evaluate the fits of expression readouts of 37 enhancers regulating spatial gene expression patterns in the Drosophila embryo, and show that DNA shape-based models perform arguably better than PWM-based models. We also observed that DNA shape captures information complementary to the PWM, in a way that is useful for expression modeling. Furthermore, we tested whether combining shape and PWM-based features provides better predictions than using either binding model alone. Our work demonstrates that the increasingly popular DNA-binding models based on local DNA shape can be useful in sequence-to-expression modeling. It also provides a framework for future studies to predict gene expression better than with PWM models alone.
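For readers unfamiliar with the PWM baseline that this abstract compares against, the following Python sketch builds a log-odds position weight matrix from aligned sites and scans a sequence with it. It illustrates the generic PWM idea only; it is not the statistical thermodynamics model or the shape-based model of the paper. The function names, the toy sites, the pseudocount of 1 and the uniform background of 0.25 are all assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds PWM from equal-length aligned sites (uniform background assumed)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Sum of per-position log-odds for a window as long as the PWM."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, 0-based start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Toy aligned sites, for illustration only.
pwm = build_pwm(["TATA", "TATA", "TATT"])
print(best_hit(pwm, "GGTATAGG"))
```

A PWM scores each position independently, which is the simplification that shape-aware models aim to relax, since local DNA shape depends on neighbouring bases.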
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
12
Qin W, Zhao G, Carson M, Jia C, Lu H. Knowledge-based three-body potential for transcription factor binding site prediction. IET Syst Biol 2016; 10:23-9. [PMID: 26816396 DOI: 10.1049/iet-syb.2014.0066] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
A structure-based statistical potential is developed for transcription factor binding site (TFBS) prediction. Besides the direct contact between amino acids from TFs and DNA bases, the authors also considered the influence of the neighbouring base. This three-body potential showed better discriminative power than the two-body potential. They validated the performance of the potential in TFBS identification, binding energy prediction and binding mutation prediction.
Affiliation(s)
- Wenyi Qin
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Guijun Zhao
- Key Laboratory of Molecular Embryology, Ministry of Health & Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, People's Republic of China
- Matthew Carson
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA
- Caiyan Jia
- School of Computer and Information Technology & Beijing Key Lab of Traffic Data Analysis, Beijing Jiaotong University, Beijing, People's Republic of China
- Hui Lu
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, USA.
13
Barr CL, Misener VL. Decoding the non-coding genome: elucidating genetic risk outside the coding genome. Genes, Brain, and Behavior 2016; 15:187-204. [PMID: 26515765 PMCID: PMC4833497 DOI: 10.1111/gbb.12269] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Received: 08/07/2015] [Revised: 10/19/2015] [Accepted: 10/28/2015] [Indexed: 12/11/2022]
Abstract
Current evidence emerging from genome-wide association studies indicates that the genetic underpinnings of complex traits are likely attributable to genetic variation that changes gene expression, rather than (or in combination with) variation that changes protein-coding sequences. This is particularly compelling with respect to psychiatric disorders, as genetic changes in regulatory regions may result in differential transcriptional responses to developmental cues and environmental/psychosocial stressors. Until recently, however, the link between transcriptional regulation and psychiatric genetic risk has been understudied. Multiple obstacles have contributed to the paucity of research in this area, including challenges in identifying the positions of remote (distal from the promoter) regulatory elements (e.g. enhancers) and their target genes and the underrepresentation of neural cell types and brain tissues in epigenome projects - the availability of high-quality brain tissues for epigenetic and transcriptome profiling, particularly for the adolescent and developing brain, has been limited. Further challenges have arisen in the prediction and testing of the functional impact of DNA variation with respect to multiple aspects of transcriptional control, including regulatory-element interaction (e.g. between enhancers and promoters), transcription factor binding and DNA methylation. Further, the brain has uncommon DNA-methylation marks with unique genomic distributions not found in other tissues - current evidence suggests the involvement of non-CG methylation and 5-hydroxymethylation in neurodevelopmental processes but much remains unknown. We review here knowledge gaps as well as both technological and resource obstacles that will need to be overcome in order to elucidate the involvement of brain-relevant gene-regulatory variants in genetic risk for psychiatric disorders.
Affiliation(s)
- C. L. Barr
  - Toronto Western Research Institute, University Health Network, Toronto, ON, Canada
  - Program in Neurosciences and Mental Health, The Hospital for Sick Children, Toronto, ON, Canada
- V. L. Misener
  - Toronto Western Research Institute, University Health Network, Toronto, ON, Canada
14
AlQuraishi M, Tang S, Xia X. An affinity-structure database of helix-turn-helix: DNA complexes with a universal coordinate system. BMC Bioinformatics 2015; 16:390. [PMID: 26586237 PMCID: PMC4653904 DOI: 10.1186/s12859-015-0819-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 03/04/2015] [Accepted: 11/11/2015] [Indexed: 11/28/2022] Open
Abstract
Background: Molecular interactions between proteins and DNA molecules underlie many cellular processes, including transcriptional regulation, chromosome replication, and nucleosome positioning. Computational analyses of protein-DNA interactions rely on experimental data characterizing known protein-DNA interactions structurally and biochemically. While many databases exist that contain either structural or biochemical data, few integrate these two data sources in a unified fashion. Such integration is becoming increasingly critical with the rapid growth of structural and biochemical data, and the emergence of algorithms that rely on the synthesis of multiple data types to derive computational models of molecular interactions.
Description: We have developed an integrated affinity-structure database in which the experimental and quantitative DNA binding affinities of helix-turn-helix proteins are mapped onto the crystal structures of the corresponding protein-DNA complexes. This database provides access to: (i) protein-DNA structures, (ii) quantitative summaries of protein-DNA binding affinities using position weight matrices, and (iii) raw experimental data of protein-DNA binding instances. Critically, this database establishes a correspondence between experimental structural data and quantitative binding affinity data at the single basepair level. Furthermore, we present a novel alignment algorithm that structurally aligns the protein-DNA complexes in the database and creates a unified residue-level coordinate system for comparing the physico-chemical environments at the interface between complexes. Using this unified coordinate system, we compute the statistics of atomic interactions at the protein-DNA interface of helix-turn-helix proteins. We provide an interactive website for visualization, querying, and analyzing this database, and a downloadable version to facilitate programmatic analysis.
Conclusions: This database will facilitate the analysis of protein-DNA interactions and the development of programmatic computational methods that capitalize on integration of structural and biochemical datasets. The database can be accessed at http://ProteinDNA.hms.harvard.edu.
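The abstract mentions position weight matrices as quantitative summaries of binding affinity. The following minimal sketch shows one common way to build a log-odds matrix from aligned binding sites and score a candidate sequence with it. The toy sites, the uniform background of 0.25, the pseudocount, and both function names are assumptions for illustration and do not reflect the database's own format or procedure.

```python
import math

BASES = "ACGT"


def position_weight_matrix(sites, background=0.25, pseudocount=0.5):
    """Build a log2-odds matrix (one dict per position) from aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")

    denom = len(sites) + len(BASES) * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / denom) / background)
            for base in BASES
        })
    return pwm


def score(pwm, seq):
    """Sum the per-position log-odds for a sequence of matching length."""
    return sum(column[base] for column, base in zip(pwm, seq))


# Hypothetical aligned binding sites, for illustration only.
toy_sites = ["TTGACA", "TTGACT", "TTGATA", "TAGACA"]
pwm = position_weight_matrix(toy_sites)
consensus_score = score(pwm, "TTGACA")
background_like_score = score(pwm, "CCCCCC")
```

Higher scores indicate sequences closer to the aligned sites; the additive form assumes positions contribute independently, which is the simplification a single-basepair structural correspondence could help test.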
Affiliation(s)
- Mohammed AlQuraishi
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA; HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
- Shengdong Tang
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA; HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
- Xide Xia
  - Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA; HMS Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue, Boston, MA, 02115, USA
15
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 2015; 11:e1004418. [PMID: 26291518 PMCID: PMC4546298 DOI: 10.1371/journal.pcbi.1004418] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Received: 11/19/2014] [Accepted: 06/29/2015] [Indexed: 11/19/2022] Open
Abstract
Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF agnostic and likely describes general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF agnostic model for identifying functional regulatory regions in potentially any sequenced genome. 
Identification of transcription factor binding sites based on sequence motifs is typically accompanied by a high false positive rate. Increasing evidence suggests that there are many other factors besides DNA sequence that may affect the binding and interaction of TFs with DNA. Through the integration of sequence motif, chromatin state, and DNA structure properties, we show that TF binding can be better predicted. Moreover, considering chromatin state and DNA structure properties simultaneously yields a significant improvement. While the binding of some TFs can be readily predicted using either chromatin state information or DNA structure, other TFs need both. Thus, our findings provide insights on how different histone modifications and DNA structure properties may influence the binding of a particular TF and thus how TFs regulate gene expression. These features are referred to as sequence “intrinsic properties” because they can be predicted from sequences alone. These intrinsic properties can be used to build a TF binding prediction model that has a similar performance to considering all features. Moreover, the intrinsic property model allows TFBS predictions not only across TFs, but also across DNA-binding domain families that are present in most eukaryotes, suggesting that the model likely can be used across species.
16
Abe N, Dror I, Yang L, Slattery M, Zhou T, Bussemaker HJ, Rohs R, Mann RS. Deconvolving the recognition of DNA shape from sequence. Cell 2015; 161:307-18. [PMID: 25843630 DOI: 10.1016/j.cell.2015.02.008] [Citation(s) in RCA: 138] [Impact Index Per Article: 15.3] [Received: 08/05/2014] [Revised: 12/08/2014] [Accepted: 01/26/2015] [Indexed: 01/25/2023]
Abstract
Protein-DNA binding is mediated by the recognition of the chemical signatures of the DNA bases and the 3D shape of the DNA molecule. Because DNA shape is a consequence of sequence, it is difficult to dissociate these modes of recognition. Here, we tease them apart in the context of Hox-DNA binding by mutating residues that, in a co-crystal structure, only recognize DNA shape. Complexes made with these mutants lose the preference to bind sequences with specific DNA shape features. Introducing shape-recognizing residues from one Hox protein to another swapped binding specificities in vitro and gene regulation in vivo. Statistical machine learning revealed that the accuracy of binding specificity predictions improves by adding shape features to a model that only depends on sequence, and feature selection identified shape features important for recognition. Thus, shape readout is a direct and independent component of binding site selection by Hox proteins.
Affiliation(s)
- Namiko Abe
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA
- Iris Dror
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Biology, Technion - Israel Institute of Technology, Haifa 32000, Israel
- Lin Yang
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Matthew Slattery
  - Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA
- Tianyin Zhou
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY 10032, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA; Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Richard S Mann
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA
17
Chen W, Zhang L. The pattern of DNA cleavage intensity around indels. Sci Rep 2015; 5:8333. [PMID: 25660536 PMCID: PMC4321175 DOI: 10.1038/srep08333] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/15/2014] [Accepted: 01/07/2015] [Indexed: 12/22/2022] Open
Abstract
Indels (insertions and deletions) are the second most common form of genetic variation in eukaryotic genomes and are responsible for a multitude of genetic diseases. Despite their significance, the detailed molecular mechanisms of indel generation are still unclear. Here we examined 2,656,597 small human and mouse germline indels, 16,742 human somatic indels, 10,599 large human insertions, and 5,822 large chimpanzee insertions and systematically analyzed the patterns of DNA cleavage intensities in the 200 base pair regions surrounding these indels. Our results show that DNA cleavage intensities close to the start and end points of indels are significantly lower than in other regions, for both small human germline and somatic indels and also for mouse small indels. Compared to small indels, the patterns of DNA cleavage intensity around large indels are more complex, and there are two low-intensity regions near each end of the indels that are approximately 13 bp apart from each other. Detailed analyses of a subset of indels show a slight difference in cleavage intensity distribution between insertions and deletions that may be attributable to their respective enrichment of different repetitive elements. These results will provide new insight into indel generation mechanisms.
Affiliation(s)
- Wei Chen
  - Department of Physics, School of Sciences, Center for Genomics and Computational Biology, Hebei United University, Tangshan, China 063000
  - Department of Computer Science, Virginia Tech, Blacksburg VA 24060
- Liqing Zhang
  - Department of Computer Science, Virginia Tech, Blacksburg VA 24060
18
Dai Z, Guo D, Dai X, Xiong Y. Genome-wide analysis of transcription factor binding sites and their characteristic DNA structures. BMC Genomics 2015; 16 Suppl 3:S8. [PMID: 25708259 PMCID: PMC4331811 DOI: 10.1186/1471-2164-16-s3-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Background: Transcription factors (TFs) regulate gene expression by binding DNA regulatory regions. Transcription factor binding sites (TFBSs) are conserved not only in primary DNA sequences but also in DNA structures. However, the global relationship between TFs and their preferred DNA structures remains to be elucidated.
Results: In this paper, we have developed a computational method to generate a genome-wide landscape of TFs and their characteristic binding DNA structures in Saccharomyces cerevisiae. We revealed DNA structural features for different TFs. The structural conservation shows positional preference in TFBSs. Structural levels of DNA sequences are correlated with TF-DNA binding affinities.
Conclusions: We provided the genome-wide correspondences of TFs to DNA structures. Our findings will have implications in understanding TF regulatory mechanisms.
19
Chiu TP, Yang L, Zhou T, Main BJ, Parker SCJ, Nuzhdin SV, Tullius TD, Rohs R. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 2014; 43:D103-9. [PMID: 25326329 PMCID: PMC4384032 DOI: 10.1093/nar/gku977] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Indexed: 01/11/2023] Open
Abstract
Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.
Affiliation(s)
- Tsu-Pei Chiu
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Tianyin Zhou
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Bradley J Main
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Stephen C J Parker
  - Departments of Computational Medicine and Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
- Sergey V Nuzhdin
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Thomas D Tullius
  - Department of Chemistry and Program in Bioinformatics, Boston University, Boston, MA 02215, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
  - Departments of Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
20
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 337] [Impact Index Per Article: 33.7] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Affiliation(s)
- Matthew Slattery
  - Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA
- Tianyin Zhou
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Ana Carolina Dantas Machado
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Raluca Gordân
  - Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA
- Remo Rohs
  - Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
21
Kılıç S, White ER, Sagitova DM, Cornish JP, Erill I. CollecTF: a database of experimentally validated transcription factor-binding sites in Bacteria. Nucleic Acids Res 2013; 42:D156-60. [PMID: 24234444 PMCID: PMC3965012 DOI: 10.1093/nar/gkt1123] [Citation(s) in RCA: 70] [Impact Index Per Article: 6.4] [Indexed: 01/21/2023] Open
Abstract
The influx of high-throughput data and the need for complex models to describe the interaction of prokaryotic transcription factors (TF) with their target sites pose new challenges for TF-binding site databases. CollecTF (http://collectf.umbc.edu) compiles data on experimentally validated, naturally occurring TF-binding sites across the Bacteria domain, placing a strong emphasis on the transparency of the curation process, the quality and availability of the stored data and fully customizable access to its records. CollecTF integrates multiple sources of data automatically and openly, allowing users to dynamically redefine binding motifs and their experimental support base. Data quality and currency are fostered in CollecTF by adopting a sustainable model that encourages direct author submissions in combination with in-house validation and curation of published literature. CollecTF entries are periodically submitted to NCBI for integration into RefSeq complete genome records as link-out features, maximizing the visibility of the data and enriching the annotation of RefSeq files with regulatory information. Seeking to facilitate comparative genomics and machine-learning analyses of regulatory interactions, in its initial release CollecTF provides domain-wide coverage of two TF families (LexA and Fur), as well as extensive representation for a clinically important bacterial family, the Vibrionaceae.
Affiliation(s)
- Ivan Erill
  - To whom correspondence should be addressed. Tel: +1 410 455 2470; Fax: +1 410 455 3875
22
Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, Rohs R. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2013; 42:D148-55. [PMID: 24214955 PMCID: PMC3964943 DOI: 10.1093/nar/gkt1087] [Citation(s) in RCA: 91] [Impact Index Per Article: 8.3] [Indexed: 12/26/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
Affiliation(s)
- Lin Yang
  - Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA, Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel, Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada and Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
23
Brand LH, Henneges C, Schüssler A, Kolukisaoglu HÜ, Koch G, Wallmeroth N, Hecker A, Thurow K, Zell A, Harter K, Wanke D. Screening for protein-DNA interactions by automatable DNA-protein interaction ELISA. PLoS One 2013; 8:e75177. [PMID: 24146751 PMCID: PMC3795721 DOI: 10.1371/journal.pone.0075177] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 02/14/2013] [Accepted: 08/12/2013] [Indexed: 12/22/2022] Open
Abstract
DNA-binding proteins (DBPs), such as transcription factors, constitute about 10% of the protein-coding genes in eukaryotic genomes and play pivotal roles in the regulation of chromatin structure and gene expression by binding to short stretches of DNA. Despite their number and importance, the binding sequence has been determined for only a minor portion of DBPs. Methods that allow the de novo identification of DNA-binding motifs of known DBPs, such as protein binding microarray technology or SELEX, are not yet suited for high-throughput use and automation. To close this gap, we report an automatable DNA-protein-interaction (DPI)-ELISA screen of an optimized double-stranded DNA (dsDNA) probe library that allows the high-throughput identification of hexanucleotide DNA-binding motifs. In contrast to other methods, this DPI-ELISA screen can be performed manually or with standard laboratory automation. Furthermore, output evaluation does not require extensive computational analysis to derive a binding consensus. We could show that the DPI-ELISA screen disclosed the full spectrum of binding preferences for a given DBP. As an example, AtWRKY11 was used to demonstrate that the automated DPI-ELISA screen revealed the entire range of in vitro binding preferences. In addition, protein extracts of AtbZIP63 and the DNA-binding domain of AtWRKY33 were analyzed, which led to a refinement of their known DNA-binding consensus sequences. Finally, we performed a DPI-ELISA screen to disclose the DNA-binding consensus of a yet uncharacterized putative DBP, AtTIFY1. A palindromic TGATCA consensus was uncovered and we could show that the GATC core is compulsory for AtTIFY1 binding. This specific interaction between AtTIFY1 and its DNA-binding motif was confirmed by in vivo plant one-hybrid assays in protoplasts. Thus, the DPI-ELISA screen, which is also applicable under automated conditions, is a promising approach for de novo binding site identification of DBPs and for a deeper understanding of gene regulation in any organism of choice.
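To make the term "palindromic" concrete for a hexanucleotide motif such as TGATCA, the short sketch below checks whether a sequence equals its own reverse complement and enumerates all hexamers with that property. It is only an illustration of the definition; the helper names are hypothetical and nothing here reproduces the authors' optimized probe library or their evaluation procedure.

```python
from itertools import product

# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """A DNA palindrome reads the same on both strands in the 5'->3' direction."""
    return seq == reverse_complement(seq)


# Enumerate all 4**6 hexanucleotides and keep the palindromic ones.
hexamers = ["".join(bases) for bases in product("ACGT", repeat=6)]
palindromes = [h for h in hexamers if is_palindromic(h)]

# Palindromic hexamers that contain a GATC core.
with_gatc_core = [h for h in palindromes if "GATC" in h]
```

Because the last three bases of a palindromic hexamer are fixed by the first three, only 4^3 = 64 of the 4,096 hexamers qualify, and four of those carry a central GATC core.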
Affiliation(s)
- Luise H. Brand
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
- Carsten Henneges
  - Cognitive Systems, Center for Bioinformatics, University of Tuebingen, Tuebingen, Germany
- Axel Schüssler
  - Cognitive Systems, Center for Bioinformatics, University of Tuebingen, Tuebingen, Germany
- H. Üner Kolukisaoglu
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
  - Center for Life Science Automation, Rostock, Germany
- Grit Koch
  - Center for Life Science Automation, Rostock, Germany
- Niklas Wallmeroth
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
- Andreas Hecker
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
- Andreas Zell
  - Cognitive Systems, Center for Bioinformatics, University of Tuebingen, Tuebingen, Germany
- Klaus Harter
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
- Dierk Wanke
  - Plant Physiology, Center for Plant Molecular Biology, University of Tuebingen, Tuebingen, Germany
24
Nowak-Lovato K, Alexandrov LB, Banisadr A, Bauer AL, Bishop AR, Usheva A, Mu F, Hong-Geller E, Rasmussen KØ, Hlavacek WS, Alexandrov BS. Binding of nucleoid-associated protein fis to DNA is regulated by DNA breathing dynamics. PLoS Comput Biol 2013; 9:e1002881. [PMID: 23341768 PMCID: PMC3547798 DOI: 10.1371/journal.pcbi.1002881] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 09/04/2012] [Accepted: 11/29/2012] [Indexed: 12/23/2022] Open
Abstract
Physicochemical properties of DNA, such as shape, affect protein-DNA recognition. However, the properties of DNA that are most relevant for predicting the binding sites of particular transcription factors (TFs) or classes of TFs have yet to be fully understood. Here, using a model that accurately captures the melting behavior and breathing dynamics (spontaneous local openings of the double helix) of double-stranded DNA, we simulated the dynamics of known binding sites of the TF and nucleoid-associated protein Fis in Escherichia coli. Our study involves simulations of breathing dynamics, analysis of large published in vitro and genomic datasets, and targeted experimental tests of our predictions. Our simulation results and available in vitro binding data indicate a strong correlation between DNA breathing dynamics and Fis binding. Indeed, we can define an average DNA breathing profile that is characteristic of Fis binding sites. This profile is significantly enriched among the identified in vivo E. coli Fis binding sites. To test our understanding of how Fis binding is influenced by DNA breathing dynamics, we designed base-pair substitutions, mismatch, and methylation modifications of DNA regions that are known to interact (or not interact) with Fis. The goal in each case was to make the local DNA breathing dynamics either closer to or farther from the breathing profile characteristic of a strong Fis binding site. For the modified DNA segments, we found that Fis-DNA binding, as assessed by gel-shift assay, changed in accordance with our expectations. We conclude that Fis binding is associated with DNA breathing dynamics, which in turn may be regulated by various nucleotide modifications.
Cellular transcription factors (TFs) are proteins that regulate gene expression, and thereby cellular activity and fate, by binding to specific DNA segments. The physicochemical determinants of protein-DNA binding specificity are not completely understood. Here, we report that the propensity of transient opening and re-closing of the double helix, resulting from thermal fluctuations, aka “DNA breathing” or “DNA bubbles,” can be associated with binding affinity in the case of Fis, a well-studied nucleoid-associated protein in Escherichia coli. We found that a particular breathing profile is characteristic of high-affinity Fis binding sites and that DNA fragments known to bind Fis in vivo are statistically enriched for this profile. Furthermore, we used simulations of DNA breathing dynamics to guide design of gel-shift experiments aimed at testing the idea that local breathing influences Fis binding. As a result, we show that via nucleotide modifications but without modifying nucleotides that directly contact Fis, we were able to transform a low-affinity Fis binding site into a high-affinity site and vice versa. The nucleotide modifications were designed only based on DNA breathing simulations. Our study suggests that strong Fis-DNA binding depends on DNA breathing - a novel physicochemical characteristic that could be used for prediction and rational design of TF binding sites.
Affiliation(s)
- Kristy Nowak-Lovato
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Ludmil B. Alexandrov
- Cancer Genome Project, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
- Afsheen Banisadr
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Amy L. Bauer
- X-Theoretical Design Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Alan R. Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Anny Usheva
- Harvard Medical School, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
- Fangping Mu
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Elizabeth Hong-Geller
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- Kim Ø. Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- William S. Hlavacek
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
- Boian S. Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
|
25
|
Functional Implications of Local DNA Structures in Regulatory Motifs. ScientificWorldJournal 2013; 2013:965752. [PMID: 23766731 PMCID: PMC3666281 DOI: 10.1155/2013/965752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2013] [Accepted: 04/23/2013] [Indexed: 11/19/2022]
Abstract
The three-dimensional structure of DNA has been proposed to be a major determinant of functional transcription factor (TF)-DNA interaction. Here, we use the hydroxyl radical cleavage pattern as a measure of local DNA structure. We compared the conservation of DNA sequence and structure in terms of information content and attempted to assess the functional implications of DNA structures in regulatory motifs. We used statistical methods to evaluate the structural divergence caused by substituting a single position within a binding site and applied them to a collection of putative regulatory motifs. The following are our major observations: (i) we observed more information in the structural alignment than in the corresponding sequence alignment for most of the transcription factors; (ii) for each TF, the majority of positions have more information in the structural alignment than in the sequence alignment; (iii) we further defined a DNA structural divergence score (SD score) for each wild-type and mutant pair that differs by a single-base mutation. The SD score for benign mutations is significantly lower than that for switch mutations. This indicates that structural conservation is also important for TF binding sites (TFBSs) to be functional and that DNA structures provide previously unappreciated information that TFs use to achieve binding specificity.
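To make the "information content" comparison in this abstract concrete, the following is a minimal sketch of the sequence side only: the standard per-position Shannon information of a set of aligned binding sites. It is not the authors' implementation. The function name `information_content`, the uniform background, the absence of a small-sample correction, and the toy sites are all assumptions made for illustration. The abstract does not say how the structural alignment (hydroxyl radical cleavage patterns) was scored, so that side is not shown.

```python
import math
from collections import Counter


def information_content(sites, alphabet="ACGT"):
    """Per-column information content (bits) of equal-length aligned sites.

    Hypothetical helper: assumes a uniform background over `alphabet`
    and applies no small-sample correction.
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    max_bits = math.log2(len(alphabet))  # 2 bits for DNA
    result = []
    for col in range(length):
        counts = Counter(s[col] for s in sites)
        total = sum(counts.values())
        # Shannon entropy of the observed base frequencies in this column
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        result.append(max_bits - entropy)
    return result


# Toy alignment (invented data): columns 1-2 are fully conserved,
# columns 3-4 are variable and therefore carry less information.
sites = ["ACGT", "ACGA", "ACCT", "ACTG"]
ic = information_content(sites)
print([round(x, 2) for x in ic])
```

A fully conserved column reaches the 2-bit maximum for a four-letter alphabet, and the score falls as the column becomes more variable. A structural counterpart would first need a rule for turning the continuous cleavage signal into comparable states, which is left open here.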
|