1
|
Wang K, Zeng X, Zhou J, Liu F, Luan X, Wang X. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform 2024; 25:bbae195. [PMID: 38701417 PMCID: PMC11066948 DOI: 10.1093/bib/bbae195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2024] [Revised: 03/26/2024] [Accepted: 04/10/2024] [Indexed: 05/05/2024] Open
Abstract
Transcription factors (TFs) are proteins essential for regulating genetic transcriptions by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate predictions of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, the prediction performance needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely based on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model utilizes the pre-trained DNABERT-2 module to acquire the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested based on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
Collapse
Affiliation(s)
- Kai Wang
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
| | - Xuan Zeng
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
| | - Jingwen Zhou
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Key Laboratory of Industrial Biotechnology, Ministry of Education and School of Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Engineering Research Center of Ministry of Education on Food Synthetic Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Jiangsu Province Engineering Research Center of Food Synthetic Biotechnology, Jiangnan University, Wuxi 214122, China
| | - Fei Liu
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
| | - Xiaoli Luan
- Key Laboratory of Advanced Process Control for Light Industry (Ministry of Education), School of Internet of Things Engineering, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
| | - Xinglong Wang
- Science Center for Future Foods, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
- Key Laboratory of Industrial Biotechnology, Ministry of Education and School of Biotechnology, Jiangnan University, 1800 Lihu Road, Wuxi, Jiangsu 214122, China
| |
Collapse
|
2
|
Yu Y, Ding P, Gao H, Liu G, Zhang F, Yu B. Cooperation of local features and global representations by a dual-branch network for transcription factor binding sites prediction. Brief Bioinform 2023; 24:7030619. [PMID: 36748992 DOI: 10.1093/bib/bbad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2022] [Revised: 01/03/2023] [Accepted: 01/18/2023] [Indexed: 02/08/2023] Open
Abstract
Interactions between DNA and transcription factors (TFs) play an essential role in understanding transcriptional regulation mechanisms and gene expression. Due to the large accumulation of training data and low expense, deep learning methods have shown huge potential in determining the specificity of TFs-DNA interactions. Convolutional network-based and self-attention network-based methods have been proposed for transcription factor binding sites (TFBSs) prediction. Convolutional operations are efficient to extract local features but easy to ignore global information, while self-attention mechanisms are expert in capturing long-distance dependencies but difficult to pay attention to local feature details. To discover comprehensive features for a given sequence as far as possible, we propose a Dual-branch model combining Self-Attention and Convolution, dubbed as DSAC, which fuses local features and global representations in an interactive way. In terms of features, convolution and self-attention contribute to feature extraction collaboratively, enhancing the representation learning. In terms of structure, a lightweight but efficient architecture of network is designed for the prediction, in particular, the dual-branch structure makes the convolution and the self-attention mechanism can be fully utilized to improve the predictive ability of our model. The experiment results on 165 ChIP-seq datasets show that DSAC obviously outperforms other five deep learning based methods and demonstrate that our model can effectively predict TFBSs based on sequence feature alone. The source code of DSAC is available at https://github.com/YuBinLab-QUST/DSAC/.
Collapse
Affiliation(s)
- Yutong Yu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Pengju Ding
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Hongli Gao
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Guozhu Liu
- College of Information Science and Technology, Qingdao University of Science and Technology, China
| | - Fa Zhang
- School of Medical Technology, Beijing Institute of Technology, China
| | - Bin Yu
- College of Information Science and Technology, School of Data Science, Qingdao University of Science and Technology, China
| |
Collapse
|
3
|
Zhao L, Duan X, Liu H. Novel Grade Classification Tool with Lipidomics for Indica Rice Eating Quality Evaluation. Foods 2023; 12:foods12050944. [PMID: 36900461 PMCID: PMC10000924 DOI: 10.3390/foods12050944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2022] [Revised: 01/16/2023] [Accepted: 01/17/2023] [Indexed: 02/25/2023] Open
Abstract
The eating quality evaluation of rice is raising further concerns among researchers and consumers. This research is aimed to apply lipidomics in determining the distinction between different grades of indica rice and establishing effective models for rice quality evaluation. Herein, a high-throughput ultrahigh-performance liquid chromatography coupled with quadrupole time-of-flight (UPLC-QTOF/MS) method for comprehensive lipidomics profiling of rice was developed. Then, a total of 42 significantly different lipids among 3 sensory levels were identified and quantified for indica rice. The orthogonal partial least-squares discriminant analysis (OPLS-DA) models with the two sets of differential lipids showed clear distinction among three grades of indica rice. A correlation coefficient of 0.917 was obtained between the practical and model-predicted tasting scores of indica rice. Random forest (RF) results further verified the OPLS-DA model, and the accuracy of this method for grade prediction was 90.20%. Thus, this established approach was an efficient method for the eating grade prediction of indica rice.
Collapse
Affiliation(s)
- Luyao Zhao
- Academy of National Food and Strategic Reserves Administration, Beijing 100037, China
- Correspondence:
| | - Xiaoliang Duan
- Academy of National Food and Strategic Reserves Administration, Beijing 100037, China
| | - Hongbin Liu
- China Animal Disease Control Center, Beijing 100020, China
| |
Collapse
|
4
|
Yan W, Li Z, Pian C, Wu Y. PlantBind: an attention-based multi-label neural network for predicting plant transcription factor binding sites. Brief Bioinform 2022; 23:6713513. [PMID: 36155619 DOI: 10.1093/bib/bbac425] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2022] [Revised: 08/29/2022] [Accepted: 08/31/2022] [Indexed: 12/14/2022] Open
Abstract
Identification of transcription factor binding sites (TFBSs) is essential to understanding of gene regulation. Designing computational models for accurate prediction of TFBSs is crucial because it is not feasible to experimentally assay all transcription factors (TFs) in all sequenced eukaryotic genomes. Although many methods have been proposed for the identification of TFBSs in humans, methods designed for plants are comparatively underdeveloped. Here, we present PlantBind, a method for integrated prediction and interpretation of TFBSs based on DNA sequences and DNA shape profiles. Built on an attention-based multi-label deep learning framework, PlantBind not only simultaneously predicts the potential binding sites of 315 TFs, but also identifies the motifs bound by transcription factors. During the training process, this model revealed a strong similarity among TF family members with respect to target binding sequences. Trans-species prediction performance using four Zea mays TFs demonstrated the suitability of this model for transfer learning. Overall, this study provides an effective solution for identifying plant TFBSs, which will promote greater understanding of transcriptional regulatory mechanisms in plants.
Collapse
Affiliation(s)
| | - Zutan Li
- Nanjing Agricultur al University
| | - Cong Pian
- College of Sciences at Nanjing Agricultural University
| | - Yufeng Wu
- State Key Laboratory for Crop Genetics and Germplasm Enhancement, Bioinformatics Center, College of Agriculture, Academy for Advanced Interdisciplinary Studies at Nanjing Agricultural University
| |
Collapse
|
5
|
Construction of a Diagnostic Model for Lymph Node Metastasis of the Papillary Thyroid Carcinoma Using Preoperative Ultrasound Features and Imaging Omics. JOURNAL OF HEALTHCARE ENGINEERING 2022; 2022:1872412. [PMID: 35178222 PMCID: PMC8846989 DOI: 10.1155/2022/1872412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 12/14/2021] [Accepted: 01/07/2022] [Indexed: 11/17/2022]
Abstract
In this paper, we mainly adopted 337 patients who had undergone the surgery on lymph node metastasis of papillary thyroid carcinoma (PTC) as the sample population. In order to provide clinical reference for the intelligent decision-making in treatment plan and improvement of prognosis, we utilized ultrasound features and imaging features to construct five early diagnosis models for patients based on the ultrasound features, imaging features, and combined features. The model integrated with broad learning system (BLS) showed the best performance, with the area under the curve (AUC) of 0.857 (95% confidence interval (CI): 0.811–0.902)) and the accuracy of 0.805 (95% CI: 0.759–0.850). For demographic and clinical features, the prediction effect was also good, with the AUC more than 0.700.
Collapse
|
6
|
Han H, Ding C, Cheng X, Sang X, Liu T. iT4SE-EP: Accurate Identification of Bacterial Type IV Secreted Effectors by Exploring Evolutionary Features from Two PSI-BLAST Profiles. Molecules 2021; 26:molecules26092487. [PMID: 33923273 PMCID: PMC8123216 DOI: 10.3390/molecules26092487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 11/16/2022] Open
Abstract
Many gram-negative bacteria use type IV secretion systems to deliver effector molecules to a wide range of target cells. These substrate proteins, which are called type IV secreted effectors (T4SE), manipulate host cell processes during infection, often resulting in severe diseases or even death of the host. Therefore, identification of putative T4SEs has become a very active research topic in bioinformatics due to its vital roles in understanding host-pathogen interactions. PSI-BLAST profiles have been experimentally validated to provide important and discriminatory evolutionary information for various protein classification tasks. In the present study, an accurate computational predictor termed iT4SE-EP was developed for identifying T4SEs by extracting evolutionary features from the position-specific scoring matrix and the position-specific frequency matrix profiles. First, four types of encoding strategies were designed to transform protein sequences into fixed-length feature vectors based on the two profiles. Then, the feature selection technique based on the random forest algorithm was utilized to reduce redundant or irrelevant features without much loss of information. Finally, the optimal features were input into a support vector machine classifier to carry out the prediction of T4SEs. Our experimental results demonstrated that iT4SE-EP outperformed most of existing methods based on the independent dataset test.
Collapse
|
7
|
Chen C, Hou J, Shi X, Yang H, Birchler JA, Cheng J. DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics 2021; 22:38. [PMID: 33522898 PMCID: PMC7852092 DOI: 10.1186/s12859-020-03952-1] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2020] [Accepted: 12/29/2020] [Indexed: 12/21/2022] Open
Abstract
Background Due to the complexity of the biological systems, the prediction of the potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide available information about the affinity and accessibility of genome and are commonly used features in binding sites prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and voices. Until now, no study has applied this approach in binding site inference from massively parallel sequencing data. The successful applications of attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors. Results In this study, we propose a novel tool (named DeepGRN) for transcription factors binding site prediction based on the combination of two components: single attention module and pairwise attention module. The performance of our methods is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potential informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN. Conclusions DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
Collapse
Affiliation(s)
- Chen Chen
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
| | - Jie Hou
- Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
| | - Xiaowen Shi
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Hua Yang
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - James A Birchler
- Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
| | - Jianlin Cheng
- Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA.
| |
Collapse
|
8
|
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. AN INTERNATIONAL JOURNAL ON INFORMATION FUSION 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 222] [Impact Index Per Article: 44.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Collapse
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University,
Stanford, CA, USA
| | - Francis Nguyen
- Department of Medical Biophysics, University of Toronto,
Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Bo Wang
- Hikvision Research Institute, Santa Clara, CA, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University,
Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anna Goldenberg
- Genetics & Genome Biology, SickKids Research Institute,
Toronto, ON, Canada
- Department of Computer Science, University of Toronto,
Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Michael M. Hoffman
- Department of Medical Biophysics, University of Toronto,
Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
- Department of Computer Science, University of Toronto,
Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| |
Collapse
|
9
|
Xu Y, Yang Y, Ding J, Li C. iGlu-Lys: A Predictor for Lysine Glutarylation Through Amino Acid Pair Order Features. IEEE Trans Nanobioscience 2018; 17:394-401. [DOI: 10.1109/tnb.2018.2848673] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
|
10
|
Jankowsky E, Harris ME. Mapping specificity landscapes of RNA-protein interactions by high throughput sequencing. Methods 2017; 118-119:111-118. [PMID: 28263887 DOI: 10.1016/j.ymeth.2017.03.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2016] [Revised: 02/07/2017] [Accepted: 03/01/2017] [Indexed: 12/28/2022] Open
Abstract
To function in a biological setting, RNA binding proteins (RBPs) have to discriminate between alternative binding sites in RNAs. This discrimination can occur in the ground state of an RNA-protein binding reaction, in its transition state, or in both. The extent by which RBPs discriminate at these reaction states defines RBP specificity landscapes. Here, we describe the HiTS-Kin and HiTS-EQ techniques, which combine kinetic and equilibrium binding experiments with high throughput sequencing to quantitatively assess substrate discrimination for large numbers of substrate variants at ground and transition states of RNA-protein binding reactions. We discuss experimental design, practical considerations and data analysis and outline how a combination of HiTS-Kin and HiTS-EQ allows the mapping of RBP specificity landscapes.
Collapse
Affiliation(s)
- Eckhard Jankowsky
- Center for RNA Molecular Biology, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States; Department of Biochemistry, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States.
| | - Michael E Harris
- Department of Biochemistry, School of Medicine, Case Western Reserve University, 1099 Euclid Ave Cleveland, OH 44106, United States
| |
Collapse
|
11
|
Xu Y, Li L, Ding J, Wu LY, Mai G, Zhou F. Gly-PseAAC: Identifying protein lysine glycation through sequences. Gene 2017; 602:1-7. [DOI: 10.1016/j.gene.2016.11.021] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2016] [Revised: 08/29/2016] [Accepted: 11/10/2016] [Indexed: 11/29/2022]
|
12
|
Peng PC, Sinha S. Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res 2016; 44:e120. [PMID: 27257066 PMCID: PMC5291265 DOI: 10.1093/nar/gkw446] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2015] [Revised: 05/06/2016] [Accepted: 05/09/2016] [Indexed: 12/11/2022] Open
Abstract
Prediction of gene expression levels driven by regulatory sequences is pivotal in genomic biology. A major focus in transcriptional regulation is sequence-to-expression modeling, which interprets the enhancer sequence based on transcription factor concentrations and DNA binding specificities and predicts precise gene expression levels in varying cellular contexts. Such models largely rely on the position weight matrix (PWM) model for DNA binding, and the effect of alternative models based on DNA shape remains unexplored. Here, we propose a statistical thermodynamics model of gene expression using DNA shape features of binding sites. We used rigorous methods to evaluate the fits of expression readouts of 37 enhancers regulating spatial gene expression patterns in Drosophila embryo, and show that DNA shape-based models perform arguably better than PWM-based models. We also observed DNA shape captures information complimentary to the PWM, in a way that is useful for expression modeling. Furthermore, we tested if combining shape and PWM-based features provides better predictions than using either binding model alone. Our work demonstrates that the increasingly popular DNA-binding models based on local DNA shape can be useful in sequence-to-expression modeling. It also provides a framework for future studies to predict gene expression better than with PWM models alone.
Collapse
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| |
Collapse
|
13
|
Contribution of Sequence Motif, Chromatin State, and DNA Structure Features to Predictive Models of Transcription Factor Binding in Yeast. PLoS Comput Biol 2015; 11:e1004418. [PMID: 26291518 PMCID: PMC4546298 DOI: 10.1371/journal.pcbi.1004418] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2014] [Accepted: 06/29/2015] [Indexed: 11/19/2022] Open
Abstract
Transcription factor (TF) binding is determined by the presence of specific sequence motifs (SM) and chromatin accessibility, where the latter is influenced by both chromatin state (CS) and DNA structure (DS) properties. Although SM, CS, and DS have been used to predict TF binding sites, a predictive model that jointly considers CS and DS has not been developed to predict either TF-specific binding or general binding properties of TFs. Using budding yeast as model, we found that machine learning classifiers trained with either CS or DS features alone perform better in predicting TF-specific binding compared to SM-based classifiers. In addition, simultaneously considering CS and DS further improves the accuracy of the TF binding predictions, indicating the highly complementary nature of these two properties. The contributions of SM, CS, and DS features to binding site predictions differ greatly between TFs, allowing TF-specific predictions and potentially reflecting different TF binding mechanisms. In addition, a "TF-agnostic" predictive model based on three DNA “intrinsic properties” (in silico predicted nucleosome occupancy, major groove geometry, and dinucleotide free energy) that can be calculated from genomic sequences alone has performance that rivals the model incorporating experiment-derived data. This intrinsic property model allows prediction of binding regions not only across TFs, but also across DNA-binding domain families with distinct structural folds. Furthermore, these predicted binding regions can help identify TF binding sites that have a significant impact on target gene expression. Because the intrinsic property model allows prediction of binding regions across DNA-binding domain families, it is TF agnostic and likely describes general binding potential of TFs. Thus, our findings suggest that it is feasible to establish a TF agnostic model for identifying functional regulatory regions in potentially any sequenced genome. Identification of transcription factor binding sites based on sequence motifs is typically accompanied by a high false positive rate. Increasing evidence suggests that there are many other factors besides DNA sequence that may affect the binding and interaction of TFs with DNA. Through the integration of sequence motif, chromatin state, and DNA structure properties, we show that TF binding can be better predicted. Moreover, considering chromatin state and DNA structure properties simultaneously yields a significant improvement. While the binding of some TFs can be readily predicted using either chromatin state information or DNA structure, other TFs need both. Thus, our findings provide insights on how different histone modifications and DNA structure properties may influence the binding of a particular TF and thus how TFs regulate gene expression. These features are referred to as sequence “intrinsic properties” because they can be predicted from sequences alone. These intrinsic properties can be used to build a TF binding prediction model that has a similar performance to considering all features. Moreover, the intrinsic property model allows TFBS predictions not only across TFs, but also across DNA-binding domain families that are present in most eukaryotes, suggesting that the model likely can be used across species.
Collapse
|
14
|
Abstract
To fully understand the regulation of gene expression, it is critical to quantitatively define whether and how RNA-binding proteins (RBPs) discriminate between alternative binding sites in RNAs. Here, we describe new methods that measure protein binding to large numbers of RNA variants, and ways to analyse and interpret data obtained by these approaches, including affinity distributions and free energy landscapes. We discuss how the new methodologies and the associated concepts enable the development of inclusive, quantitative models for RNA-protein interactions that transcend the traditional binary classification of RBPs as either specific or nonspecific.
Collapse
|
15
|
Xu Y, Ding YX, Ding J, Wu LY, Deng NY. Phogly–PseAAC: Prediction of lysine phosphoglycerylation in proteins incorporating with position-specific propensity. J Theor Biol 2015; 379:10-5. [DOI: 10.1016/j.jtbi.2015.04.016] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/22/2015] [Revised: 03/17/2015] [Accepted: 04/11/2015] [Indexed: 01/04/2023]
|
16
|
Yang J, Ramsey SA. A DNA shape-based regulatory score improves position-weight matrix-based recognition of transcription factor binding sites. Bioinformatics 2015; 31:3445-50. [PMID: 26130577 DOI: 10.1093/bioinformatics/btv391] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2014] [Accepted: 06/24/2015] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION The position-weight matrix (PWM) is a useful representation of a transcription factor binding site (TFBS) sequence pattern because the PWM can be estimated from a small number of representative TFBS sequences. However, because the PWM probability model assumes independence between individual nucleotide positions, the PWMs for some TFs poorly discriminate binding sites from non-binding-sites that have similar sequence content. Since the local three-dimensional DNA structure ('shape') is a determinant of TF binding specificity and since DNA shape has a significant sequence-dependence, we combined DNA shape-derived features into a TF-generalized regulatory score and tested whether the score could improve PWM-based discrimination of TFBS from non-binding-sites. RESULTS We compared a traditional PWM model to a model that combines the PWM with a DNA shape feature-based regulatory potential score, for accuracy in detecting binding sites for 75 vertebrate transcription factors. The PWM+shape model was more accurate than the PWM-only model, for 45% of TFs tested, with no significant loss of accuracy for the remaining TFs. AVAILABILITY AND IMPLEMENTATION The shape-based model is available as an open-source R package at that is archived on the GitHub software repository at https://github.com/ramseylab/regshape/. CONTACT stephen.ramsey@oregonstate.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Stephen A Ramsey
- Department of Biomedical Sciences and School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR, USA
| |
Collapse
|
17
|
Hennig J, Sattler M. Deciphering the protein-RNA recognition code: Combining large-scale quantitative methods with structural biology. Bioessays 2015; 37:899-908. [DOI: 10.1002/bies.201500033] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022]
Affiliation(s)
- Janosch Hennig
- Institute of Structural Biology; Helmholtz Zentrum M; ü; nchen; München Germany
- Department Chemie; Center for Integrated Protein Science Munich at Biomolecular NMR Spectroscopy; Technische Universität München; Garching Germany
| | - Michael Sattler
- Institute of Structural Biology; Helmholtz Zentrum M; ü; nchen; München Germany
- Department Chemie; Center for Integrated Protein Science Munich at Biomolecular NMR Spectroscopy; Technische Universität München; Garching Germany
| |
Collapse
|
18
|
Abstract
Recent advances in experimental and computational methodologies are enabling ultra-high resolution genome-wide profiles of protein-DNA binding events. For example, the ChIP-exo protocol precisely characterizes protein-DNA cross-linking patterns by combining chromatin immunoprecipitation (ChIP) with 5' → 3' exonuclease digestion. Similarly, deeply sequenced chromatin accessibility assays (e.g. DNase-seq and ATAC-seq) enable the detection of protected footprints at protein-DNA binding sites. With these techniques and others, we have the potential to characterize the individual nucleotides that interact with transcription factors, nucleosomes, RNA polymerases and other regulatory proteins in a particular cellular context. In this review, we explain the experimental assays and computational analysis methods that enable high-resolution profiling of protein-DNA binding events. We discuss the challenges and opportunities associated with such approaches.
Collapse
Affiliation(s)
- Shaun Mahony
- a Department of Biochemistry & Molecular Biology , Center for Eukaryotic Gene Regulation, The Pennsylvania State University , University Park , PA , USA
| | - B Franklin Pugh
- a Department of Biochemistry & Molecular Biology , Center for Eukaryotic Gene Regulation, The Pennsylvania State University , University Park , PA , USA
| |
Collapse
|
19
|
Abe N, Dror I, Yang L, Slattery M, Zhou T, Bussemaker HJ, Rohs R, Mann RS. Deconvolving the recognition of DNA shape from sequence. Cell 2015; 161:307-18. [PMID: 25843630 DOI: 10.1016/j.cell.2015.02.008] [Citation(s) in RCA: 138] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2014] [Revised: 12/08/2014] [Accepted: 01/26/2015] [Indexed: 01/25/2023]
Abstract
Protein-DNA binding is mediated by the recognition of the chemical signatures of the DNA bases and the 3D shape of the DNA molecule. Because DNA shape is a consequence of sequence, it is difficult to dissociate these modes of recognition. Here, we tease them apart in the context of Hox-DNA binding by mutating residues that, in a co-crystal structure, only recognize DNA shape. Complexes made with these mutants lose the preference to bind sequences with specific DNA shape features. Introducing shape-recognizing residues from one Hox protein to another swapped binding specificities in vitro and gene regulation in vivo. Statistical machine learning revealed that the accuracy of binding specificity predictions improves by adding shape features to a model that only depends on sequence, and feature selection identified shape features important for recognition. Thus, shape readout is a direct and independent component of binding site selection by Hox proteins.
Collapse
Affiliation(s)
- Namiko Abe
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA
| | - Iris Dror
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Biology, Technion - Israel Institute of Technology, Haifa 32000, Israel
| | - Lin Yang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY 10032, USA
| | - Remo Rohs
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA; Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA; Department of Physics and Astronomy, University of Southern California, Los Angeles, CA 90089, USA; Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| | - Richard S Mann
- Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA; Department of Systems Biology, Columbia University, New York, NY 10032, USA.
| |
Collapse
|
20
|
Dai Z, Guo D, Dai X, Xiong Y. Genome-wide analysis of transcription factor binding sites and their characteristic DNA structures. BMC Genomics 2015; 16 Suppl 3:S8. [PMID: 25708259 PMCID: PMC4331811 DOI: 10.1186/1471-2164-16-s3-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Background Transcription factors (TF) regulate gene expression by binding DNA regulatory regions. Transcription factor binding sites (TFBSs) are conserved not only in primary DNA sequences but also in DNA structures. However, the global relationship between TFs and their preferred DNA structures remains to be elucidated. Results In this paper, we have developed a computational method to generate a genome-wide landscape of TFs and their characteristic binding DNA structures in Saccharomyces cerevisiae. We revealed DNA structural features for different TFs. The structural conservation shows positional preference in TFBSs. Structural levels of DNA sequences are correlated with TF-DNA binding affinities. Conclusions We provided the genome-wide correspondences of TFs to DNA structures. Our findings will have implications in understanding TF regulatory mechanisms.
Collapse
|
21
|
Chiu TP, Yang L, Zhou T, Main BJ, Parker SCJ, Nuzhdin SV, Tullius TD, Rohs R. GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 2014; 43:D103-9. [PMID: 25326329 PMCID: PMC4384032 DOI: 10.1093/nar/gku977] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2023] Open
Abstract
Many regulatory mechanisms require a high degree of specificity in protein-DNA binding. Nucleotide sequence does not provide an answer to the question of why a protein binds only to a small subset of the many putative binding sites in the genome that share the same core motif. Whereas higher-order effects, such as chromatin accessibility, cooperativity and cofactors, have been described, DNA shape recently gained attention as another feature that fine-tunes the DNA binding specificities of some transcription factor families. Our Genome Browser for DNA shape annotations (GBshape; freely available at http://rohslab.cmb.usc.edu/GBshape/) provides minor groove width, propeller twist, roll, helix twist and hydroxyl radical cleavage predictions for the entire genomes of 94 organisms. Additional genomes can easily be added using the GBshape framework. GBshape can be used to visualize DNA shape annotations qualitatively in a genome browser track format, and to download quantitative values of DNA shape features as a function of genomic position at nucleotide resolution. As biological applications, we illustrate the periodicity of DNA shape features that are present in nucleosome-occupied sequences from human, fly and worm, and we demonstrate structural similarities between transcription start sites in the genomes of four Drosophila species.
Collapse
Affiliation(s)
- Tsu-Pei Chiu
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Bradley J Main
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Stephen C J Parker
- Departments of Computational Medicine and Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Sergey V Nuzhdin
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Thomas D Tullius
- Department of Chemistry and Program in Bioinformatics, Boston University, Boston, MA 02215, USA
| | - Remo Rohs
- Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA Departments of Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
22
|
Slattery M, Zhou T, Yang L, Dantas Machado AC, Gordân R, Rohs R. Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 2014; 39:381-99. [PMID: 25129887 DOI: 10.1016/j.tibs.2014.07.002] [Citation(s) in RCA: 352] [Impact Index Per Article: 35.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2014] [Revised: 07/11/2014] [Accepted: 07/15/2014] [Indexed: 12/21/2022]
Abstract
Transcription factors (TFs) influence cell fate by interpreting the regulatory DNA within a genome. TFs recognize DNA in a specific manner; the mechanisms underlying this specificity have been identified for many TFs based on 3D structures of protein-DNA complexes. More recently, structural views have been complemented with data from high-throughput in vitro and in vivo explorations of the DNA-binding preferences of many TFs. Together, these approaches have greatly expanded our understanding of TF-DNA interactions. However, the mechanisms by which TFs select in vivo binding sites and alter gene expression remain unclear. Recent work has highlighted the many variables that influence TF-DNA binding, while demonstrating that a biophysical understanding of these many factors will be central to understanding TF function.
Collapse
Affiliation(s)
- Matthew Slattery
- Department of Biomedical Sciences, University of Minnesota Medical School, Duluth, MN 55812, USA; Developmental Biology Center, University of Minnesota, Minneapolis, MN 55455, USA.
| | - Tianyin Zhou
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Ana Carolina Dantas Machado
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Raluca Gordân
- Center for Genomic and Computational Biology, Departments of Biostatistics and Bioinformatics, Computer Science, and Molecular Genetics and Microbiology, Duke University, Durham, NC 27708, USA.
| | - Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA.
| |
Collapse
|
23
|
Yang L, Zhou T, Dror I, Mathelier A, Wasserman WW, Gordân R, Rohs R. TFBSshape: a motif database for DNA shape features of transcription factor binding sites. Nucleic Acids Res 2013; 42:D148-55. [PMID: 24214955 PMCID: PMC3964943 DOI: 10.1093/nar/gkt1087] [Citation(s) in RCA: 91] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2022] Open
Abstract
Transcription factor binding sites (TFBSs) are most commonly characterized by the nucleotide preferences at each position of the DNA target. Whereas these sequence motifs are quite accurate descriptions of DNA binding specificities of transcription factors (TFs), proteins recognize DNA as a three-dimensional object. DNA structural features refine the description of TF binding specificities and provide mechanistic insights into protein-DNA recognition. Existing motif databases contain extensive nucleotide sequences identified in binding experiments based on their selection by a TF. To utilize DNA shape information when analysing the DNA binding specificities of TFs, we developed a new tool, the TFBSshape database (available at http://rohslab.cmb.usc.edu/TFBSshape/), for calculating DNA structural features from nucleotide sequences provided by motif databases. The TFBSshape database can be used to generate heat maps and quantitative data for DNA structural features (i.e., minor groove width, roll, propeller twist and helix twist) for 739 TF datasets from 23 different species derived from the motif databases JASPAR and UniPROBE. As demonstrated for the basic helix-loop-helix and homeodomain TF families, our TFBSshape database can be used to compare, qualitatively and quantitatively, the DNA binding specificities of closely related TFs and, thus, uncover differential DNA binding specificities that are not apparent from nucleotide sequence alone.
Collapse
Affiliation(s)
- Lin Yang
- Molecular and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA, Department of Biology, Technion - Israel Institute of Technology, Technion City, Haifa 32000, Israel, Centre for Molecular Medicine and Therapeutics, University of British Columbia, Vancouver, BC, Canada and Institute for Genome Sciences & Policy, Duke University, Durham, NC 27708, USA
| | | | | | | | | | | | | |
Collapse
|
24
|
Computational studies on Alzheimer’s disease associated pathways and regulatory patterns using microarray gene expression and network data: Revealed association with aging and other diseases. J Theor Biol 2013; 334:109-21. [DOI: 10.1016/j.jtbi.2013.06.013] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2012] [Revised: 06/07/2013] [Accepted: 06/10/2013] [Indexed: 12/31/2022]
|
25
|
Broos S, Soete A, Hooghe B, Moran R, van Roy F, De Bleser P. PhysBinder: Improving the prediction of transcription factor binding sites by flexible inclusion of biophysical properties. Nucleic Acids Res 2013; 41:W531-4. [PMID: 23620286 PMCID: PMC3692127 DOI: 10.1093/nar/gkt288] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2013] [Revised: 03/24/2013] [Accepted: 03/31/2013] [Indexed: 01/12/2023] Open
Abstract
The most important mechanism in the regulation of transcription is the binding of a transcription factor (TF) to a DNA sequence called the TF binding site (TFBS). Most binding sites are short and degenerate, which makes predictions based on their primary sequence alone somewhat unreliable. We present a new web tool that implements a flexible and extensible algorithm for predicting TFBS. The algorithm makes use of both direct (the sequence) and several indirect readout features of protein-DNA complexes (biophysical properties such as bendability or the solvent-excluded surface of the DNA). This algorithm significantly outperforms state-of-the-art approaches for in silico identification of TFBS. Users can submit FASTA sequences for analysis in the PhysBinder integrative algorithm and choose from >60 different TF-binding models. The results of this analysis can be used to plan and steer wet-lab experiments. The PhysBinder web tool is freely available at http://bioit.dmbr.ugent.be/physbinder/index.php.
Collapse
Affiliation(s)
- Stefan Broos
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| | - Arne Soete
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| | - Bart Hooghe
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| | - Raymond Moran
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| | - Frans van Roy
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| | - Pieter De Bleser
- Department for Molecular Biomedical Research, VIB and Department of Biomedical Molecular Biology, Ghent University, B-9052 Ghent, Belgium
| |
Collapse
|
26
|
Nowak-Lovato K, Alexandrov LB, Banisadr A, Bauer AL, Bishop AR, Usheva A, Mu F, Hong-Geller E, Rasmussen KØ, Hlavacek WS, Alexandrov BS. Binding of nucleoid-associated protein fis to DNA is regulated by DNA breathing dynamics. PLoS Comput Biol 2013; 9:e1002881. [PMID: 23341768 PMCID: PMC3547798 DOI: 10.1371/journal.pcbi.1002881] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2012] [Accepted: 11/29/2012] [Indexed: 12/23/2022] Open
Abstract
Physicochemical properties of DNA, such as shape, affect protein-DNA recognition. However, the properties of DNA that are most relevant for predicting the binding sites of particular transcription factors (TFs) or classes of TFs have yet to be fully understood. Here, using a model that accurately captures the melting behavior and breathing dynamics (spontaneous local openings of the double helix) of double-stranded DNA, we simulated the dynamics of known binding sites of the TF and nucleoid-associated protein Fis in Escherichia coli. Our study involves simulations of breathing dynamics, analysis of large published in vitro and genomic datasets, and targeted experimental tests of our predictions. Our simulation results and available in vitro binding data indicate a strong correlation between DNA breathing dynamics and Fis binding. Indeed, we can define an average DNA breathing profile that is characteristic of Fis binding sites. This profile is significantly enriched among the identified in vivo E. coli Fis binding sites. To test our understanding of how Fis binding is influenced by DNA breathing dynamics, we designed base-pair substitutions, mismatch, and methylation modifications of DNA regions that are known to interact (or not interact) with Fis. The goal in each case was to make the local DNA breathing dynamics either closer to or farther from the breathing profile characteristic of a strong Fis binding site. For the modified DNA segments, we found that Fis-DNA binding, as assessed by gel-shift assay, changed in accordance with our expectations. We conclude that Fis binding is associated with DNA breathing dynamics, which in turn may be regulated by various nucleotide modifications. Cellular transcription factors (TFs) are proteins that regulate gene expression, and thereby cellular activity and fate, by binding to specific DNA segments. The physicochemical determinants of protein-DNA binding specificity are not completely understood. Here, we report that the propensity of transient opening and re-closing of the double helix, resulting from thermal fluctuations, aka “DNA breathing” or “DNA bubbles,” can be associated with binding affinity in the case of Fis, a well-studied nucleoid-associated protein in Escherichia coli. We found that a particular breathing profile is characteristic of high-affinity Fis binding sites and that DNA fragments known to bind Fis in vivo are statistically enriched for this profile. Furthermore, we used simulations of DNA breathing dynamics to guide design of gel-shift experiments aimed at testing the idea that local breathing influences Fis binding. As a result, we show that via nucleotide modifications but without modifying nucleotides that directly contact Fis, we were able to transform a low-affinity Fis binding site into a high-affinity site and vice versa. The nucleotide modifications were designed only based on DNA breathing simulations. Our study suggests that strong Fis-DNA binding depends on DNA breathing - a novel physicochemical characteristic that could be used for prediction and rational design of TF binding sites.
Collapse
Affiliation(s)
- Kristy Nowak-Lovato
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Ludmil B. Alexandrov
- Cancer Genome Project, Wellcome Trust Sanger Institute, Cambridge, United Kingdom
| | - Afsheen Banisadr
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Amy L. Bauer
- X-Theoretical Design Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Alan R. Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Anny Usheva
- Harvard Medical School, Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
| | - Fangping Mu
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Elizabeth Hong-Geller
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - Kim Ø. Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
| | - William S. Hlavacek
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
| | - Boian S. Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico, United States of America
- * E-mail: (WSH); (BSA)
| |
Collapse
|