1
|
Zhuang J, Huang X, Liu S, Gao W, Su R, Feng K. MulTFBS: A Spatial-Temporal Network with Multichannels for Predicting Transcription Factor Binding Sites. J Chem Inf Model 2024; 64:4322-4333. [PMID: 38733561 DOI: 10.1021/acs.jcim.3c02088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/13/2024]
Abstract
Revealing the mechanisms that influence transcription factor binding specificity is the key to understanding gene regulation. In previous studies, DNA double helix structure and one-hot embedding have been used successfully to design computational methods for predicting transcription factor binding sites (TFBSs). However, DNA sequence as a kind of biological language, the method of word embedding representation in natural language processing, has not been considered properly in TFBS prediction models. In our work, we integrate different types of features of DNA sequence to design a multichanneled deep learning framework, namely MulTFBS, in which independent one-hot encoding, word embedding encoding, which can incorporate contextual information and extract the global features of the sequences, and double helix three-dimensional structural features have been trained in different channels. To extract sequence high-level information effectively, in our deep learning framework, we select the spatial-temporal network by combining convolutional neural networks and bidirectional long short-term memory networks with attention mechanism. Compared with six state-of-the-art methods on 66 universal protein-binding microarray data sets of different transcription factors, MulTFBS performs best on all data sets in the regression tasks, with the average R2 of 0.698 and the average PCC of 0.833, which are 5.4% and 3.2% higher, respectively, than the suboptimal method CRPTS. In addition, we evaluate the classification performance of MulTFBS for distinguishing bound or unbound regions on TF ChIP-seq data. The results show that our framework also performs well in the TFBS classification tasks.
Collapse
Affiliation(s)
- Jujuan Zhuang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Xinru Huang
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Shuhan Liu
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Wanquan Gao
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Rui Su
- The School of Science, Dalian Maritime University, Dalian 116026, China
| | - Kexin Feng
- The School of Science, Dalian Maritime University, Dalian 116026, China
| |
Collapse
|
2
|
Chen N, Yu J, Liu Z, Meng L, Li X, Wong KC. Discovering DNA shape motifs with multiple DNA shape features: generalization, methods, and validation. Nucleic Acids Res 2024; 52:4137-4150. [PMID: 38572749 PMCID: PMC11077088 DOI: 10.1093/nar/gkae210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2023] [Revised: 03/06/2024] [Accepted: 03/12/2024] [Indexed: 04/05/2024] Open
Abstract
DNA motifs are crucial patterns in gene regulation. DNA-binding proteins (DBPs), including transcription factors, can bind to specific DNA motifs to regulate gene expression and other cellular activities. Past studies suggest that DNA shape features could be subtly involved in DNA-DBP interactions. Therefore, the shape motif annotations based on intrinsic DNA topology can deepen the understanding of DNA-DBP binding. Nevertheless, high-throughput tools for DNA shape motif discovery that incorporate multiple features altogether remain insufficient. To address it, we propose a series of methods to discover non-redundant DNA shape motifs with the generalization to multiple motifs in multiple shape features. Specifically, an existing Gibbs sampling method is generalized to multiple DNA motif discovery with multiple shape features. Meanwhile, an expectation-maximization (EM) method and a hybrid method coupling EM with Gibbs sampling are proposed and developed with promising performance, convergence capability, and efficiency. The discovered DNA shape motif instances reveal insights into low-signal ChIP-seq peak summits, complementing the existing sequence motif discovery works. Additionally, our modelling captures the potential interplays across multiple DNA shape features. We provide a valuable platform of tools for DNA shape motif discovery. An R package is built for open accessibility and long-lasting impact: https://zenodo.org/doi/10.5281/zenodo.10558980.
Collapse
Affiliation(s)
- Nanjun Chen
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Jixiang Yu
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Zhe Liu
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Lingkuan Meng
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Changchun City, Jilin Province, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Hong Kong Institute of Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
- Shenzhen Research Institute, City University of Hong Kong, Shenzhen, China
| |
Collapse
|
3
|
Zhang Q, Zhang Y, Wang S, Chen ZH, Gribova V, Filaretov VF, Huang DS. Predicting In-Vitro DNA-Protein Binding With a Spatially Aligned Fusion of Sequence and Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:3144-3153. [PMID: 34882561 DOI: 10.1109/tcbb.2021.3133869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is of primary importance for understanding the underlying binding mechanic and gene regulation process. Growing evidence indicates that apart from the primary DNA sequences, DNA shape landscape has a significant influence on transcription factor binding preference. To effectively model the co-influence of sequence and shape features, we emphasize the importance of position information of sequence motif and shape pattern. In this paper, we propose a novel deep learning-based architecture, named hybridShape eDeepCNN, for TFBS prediction which integrates DNA sequence and shape information in a spatially aligned manner. Our model utilizes the power of the multi-layer convolutional neural network and constructs an independent subnetwork to adapt for the distinct data distribution of heterogeneous features. Besides, we explore the usage of continuous embedding vectors as the representation of DNA sequences. Based on the experiments on 20 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we demonstrate the superiority of our proposed method and validate the underlying design logic.
Collapse
|
4
|
Barissi S, Sala A, Wieczór M, Battistini F, Orozco M. DNAffinity: a machine-learning approach to predict DNA binding affinities of transcription factors. Nucleic Acids Res 2022; 50:9105-9114. [PMID: 36018808 PMCID: PMC9458447 DOI: 10.1093/nar/gkac708] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2022] [Revised: 07/21/2022] [Accepted: 08/08/2022] [Indexed: 12/24/2022] Open
Abstract
We present a physics-based machine learning approach to predict in vitro transcription factor binding affinities from structural and mechanical DNA properties directly derived from atomistic molecular dynamics simulations. The method is able to predict affinities obtained with techniques as different as uPBM, gcPBM and HT-SELEX with an excellent performance, much better than existing algorithms. Due to its nature, the method can be extended to epigenetic variants, mismatches, mutations, or any non-coding nucleobases. When complemented with chromatin structure information, our in vitro trained method provides also good estimates of in vivo binding sites in yeast.
Collapse
Affiliation(s)
| | | | - Miłosz Wieczór
- Institute for Research in Biomedicine (IRB Barcelona). The Barcelona Institute of Science and Technology. Baldiri Reixac 10–12, 08028 Barcelona, Spain,Department of Physical Chemistry. Gdansk University of Technology, 80-233 Gdańsk, Poland
| | | | - Modesto Orozco
- Correspondence may also be addressed to Modesto Orozco. Tel: +34 934 037 156;
| |
Collapse
|
5
|
Zhang Y, Mo Q, Xue L, Luo J. Evaluation of deep learning approaches for modeling transcription factor sequence specificity. Genomics 2021; 113:3774-3781. [PMID: 34534646 DOI: 10.1016/j.ygeno.2021.09.009] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/07/2021] [Revised: 07/19/2021] [Accepted: 09/11/2021] [Indexed: 11/16/2022]
Abstract
As a key component of gene regulation, transcription factors (TFs) play an important role in a number of biological processes. To fully understand the underlying mechanism of TF-mediated gene regulation, it is therefore critical to accurately identify TF binding sites and predict their affinities. Recently, deep learning (DL) algorithms have achieved promising results in the prediction of DNA-TF binding, however, various deep learning architectures have not been systematically compared, and the relative merit of each architecture remains unclear. To address this problem, we applied four different deep learning architectures to SELEX-seq and HT-SELEX data, covering three species and 35 families. We evaluated and compared the performance of different deep neural models using 10-fold cross-validation. Our results indicate that the hybrid CNN + DNN model shows the best performances. We expect that our study will be broadly applicable to modeling and predicting TF binding specificity when more high-throughput affinity data are available.
Collapse
Affiliation(s)
- Yonglin Zhang
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China
| | - Qi Mo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China
| | - Li Xue
- School of Public Health, Southwest Medical University, Luzhou 646000, China
| | - Jiesi Luo
- Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou 646000, China; Department of Pharmacy, The Affiliated Hospital of Southwest Medical University, Luzhou 646000, China; Sichuan Key Medical Laboratory of New Drug Discovery and Druggability Evaluation, Luzhou Key Laboratory of Activity Screening and Druggability Evaluation for Chinese Materia Medica, Southwest Medical University, Luzhou 646000, China.
| |
Collapse
|
6
|
Zhang Y, Wang Z, Zeng Y, Zhou J, Zou Q. High-resolution transcription factor binding sites prediction improved performance and interpretability by deep learning method. Brief Bioinform 2021; 22:6322761. [PMID: 34272562 DOI: 10.1093/bib/bbab273] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2021] [Revised: 06/19/2021] [Accepted: 06/25/2021] [Indexed: 11/14/2022] Open
Abstract
Transcription factors (TFs) are essential proteins in regulating the spatiotemporal expression of genes. It is crucial to infer the potential transcription factor binding sites (TFBSs) with high resolution to promote biology and realize precision medicine. Recently, deep learning-based models have shown exemplary performance in the prediction of TFBSs at the base-pair level. However, the previous models fail to integrate nucleotide position information and semantic information without noisy responses. Thus, there is still room for improvement. Moreover, both the inner mechanism and prediction results of these models are challenging to interpret. To this end, the Deep Attentive Encoder-Decoder Neural Network (D-AEDNet) is developed to identify the location of TFs-DNA binding sites in DNA sequences. In particular, our model adopts Skip Architecture to leverage the nucleotide position information in the encoder and removes noisy responses in the information fusion process by Attention Gate. Simultaneously, the Transcription Factor Motif Discovery based on Sliding Window (TF-MoDSW), an approach to discover TFs-DNA binding motifs by utilizing the output of neural networks, is proposed to understand the biological meaning of the predicted result. On ChIP-exo datasets, experimental results show that D-AEDNet has better performance than competing methods. Besides, we authenticate that Attention Gate can improve the interpretability of our model by ways of visualization analysis. Furthermore, we confirm that ability of D-AEDNet to learn TFs-DNA binding motifs outperform the state-of-the-art methods and availability of TF-MoDSW to discover biological sequence motifs in TFs-DNA interaction by conducting experiment on ChIP-seq datasets.
Collapse
Affiliation(s)
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Zixuan Wang
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Yuanqi Zeng
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Jiliu Zhou
- School of Computer Science, Chengdu University of Information Technology, 610225, Chengdu, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu, China
| |
Collapse
|
7
|
Zhang Q, Shen Z, Huang DS. Predicting in-vitro Transcription Factor Binding Sites Using DNA Sequence + Shape. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:667-676. [PMID: 31634140 DOI: 10.1109/tcbb.2019.2947461] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Discovery of transcription factor binding sites (TFBSs) is essential for understanding the underlying binding mechanisms and cellular functions. Recently, Convolutional neural network (CNN) has succeeded in predicting TFBSs from the primary DNA sequences. In addition to DNA sequences, several evidences suggest that protein-DNA binding is partly mediated by properties of DNA shape. Although many methods have been proposed to jointly account for DNA sequences and shape properties in predicting TFBSs, they ignore the power of the combination of deep learning and DNA sequence + shape. Therefore we develop a deep-learning-based sequence + shape framework (DLBSS) in this paper, which appropriately integrates DNA sequences and shape properties, to better understand protein-DNA binding preference. This method uses a shared CNN to find their common patterns from DNA sequences and their corresponding shape features, which are then concatenated to compute a predicted value. Using 66 in-vitro datasets derived from universal protein binding microarrays (uPBMs), we show that our proposed method DLBSS significantly improves the performance of predicting TFBSs. In addition, we explain the reason why we should use the shared CNN, and explore the performance of DLBSS when using a deeper CNN, through a series of experiments.
Collapse
|
8
|
Wang S, Zhang Q, Shen Z, He Y, Chen ZH, Li J, Huang DS. Predicting transcription factor binding sites using DNA shape features based on shared hybrid deep learning architecture. MOLECULAR THERAPY-NUCLEIC ACIDS 2021; 24:154-163. [PMID: 33767912 PMCID: PMC7972936 DOI: 10.1016/j.omtn.2021.02.014] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/19/2020] [Accepted: 02/14/2021] [Indexed: 12/26/2022]
Abstract
The study of transcriptional regulation is still difficult yet fundamental in molecular biology research. Recent research has shown that the double helix structure of nucleotides plays an important role in improving the accuracy and interpretability of transcription factor binding sites (TFBSs). Although several computational methods have been designed to take both DNA sequence and DNA shape features into consideration simultaneously, how to design an efficient model is still an intractable topic. In this paper, we proposed a hybrid convolutional recurrent neural network (CNN/RNN) architecture, CRPTS, to predict TFBSs by combining DNA sequence and DNA shape features. The novelty of our proposed method relies on three critical aspects: (1) the application of a shared hybrid CNN and RNN has the ability to efficiently extract features from large-scale genomic sequences obtained by high-throughput technology; (2) the common patterns were found from DNA sequences and their corresponding DNA shape features; (3) our proposed CRPTS can capture local structural information of DNA sequences without completely relying on DNA shape data. A series of comprehensive experiments on 66 in vitro datasets derived from universal protein binding microarrays (uPBMs) shows that our proposed method CRPTS obviously outperforms the state-of-the-art methods.
Collapse
Affiliation(s)
- Siguo Wang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Qinhu Zhang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China.,Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Siping Road 1239, Shanghai 200092, China
| | - Zhen Shen
- School of Computer and Software, Nanyang Institute of Technology, Changjiang Road 80, Nanyang, Henan 473004, China
| | - Ying He
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| | - Zhen-Heng Chen
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Jianqiang Li
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - De-Shuang Huang
- The Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, No. 4800 Caoan Road, Shanghai 201804, China
| |
Collapse
|
9
|
Jing F, Zhang SW, Cao Z, Zhang S. An Integrative Framework for Combining Sequence and Epigenomic Data to Predict Transcription Factor Binding Sites Using Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:355-364. [PMID: 30835229 DOI: 10.1109/tcbb.2019.2901789] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Knowing the transcription factor binding sites (TFBSs) is essential for modeling the underlying binding mechanisms and follow-up cellular functions. Convolutional neural networks (CNNs) have outperformed methods in predicting TFBSs from the primary DNA sequence. In addition to DNA sequences, histone modifications and chromatin accessibility are also important factors influencing their activity. They have been explored to predict TFBSs recently. However, current methods rarely take into account histone modifications and chromatin accessibility using CNN in an integrative framework. To this end, we developed a general CNN model to integrate these data for predicting TFBSs. We systematically benchmarked a series of architecture variants by changing network structure in terms of width and depth, and explored the effects of sample length at flanking regions. We evaluated the performance of the three types of data and their combinations using 256 ChIP-seq experiments and also compared it with competing machine learning methods. We find that contributions from these three types of data are complementary to each other. Moreover, the integrative CNN framework is superior to traditional machine learning methods with significant improvements.
Collapse
|
10
|
Dantas Machado AC, Cooper BH, Lei X, Di Felice R, Chen L, Rohs R. Landscape of DNA binding signatures of myocyte enhancer factor-2B reveals a unique interplay of base and shape readout. Nucleic Acids Res 2020; 48:8529-8544. [PMID: 32738045 PMCID: PMC7470950 DOI: 10.1093/nar/gkaa642] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2020] [Revised: 07/16/2020] [Accepted: 07/22/2020] [Indexed: 01/08/2023] Open
Abstract
Myocyte enhancer factor-2B (MEF2B) has the unique capability of binding to its DNA target sites with a degenerate motif, while still functioning as a gene-specific transcriptional regulator. Identifying its DNA targets is crucial given regulatory roles exerted by members of the MEF2 family and MEF2B's involvement in B-cell lymphoma. Analyzing structural data and SELEX-seq experimental results, we deduced the DNA sequence and shape determinants of MEF2B target sites on a high-throughput basis in vitro for wild-type and mutant proteins. Quantitative modeling of MEF2B binding affinities and computational simulations exposed the DNA readout mechanisms of MEF2B. The resulting binding signature of MEF2B revealed distinct intricacies of DNA recognition compared to other transcription factors. MEF2B uses base readout at its half-sites combined with shape readout at the center of its degenerate motif, where A-tract polarity dictates nuances of binding. The predominant role of shape readout at the center of the core motif, with most contacts formed in the minor groove, differs from previously observed protein-DNA readout modes. MEF2B, therefore, represents a unique protein for studies of the role of DNA shape in achieving binding specificity. MEF2B-DNA recognition mechanisms are likely representative for other members of the MEF2 family.
Collapse
Affiliation(s)
- Ana Carolina Dantas Machado
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Brendon H Cooper
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Xiao Lei
- Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
| | - Rosa Di Felice
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Physics & Astronomy, University of Southern California, Los Angeles, CA 90089, USA
| | - Lin Chen
- Molecular and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
- Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
| | - Remo Rohs
- Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Physics & Astronomy, University of Southern California, Los Angeles, CA 90089, USA
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
- Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA 90033, USA
- Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
11
|
Xu T, Zheng X, Li B, Jin P, Qin Z, Wu H. A comprehensive review of computational prediction of genome-wide features. Brief Bioinform 2020; 21:120-134. [PMID: 30462144 PMCID: PMC10233247 DOI: 10.1093/bib/bby110] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2018] [Revised: 10/15/2018] [Accepted: 10/16/2018] [Indexed: 12/15/2022] Open
Abstract
There are significant correlations among different types of genetic, genomic and epigenomic features within the genome. These correlations make the in silico feature prediction possible through statistical or machine learning models. With the accumulation of a vast amount of high-throughput data, feature prediction has gained significant interest lately, and a plethora of papers have been published in the past few years. Here we provide a comprehensive review on these published works, categorized by the prediction targets, including protein binding site, enhancer, DNA methylation, chromatin structure and gene expression. We also provide discussions on some important points and possible future directions.
Collapse
Affiliation(s)
- Tianlei Xu
- Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA
| | - Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
| | - Ben Li
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Peng Jin
- Department of Human Genetics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Zhaohui Qin
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| | - Hao Wu
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, Atlanta, GA, USA
| |
Collapse
|
12
|
Pal S, Hoinka J, Przytycka TM. Co-SELECT reveals sequence non-specific contribution of DNA shape to transcription factor binding in vitro. Nucleic Acids Res 2020; 47:6632-6641. [PMID: 31226207 DOI: 10.1093/nar/gkz540] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2018] [Revised: 05/31/2019] [Accepted: 06/06/2019] [Indexed: 12/22/2022] Open
Abstract
Understanding the principles of DNA binding by transcription factors (TFs) is of primary importance for studying gene regulation. Recently, several lines of evidence suggested that both DNA sequence and shape contribute to TF binding. However, the following compelling question is yet to be considered: in the absence of any sequence similarity to the binding motif, can DNA shape still increase binding probability? To address this challenge, we developed Co-SELECT, a computational approach to analyze the results of in vitro HT-SELEX experiments for TF-DNA binding. Specifically, Co-SELECT leverages the presence of motif-free sequences in late HT-SELEX rounds and their enrichment in weak binders allows Co-SELECT to detect an evidence for the role of DNA shape features in TF binding. Our approach revealed that, even in the absence of the sequence motif, TFs have propensity to bind to DNA molecules of the shape consistent with the motif specific binding. This provides the first direct evidence that shape features that accompany the preferred sequence motifs also bestow an advantage for weak, sequence non-specific binding.
Collapse
Affiliation(s)
- Soumitra Pal
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Jan Hoinka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Teresa M Przytycka
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| |
Collapse
|
13
|
Pataskar A, Vanderlinden W, Emmerig J, Singh A, Lipfert J, Tiwari VK. Deciphering the Gene Regulatory Landscape Encoded in DNA Biophysical Features. iScience 2019; 21:638-649. [PMID: 31731201 PMCID: PMC6889597 DOI: 10.1016/j.isci.2019.10.055] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2019] [Revised: 10/20/2019] [Accepted: 10/24/2019] [Indexed: 01/24/2023] Open
Abstract
Gene regulation in higher organisms involves a sophisticated interplay between genetic and epigenetic mechanisms. Despite advances, the logic in selective usage of certain genomic regions as regulatory elements remains unclear. Here we show that the inherent biophysical properties of the DNA encode epigenetic state and the underlying regulatory potential. We find that the propeller twist (ProT) level is indicative of genomic location of the regulatory elements, their strength, the affinity landscape of transcription factors, and distribution in the nuclear 3D space. We experimentally show that ProT levels confer increased DNA flexibility and surface accessibility, and thus potentially primes usage of high ProT regions as regulatory elements. ProT levels also correlate with occurrence and phenotypic consequences of mutations. Interestingly, cell-fate switches involve a transient usage of low ProT regulatory elements. Altogether, our work provides unprecedented insights into the gene regulatory landscape encoded in the DNA biophysical features. DNA shape features encode genomic surface accessibility and flexibility High ProT is a deterministic feature of enhancers ProT levels correlate with nuclear organization of epigenetic states Cell-fate switches involve a transient usage of low ProT regulatory elements
Collapse
Affiliation(s)
- Abhijeet Pataskar
- Netherlands Cancer Institute, Amsterdam, the Netherlands; Former Address: Institute of Molecular Biology, 55128 Mainz, Germany
| | - Willem Vanderlinden
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
| | - Johannes Emmerig
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
| | - Aditi Singh
- Wellcome-Wolfson Institute for Experimental Medicine, School of Medicine, Dentistry & Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK
| | - Jan Lipfert
- Department of Physics and Center for NanoScience, LMU Munich, 80799 Munich, Germany
| | - Vijay K Tiwari
- Wellcome-Wolfson Institute for Experimental Medicine, School of Medicine, Dentistry & Biomedical Science, Queens University Belfast, Belfast BT9 7BL, UK; Former Address: Institute of Molecular Biology, 55128 Mainz, Germany.
| |
Collapse
|
14
|
Samee MAH, Bruneau BG, Pollard KS. A De Novo Shape Motif Discovery Algorithm Reveals Preferences of Transcription Factors for DNA Shape Beyond Sequence Motifs. Cell Syst 2019; 8:27-42.e6. [PMID: 30660610 PMCID: PMC6368855 DOI: 10.1016/j.cels.2018.12.001] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2018] [Revised: 08/18/2018] [Accepted: 12/03/2018] [Indexed: 12/17/2022]
Abstract
DNA shape adds specificity to sequence motifs but has not been explored systematically outside this context. We hypothesized that DNA-binding proteins (DBPs) preferentially occupy DNA with specific structures ("shape motifs") regardless of whether or not these correspond to high information content sequence motifs. We present ShapeMF, a Gibbs sampling algorithm that identifies de novo shape motifs. Using binding data from hundreds of in vivo and in vitro experiments, we show that most DBPs have shape motifs and can occupy these in the absence of sequence motifs. This "shape-only binding" is common for many DBPs and in regions co-bound by multiple DBPs. When shape and sequence motifs co-occur, they can be overlapping, flanking, or separated by consistent spacing. Finally, DBPs within the same protein family have different shape motifs, explaining their distinct genome-wide occupancy despite having similar sequence motifs. These results suggest that shape motifs not only complement sequence motifs but also facilitate recognition of DNA beyond conventionally defined sequence motifs.
Collapse
Affiliation(s)
| | - Benoit G Bruneau
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Pediatrics and Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA 94158, USA
| | - Katherine S Pollard
- Gladstone Institutes, San Francisco, CA 94158, USA; Department of Epidemiology & Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Chan-Zuckerberg Biohub, San Francisco, CA 94158, USA.
| |
Collapse
|
15
|
Rogers JM, Bulyk ML. Diversification of transcription factor-DNA interactions and the evolution of gene regulatory networks. WILEY INTERDISCIPLINARY REVIEWS. SYSTEMS BIOLOGY AND MEDICINE 2018; 10:e1423. [PMID: 29694718 PMCID: PMC6202284 DOI: 10.1002/wsbm.1423] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/08/2017] [Revised: 02/23/2018] [Accepted: 03/11/2018] [Indexed: 01/17/2023]
Abstract
Sequence-specific transcription factors (TFs) bind short DNA sequences in the genome to regulate the expression of target genes. In the last decade, numerous technical advances have enabled the determination of the DNA-binding specificities of many of these factors. Large-scale screens of many TFs enabled the creation of databases of TF DNA-binding specificities, typically represented as position weight matrices (PWMs). Although great progress has been made in determining and predicting binding specificities systematically, there are still many surprises to be found when studying a particular TF's interactions with DNA in detail. Paralogous TFs' binding specificities can differ in subtle ways, in a manner that is not immediately apparent from looking at their PWMs. These differences affect gene regulatory outputs and enable TFs to rewire transcriptional networks over evolutionary time. This review discusses recent observations made in the study of TF-DNA interactions that highlight the importance of continued in-depth analysis of TF-DNA interactions and their inherent complexity. This article is categorized under: Biological Mechanisms > Regulatory Biology.
Collapse
Affiliation(s)
- Julia M. Rogers
- Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, 02115, USA,Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA, 02138, USA
| | - Martha L. Bulyk
- Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, 02115, USA,Committee on Higher Degrees in Biophysics, Harvard University, Cambridge, MA, 02138, USA,Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, 02115, USA
| |
Collapse
|
16
|
Rube HT, Rastogi C, Kribelbauer JF, Bussemaker HJ. A unified approach for quantifying and interpreting DNA shape readout by transcription factors. Mol Syst Biol 2018; 14:e7902. [PMID: 29472273 PMCID: PMC5822049 DOI: 10.15252/msb.20177902] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2017] [Revised: 01/26/2018] [Accepted: 01/31/2018] [Indexed: 01/07/2023] Open
Abstract
Transcription factors (TFs) interpret DNA sequence by probing the chemical and structural properties of the nucleotide polymer. DNA shape is thought to enable a parsimonious representation of dependencies between nucleotide positions. Here, we propose a unified mathematical representation of the DNA sequence dependence of shape and TF binding, respectively, which simplifies and enhances analysis of shape readout. First, we demonstrate that linear models based on mononucleotide features alone account for 60-70% of the variance in minor groove width, roll, helix twist, and propeller twist. This explains why simple scoring matrices that ignore all dependencies between nucleotide positions can partially account for DNA shape readout by a TF Adding dinucleotide features as sequence-to-shape predictors to our model, we can almost perfectly explain the shape parameters. Building on this observation, we developed a post hoc analysis method that can be used to analyze any mechanism-agnostic protein-DNA binding model in terms of shape readout. Our insights provide an alternative strategy for using DNA shape information to enhance our understanding of how cis-regulatory codes are interpreted by the cellular machinery.
Collapse
Affiliation(s)
- H Tomas Rube
- Department of Biological Sciences, Columbia University, New York, NY, USA
| | - Chaitanya Rastogi
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Program in Applied Physics and Applied Mathematics, Columbia University, New York, NY, USA
| | - Judith F Kribelbauer
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
| | - Harmen J Bussemaker
- Department of Biological Sciences, Columbia University, New York, NY, USA
- Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
| |
Collapse
|
17
|
Li J, Sagendorf JM, Chiu TP, Pasi M, Perez A, Rohs R. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 2018; 45:12877-12887. [PMID: 29165643 PMCID: PMC5728407 DOI: 10.1093/nar/gkx1145] [Citation(s) in RCA: 62] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 10/30/2017] [Indexed: 12/18/2022] Open
Abstract
Uncovering the mechanisms that affect the binding specificity of transcription factors (TFs) is critical for understanding the principles of gene regulation. Although sequence-based models have been used successfully to predict TF binding specificities, we found that including DNA shape information in these models improved their accuracy and interpretability. Previously, we developed a method for modeling DNA binding specificities based on DNA shape features extracted from Monte Carlo (MC) simulations. Prediction accuracies of our models, however, have not yet been compared to accuracies of models incorporating DNA shape information extracted from X-ray crystallography (XRC) data or Molecular Dynamics (MD) simulations. Here, we integrated DNA shape information extracted from MC or MD simulations and XRC data into predictive models of TF binding and compared their performance. Models that incorporated structural information consistently showed improved performance over sequence-based models regardless of data source. Furthermore, we derived and validated nine additional DNA shape features beyond our original set of four features. The expanded repertoire of 13 distinct DNA shape features, including six intra-base pair and six inter-base pair parameters and minor groove width, is available in our R/Bioconductor package DNAshapeR and enables a comprehensive structural description of the double helix on a genome-wide scale.
Collapse
Affiliation(s)
- Jinsen Li
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Jared M Sagendorf
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Tsu-Pei Chiu
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| | - Marco Pasi
- Centre for Biomolecular Sciences and School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, UK
| | - Alberto Perez
- Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
| | - Remo Rohs
- Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
| |
Collapse
|
18
|
Kakumanu A, Velasco S, Mazzoni E, Mahony S. Deconvolving sequence features that discriminate between overlapping regulatory annotations. PLoS Comput Biol 2017; 13:e1005795. [PMID: 29049320 PMCID: PMC5663517 DOI: 10.1371/journal.pcbi.1005795] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2017] [Revised: 10/31/2017] [Accepted: 09/26/2017] [Indexed: 11/19/2022] Open
Abstract
Genomic loci with regulatory potential can be annotated with various properties. For example, genomic sites bound by a given transcription factor (TF) can be divided according to whether they are proximal or distal to known promoters. Sites can be further labeled according to the cell types and conditions in which they are active. Given such a collection of labeled sites, it is natural to ask what sequence features are associated with each annotation label. However, discovering such label-specific sequence features is often confounded by overlaps between the labels; e.g. if regulatory sites specific to a given cell type are also more likely to be promoter-proximal, it is difficult to assess whether motifs identified in that set of sites are associated with the cell type or associated with promoters. In order to meet this challenge, we developed SeqUnwinder, a principled approach to deconvolving interpretable discriminative sequence features associated with overlapping annotation labels. We demonstrate the novel analysis abilities of SeqUnwinder using three examples. Firstly, SeqUnwinder is able to unravel sequence features associated with the dynamic binding behavior of TFs during motor neuron programming from features associated with chromatin state in the initial embryonic stem cells. Secondly, we characterize distinct sequence properties of multi-condition and cell-specific TF binding sites after controlling for uneven associations with promoter proximity. Finally, we demonstrate the scalability of SeqUnwinder to discover cell-specific sequence features from over one hundred thousand genomic loci that display DNase I hypersensitivity in one or more ENCODE cell lines. Transcription factor proteins control gene expression by recognizing and interacting with short DNA sequence patterns in regulatory regions on the genome. Current genomics experiments allow us to find regulatory regions associated with a particular biochemical activity over the entire genome; for example, all regions where a particular transcription factor interacts with the genome in a given cell type. Given a collection of regulatory regions, we often aim to discover short DNA sequence patterns that are more common in the collection than in other regions. Performing such “DNA motif-finding” analysis can give us hints about the patterns that determine gene regulation in the analyzed cell type. Here we describe a new method for DNA motif-finding called SeqUnwinder. Our approach analyzes collections of regulatory regions where each has been labeled according to various biological properties. For example, the labels could correspond to various cell types in which the regulatory region is active. SeqUnwinder then performs machine-learning analysis to unravel DNA sequence features that are characteristic of each label (e.g. features that distinguish regulatory regions in each cell type from other cell types). SeqUnwinder is the first method to enable analysis of regulatory region collections that contain several overlapping labels.
Collapse
Affiliation(s)
- Akshay Kakumanu
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
| | - Silvia Velasco
- Department of Biology, New York University, 100 Washington Square East, New York, NY, United States of America
| | - Esteban Mazzoni
- Department of Biology, New York University, 100 Washington Square East, New York, NY, United States of America
| | - Shaun Mahony
- Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
- * E-mail:
| |
Collapse
|
19
|
Andrabi M, Hutchins AP, Miranda-Saavedra D, Kono H, Nussinov R, Mizuguchi K, Ahmad S. Predicting conformational ensembles and genome-wide transcription factor binding sites from DNA sequences. Sci Rep 2017; 7:4071. [PMID: 28642456 PMCID: PMC5481346 DOI: 10.1038/s41598-017-03199-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2016] [Accepted: 04/26/2017] [Indexed: 12/24/2022] Open
Abstract
DNA shape is emerging as an important determinant of transcription factor binding beyond just the DNA sequence. The only tool for large scale DNA shape estimates, DNAshape was derived from Monte-Carlo simulations and predicts four broad and static DNA shape features, Propeller twist, Helical twist, Minor groove width and Roll. The contributions of other shape features e.g. Shift, Slide and Opening cannot be evaluated using DNAshape. Here, we report a novel method DynaSeq, which predicts molecular dynamics-derived ensembles of a more exhaustive set of DNA shape features. We compared the DNAshape and DynaSeq predictions for the common features and applied both to predict the genome-wide binding sites of 1312 TFs available from protein interaction quantification (PIQ) data. The results indicate a good agreement between the two methods for the common shape features and point to advantages in using DynaSeq. Predictive models employing ensembles from individual conformational parameters revealed that base-pair opening - known to be important in strand separation - was the best predictor of transcription factor-binding sites (TFBS) followed by features employed by DNAshape. Of note, TFBS could be predicted not only from the features at the target motif sites, but also from those as far as 200 nucleotides away from the motif.
Collapse
Affiliation(s)
- Munazah Andrabi
- National Institutes of Biomedical Innovation Health and Nutrition, 7-6-8, Saito-Asagi, Ibaraki, Osaka, 5670085, Japan
- Faculty of Biology,Medicine and Health, Michael Smith Building, The University of Manchester, Dover Street, Manchester, M13 9PT, UK
| | - Andrew Paul Hutchins
- Department of Biology, Southern University of Science and Technology of China, Shenzhen, 518055, China
| | - Diego Miranda-Saavedra
- World Premier International (WPI) Immunology Frontier Research Center (IFReC), Osaka University, 3-1 Yamadaoka, Suita, 565-0871, Osaka, Japan
- Centro de Biología Molecular Severo Ochoa, CSIC/Universidad Autónoma de Madrid, 28049, Madrid, Spain
- Department of Computer Science, University of Oxford Wolfson Building, Parks Road, OXFORD, OX1 3QD, United Kingdom
| | - Hidetoshi Kono
- Molecular Modeling and Simulation (MMS) Group, National Institutes for Quantum and Radiological Science and Technology, 8-1-7, Umemidai, Kizugawa, Kyoto, 619-0215, Japan
| | - Ruth Nussinov
- National Cancer Institute, Cancer and Inflammation Program, Leidos Biomedical Research, Inc. Frederick, Maryland, USA
- Department of Biochemistry and Human Genetics, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Kenji Mizuguchi
- National Institutes of Biomedical Innovation Health and Nutrition, 7-6-8, Saito-Asagi, Ibaraki, Osaka, 5670085, Japan
| | - Shandar Ahmad
- National Institutes of Biomedical Innovation Health and Nutrition, 7-6-8, Saito-Asagi, Ibaraki, Osaka, 5670085, Japan.
- School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Mehrauli Road, New Delhi, 110067, India.
| |
Collapse
|