1
Li J, Chiu TP, Rohs R. Predicting DNA structure using a deep learning method. Nat Commun 2024; 15:1243. [PMID: 38336958 PMCID: PMC10858265 DOI: 10.1038/s41467-024-45191-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Accepted: 01/17/2024] [Indexed: 02/12/2024]
Abstract
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA structure, also described as DNA shape, plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, DNA structural features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing an understanding of the effects of flanking regions on DNA structure in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
Affiliation(s)
- Jinsen Li
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
- Tsu-Pei Chiu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
  - Department of Chemistry, University of Southern California, Los Angeles, CA, 90089, USA
  - Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA
  - Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA, 90089, USA
2
Augustijn HE, Roseboom AM, Medema MH, van Wezel GP. Harnessing regulatory networks in Actinobacteria for natural product discovery. J Ind Microbiol Biotechnol 2024; 51:kuae011. [PMID: 38569653 PMCID: PMC10996143 DOI: 10.1093/jimb/kuae011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Accepted: 04/02/2024] [Indexed: 04/05/2024]
Abstract
Microbes typically live in complex habitats where they need to rapidly adapt to continuously changing growth conditions. To do so, they produce an astonishing array of natural products with diverse structures and functions. Actinobacteria stand out for their prolific production of bioactive molecules, including antibiotics, anticancer agents, antifungals, and immunosuppressants. Attention has been directed especially towards the identification of the compounds they produce and the mining of the large diversity of biosynthetic gene clusters (BGCs) in their genomes. However, the current return on investment in random screening for bioactive compounds is low, while it is hard to predict which of the millions of BGCs should be prioritized. Moreover, many of the BGCs for yet undiscovered natural products are silent or cryptic under laboratory growth conditions. To identify ways to prioritize and activate these BGCs, knowledge regarding the way their expression is controlled is crucial. Intricate regulatory networks control global gene expression in Actinobacteria, governed by a staggering number of up to 1000 transcription factors per strain. This review highlights recent advances in experimental and computational methods for characterizing and predicting transcription factor binding sites and their applications to guide natural product discovery. We propose that regulation-guided genome mining approaches will open new avenues toward eliciting the expression of BGCs, as well as prioritizing subsets of BGCs for expression using synthetic biology approaches. ONE-SENTENCE SUMMARY This review provides insights into advances in experimental and computational methods aimed at predicting transcription factor binding sites and their applications to guide natural product discovery.
Affiliation(s)
- Hannah E Augustijn
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
  - Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Anna M Roseboom
  - Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Marnix H Medema
  - Bioinformatics Group, Wageningen University, Wageningen, The Netherlands
  - Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
- Gilles P van Wezel
  - Molecular Biotechnology, Institute of Biology, Leiden University, Leiden, The Netherlands
  - Netherlands Institute for Ecology (NIOO-KNAW), Wageningen, The Netherlands
3
Li J, Chiu TP, Rohs R. Deep DNAshape: Predicting DNA shape considering extended flanking regions using a deep learning method. bioRxiv 2023:2023.10.22.563383. [PMID: 37961633 PMCID: PMC10634709 DOI: 10.1101/2023.10.22.563383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Understanding the mechanisms of protein-DNA binding is critical in comprehending gene regulation. Three-dimensional DNA shape plays a key role in these mechanisms. In this study, we present a deep learning-based method, Deep DNAshape, that fundamentally changes the current k-mer based high-throughput prediction of DNA shape features by accurately accounting for the influence of extended flanking regions, without the need for extensive molecular simulations or structural biology experiments. By using the Deep DNAshape method, refined DNA shape features can be predicted for any length and number of DNA sequences in a high-throughput manner, providing a deeper understanding of the effects of flanking regions on DNA shape in a target region of a sequence. The Deep DNAshape method provides access to the influence of distant flanking regions on a region of interest. Our findings reveal that DNA shape readout mechanisms of a core target are quantitatively affected by flanking regions, including extended flanking regions, providing valuable insights into the detailed structural readout mechanisms of protein-DNA binding. Furthermore, when incorporated in machine learning models, the features generated by Deep DNAshape improve the model prediction accuracy. Collectively, Deep DNAshape can serve as a versatile and powerful tool for diverse DNA structure-related studies.
4
PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences. Biology 2022; 11:biology11030418. [PMID: 35336792 PMCID: PMC8945605 DOI: 10.3390/biology11030418] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/04/2022] [Revised: 02/24/2022] [Accepted: 03/07/2022] [Indexed: 01/14/2023]
Abstract
Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels, and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective to understanding the evolution, diversity and dynamics of this family. In particular, this can reveal groups of different hosts infected by similar strains, giving clues on strains which were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task, in designing a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to efficiently scale to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question about the origin of SARS-CoV-2 in humans which led to the COVID-19 pandemic—an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations, and use them in the context of host classification. The results on real world data show that when using PWM2Vec, machine learning classifiers are able to perform comparably to the baseline models in terms of predictive performance and runtime—in some cases, the performance is better. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
5
Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochimica et Biophysica Acta. Gene Regulatory Mechanisms 2020; 1863:194443. [PMID: 31639474 PMCID: PMC7166147 DOI: 10.1016/j.bbagrm.2019.194443] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/30/2019] [Revised: 09/21/2019] [Accepted: 10/06/2019] [Indexed: 12/14/2022]
Abstract
Transcription factors (TFs) selectively bind distinct sets of sites in different cell types. Such cell type-specific binding specificity is expected to result from interplay between the TF's intrinsic sequence preferences, cooperative interactions with other regulatory proteins, and cell type-specific chromatin landscapes. Cell type-specific TF binding events are highly correlated with patterns of chromatin accessibility and active histone modifications in the same cell type. However, since concurrent chromatin may itself be a consequence of TF binding, chromatin landscapes measured prior to TF activation provide more useful insights into how cell type-specific TF binding events became established in the first place. Here, we review the various sequence and chromatin determinants of cell type-specific TF binding specificity. We identify the current challenges and opportunities associated with computational approaches to characterizing, imputing, and predicting cell type-specific TF binding patterns. We further focus on studies that characterize TF binding in dynamic regulatory settings, and we discuss how these studies are leading to a more complex and nuanced understanding of dynamic protein-DNA binding activities. We propose that TF binding activities at individual sites can be viewed along a two-dimensional continuum of local sequence and chromatin context. Under this view, cell type-specific TF binding activities may result from either strongly favorable sequence features or strongly favorable chromatin context.
Affiliation(s)
- Divyanshi Srivastava
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
- Shaun Mahony
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
6
Anderson AP, Jones AG. erefinder: Genome-wide detection of oestrogen response elements. Mol Ecol Resour 2019; 19:1366-1373. [PMID: 31177626 DOI: 10.1111/1755-0998.13046] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/25/2018] [Revised: 05/31/2019] [Accepted: 05/31/2019] [Indexed: 11/28/2022]
Abstract
Oestrogen response elements (EREs) are specific DNA sequences to which ligand-bound oestrogen receptors (ERs) physically bind, allowing them to act as transcription factors for target genes. Locating EREs and ER responsive regions is therefore a potentially important component of the study of oestrogen-regulated pathways. Here, we report the development of a novel software tool, erefinder, which conducts a genome-wide, sliding-window analysis of oestrogen receptor binding affinity. We demonstrate the effects of adjusting window size and highlight the program's general agreement with ChIP studies. We further provide two examples of how erefinder can be used for comparative approaches. erefinder can handle large input files, has settings to allow for broad and narrow searches, and provides the full output to allow for greater data manipulation. These features facilitate a wide range of hypothesis testing for researchers and make erefinder an excellent tool to aid in oestrogen-related research.
Affiliation(s)
- Andrew P Anderson
  - Department of Biology, Texas A&M University, College Station, TX, USA
- Adam G Jones
  - Department of Biological Sciences, University of Idaho, Moscow, ID, USA
7
Abstract
Incorporating information about DNA structure can increase the reliability of predictions of transcription factor binding sites.
Affiliation(s)
- Gary D Stormo
  - Department of Genetics and Center for Genome Science and Systems Biology, Washington University School of Medicine, St. Louis, MO 63108, USA
- Basab Roy
  - Department of Genetics and Center for Genome Science and Systems Biology, Washington University School of Medicine, St. Louis, MO 63108, USA
8
Liu Q, Liu M, Wu W. Strong/Weak Feature Recognition of Promoters Based on Position Weight Matrix and Ensemble Set-Valued Models. J Comput Biol 2018; 25:1152-1160. [PMID: 29993261 DOI: 10.1089/cmb.2018.0067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
Abstract
In this article, we propose a method to recognize the strong/weak property of the promoters based on the nucleotide sequence. To the best of our knowledge, it is the first time to predict the strong/weak property of the promoters. First, position weight matrix (PWM) is used to evaluate the contributions of the nucleotides to the promoter strength. Then, the set-valued model is used to describe the relation between the nucleotide sequence and the strength. Considering the small-sample and imbalance features of the promoter data, we propose an ensemble approach to predict the strong/weak property of the promoters. The proposed method is used to recognize 60 [Formula: see text] promoters of Escherichia coli. The results show the effectiveness of the proposed method. This article provides a simple way for a biologist to evaluate the strong/weak feature of promoters from the nucleotide sequence.
Affiliation(s)
- Qie Liu
  - Department of Automation, Tsinghua University, Beijing, China
- Min Liu
  - Department of Automation, Tsinghua University, Beijing, China
- Wenfa Wu
  - Department of Automation, Tsinghua University, Beijing, China
9
Zhang S, Li M, Ji H, Fang Z. Landscape of transcriptional deregulation in lung cancer. BMC Genomics 2018; 19:435. [PMID: 29866045 PMCID: PMC5987572 DOI: 10.1186/s12864-018-4828-1] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 10/11/2017] [Accepted: 05/25/2018] [Indexed: 02/07/2023]
Abstract
BACKGROUND Lung cancer is a very heterogeneous disease that can be pathologically classified into different subtypes including small-cell lung carcinoma (SCLC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large-cell carcinoma (LCC). Although much progress has been made towards the oncogenic mechanism of each subtype, transcriptional circuits mediating the upstream signaling pathways and downstream functional consequences remain to be systematically studied. RESULTS Here we trained a one-class support vector machine (OC-SVM) model to establish a general transcription factor (TF) regulatory network containing 325 TFs and 18724 target genes. We then applied this network to lung cancer subtypes and identified those deregulated TFs and downstream targets. We found that the TP63/SOX2/DMRT3 module was specific to LUSC, corresponding to squamous epithelial differentiation and/or survival. Moreover, the LEF1/MSC module was specifically activated in LUAD and likely to confer epithelial-to-mesenchymal transition, known important for cancer malignant progression and metastasis. The proneural factor, ASCL1, was specifically up-regulated in SCLC which is known to have a neuroendocrine phenotype. Also, ID2 was differentially regulated between SCLC and LUSC, with its up-regulation in SCLC linking to energy supply for fast mitosis and its down-regulation in LUSC linking to the attenuation of immune response. We further described the landscape of TF regulation among the three major subtypes of lung cancer, highlighting their functional commonalities and specificities. CONCLUSIONS Our approach uncovered the landscape of transcriptional deregulation in lung cancer, and provided a useful resource of TF regulatory network for future studies.
Affiliation(s)
- Shu Zhang
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 People’s Republic of China
  - State Key Laboratory of Cell Biology, Shanghai, China
  - CAS Center for Excellence in Molecular Cell Science, Shanghai, China
  - Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
- Mingfa Li
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240 People’s Republic of China
- Hongbin Ji
  - State Key Laboratory of Cell Biology, Shanghai, China
  - CAS Center for Excellence in Molecular Cell Science, Shanghai, China
  - Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
  - School of Life Science and Technology, Shanghai Tech University, Shanghai, 200120 China
- Zhaoyuan Fang
  - State Key Laboratory of Cell Biology, Shanghai, China
  - CAS Center for Excellence in Molecular Cell Science, Shanghai, China
  - Innovation Center for Cell Signaling Network, Institute of Biochemistry and Cell Biology, Shanghai, 200031 China
  - Shanghai Institutes for Biological Sciences, Chinese Academy of Science, Shanghai, 200031 China
10
Rube HT, Rastogi C, Kribelbauer JF, Bussemaker HJ. A unified approach for quantifying and interpreting DNA shape readout by transcription factors. Mol Syst Biol 2018; 14:e7902. [PMID: 29472273 PMCID: PMC5822049 DOI: 10.15252/msb.20177902] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 07/28/2017] [Revised: 01/26/2018] [Accepted: 01/31/2018] [Indexed: 01/07/2023]
Abstract
Transcription factors (TFs) interpret DNA sequence by probing the chemical and structural properties of the nucleotide polymer. DNA shape is thought to enable a parsimonious representation of dependencies between nucleotide positions. Here, we propose a unified mathematical representation of the DNA sequence dependence of shape and TF binding, respectively, which simplifies and enhances analysis of shape readout. First, we demonstrate that linear models based on mononucleotide features alone account for 60-70% of the variance in minor groove width, roll, helix twist, and propeller twist. This explains why simple scoring matrices that ignore all dependencies between nucleotide positions can partially account for DNA shape readout by a TF. Adding dinucleotide features as sequence-to-shape predictors to our model, we can almost perfectly explain the shape parameters. Building on this observation, we developed a post hoc analysis method that can be used to analyze any mechanism-agnostic protein-DNA binding model in terms of shape readout. Our insights provide an alternative strategy for using DNA shape information to enhance our understanding of how cis-regulatory codes are interpreted by the cellular machinery.
Affiliation(s)
- H Tomas Rube
  - Department of Biological Sciences, Columbia University, New York, NY, USA
- Chaitanya Rastogi
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Program in Applied Physics and Applied Mathematics, Columbia University, New York, NY, USA
- Judith F Kribelbauer
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
- Harmen J Bussemaker
  - Department of Biological Sciences, Columbia University, New York, NY, USA
  - Department of Systems Biology, Columbia University Medical Center, New York, NY, USA
11
Li J, Sagendorf JM, Chiu TP, Pasi M, Perez A, Rohs R. Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 2018; 45:12877-12887. [PMID: 29165643 PMCID: PMC5728407 DOI: 10.1093/nar/gkx1145] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 03/21/2017] [Accepted: 10/30/2017] [Indexed: 12/18/2022]
Abstract
Uncovering the mechanisms that affect the binding specificity of transcription factors (TFs) is critical for understanding the principles of gene regulation. Although sequence-based models have been used successfully to predict TF binding specificities, we found that including DNA shape information in these models improved their accuracy and interpretability. Previously, we developed a method for modeling DNA binding specificities based on DNA shape features extracted from Monte Carlo (MC) simulations. Prediction accuracies of our models, however, have not yet been compared to accuracies of models incorporating DNA shape information extracted from X-ray crystallography (XRC) data or Molecular Dynamics (MD) simulations. Here, we integrated DNA shape information extracted from MC or MD simulations and XRC data into predictive models of TF binding and compared their performance. Models that incorporated structural information consistently showed improved performance over sequence-based models regardless of data source. Furthermore, we derived and validated nine additional DNA shape features beyond our original set of four features. The expanded repertoire of 13 distinct DNA shape features, including six intra-base pair and six inter-base pair parameters and minor groove width, is available in our R/Bioconductor package DNAshapeR and enables a comprehensive structural description of the double helix on a genome-wide scale.
Affiliation(s)
- Jinsen Li
  - Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Jared M Sagendorf
  - Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Tsu-Pei Chiu
  - Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Marco Pasi
  - Centre for Biomolecular Sciences and School of Pharmacy, University of Nottingham, Nottingham NG7 2RD, UK
- Alberto Perez
  - Laufer Center for Physical and Quantitative Biology, Stony Brook University, Stony Brook, NY 11794, USA
- Remo Rohs
  - Computational Biology and Bioinformatics Program, Departments of Biological Sciences, Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
12
Batmanov K, Wang J. Predicting Variation of DNA Shape Preferences in Protein-DNA Interaction in Cancer Cells with a New Biophysical Model. Genes (Basel) 2017; 8:E233. [PMID: 28927002 PMCID: PMC5615366 DOI: 10.3390/genes8090233] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 07/31/2017] [Revised: 09/13/2017] [Accepted: 09/13/2017] [Indexed: 11/30/2022]
Abstract
DNA shape readout is an important mechanism of transcription factor target site recognition, in addition to the sequence readout. Several machine learning-based models of transcription factor-DNA interactions, considering DNA shape features, have been developed in recent years. Here, we present a new biophysical model of protein-DNA interactions by integrating the DNA shape properties. It is based on the neighbor dinucleotide dependency model BayesPI2, where new parameters are restricted to a subspace spanned by the dinucleotide form of DNA shape features. This allows a biophysical interpretation of the new parameters as a position-dependent preference towards specific DNA shape features. Using the new model, we explore the variation of DNA shape preferences in several transcription factors across various cancer cell lines and cellular conditions. The results reveal that there are DNA shape variations at FOXA1 (Forkhead Box Protein A1) binding sites in steroid-treated MCF7 cells. The new biophysical model is useful for elucidating the finer details of transcription factor-DNA interaction, as well as for predicting cancer mutation effects in the future.
Affiliation(s)
- Kirill Batmanov
  - Department of Pathology, Oslo University Hospital-Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
- Junbai Wang
  - Department of Pathology, Oslo University Hospital-Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
13
A computational model for predicting integrase catalytic domain of retrovirus. J Theor Biol 2017; 423:63-70. [PMID: 28454901 DOI: 10.1016/j.jtbi.2017.04.020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/16/2017] [Revised: 04/01/2017] [Accepted: 04/21/2017] [Indexed: 11/23/2022]
Abstract
Integrase catalytic domain (ICD) is an essential part of the retrovirus for the integration reaction, which enables its newly synthesized DNA to be incorporated into the DNA of infected cells. Owing to the crucial role of ICD in retroviral replication and the absence of an equivalent of integrase in host cells, ICD is a promising drug target for therapeutic intervention. However, the annotated ICDs in the UniProtKB database are still insufficient for a good understanding of their statistical characteristics. Accordingly, this work puts forward a computational ICD model to annotate these domains in the retroviruses. The proposed model discovered 11,660 new putative ICDs after scanning sequences without ICD annotations. Subsequently, to increase confidence in ICD prediction, it was tested under different cross-validation methods, compared with other database search tools, and verified on independent datasets. Furthermore, an evolutionary analysis performed on the annotated ICDs of retroviruses revealed a tight connection between ICD and retroviral classification. All the datasets involved in this paper and the application software tool of this model are available for free download at https://sourceforge.net/projects/icdtool/files/?source=navbar.
14
Sun S, Zhang X, Peng Q. A high-order representation and classification method for transcription factor binding sites recognition in Escherichia coli. Artif Intell Med 2017; 75:16-23. [PMID: 28363453 DOI: 10.1016/j.artmed.2016.11.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/07/2016] [Accepted: 11/23/2016] [Indexed: 11/29/2022]
Abstract
BACKGROUND Identifying transcription factors binding sites (TFBSs) plays an important role in understanding gene regulatory processes. The underlying mechanism of the specific binding for transcription factors (TFs) is still poorly understood. Previous machine learning-based approaches to identifying TFBSs commonly map a known TFBS to a one-dimensional vector using its physicochemical properties. However, when the dimension-sample rate is large (i.e., number of dimensions/number of samples), concatenating different physicochemical properties to a one-dimensional vector not only is likely to lose some structural information, but also poses significant challenges to recognition methods. MATERIALS AND METHOD In this paper, we introduce a purely geometric representation method, tensor (also called multidimensional array), to represent TFs using their physicochemical properties. Accompanying the multidimensional array representation, we also develop a tensor-based recognition method, tensor partial least squares classifier (abbreviated as TPLSC). Intuitively, multidimensional arrays enable borrowing more information than one-dimensional arrays. The performance of each method is evaluated by average F-measure on 51 Escherichia coli TFs from RegulonDB database. RESULTS In our first experiment, the results show that multiple nucleotide properties can obtain more power than dinucleotide properties. In the second experiment, the results demonstrate that our method can gain increased prediction power, roughly 33% improvements more than the best result from existing methods. CONCLUSION The representation method for TFs is an important step in TFBSs recognition. We illustrate the benefits of this representation on real data application via a series of experiments. This method can gain further insights into the mechanism of TF binding and be of great use for metabolic engineering applications.
Affiliation(s)
- Shiquan Sun
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China; Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, MI 48109, USA.
- Xiongpan Zhang
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China.
- Qinke Peng
- Systems Engineering Institute, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an, Shaanxi 710049, China.
15
Mathelier A, Xin B, Chiu TP, Yang L, Rohs R, Wasserman WW. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo. Cell Syst 2016; 3:278-286.e4. [PMID: 27546793 PMCID: PMC5042832 DOI: 10.1016/j.cels.2016.07.001] [Citation(s) in RCA: 84] [Impact Index Per Article: 10.5] [Received: 09/12/2015] [Revised: 03/04/2016] [Accepted: 06/30/2016] [Indexed: 01/09/2023]
Abstract
Interactions of transcription factors (TFs) with DNA comprise a complex interplay between base-specific amino acid contacts and readout of DNA structure. Recent studies have highlighted the complementarity of DNA sequence and shape in modeling TF binding in vitro. Here, we have provided a comprehensive evaluation of in vivo datasets to assess the predictive power obtained by augmenting various DNA sequence-based models of TF binding sites (TFBSs) with DNA shape features (helix twist, minor groove width, propeller twist, and roll). Results from 400 human ChIP-seq datasets for 76 TFs show that combining DNA shape features with position-specific scoring matrix (PSSM) scores improves TFBS predictions. Improvement has also been observed using TF flexible models and a machine-learning approach using a binary encoding of nucleotides in lieu of PSSMs. Incorporating DNA shape information is most beneficial for E2F and MADS-domain TF families. Our findings indicate that incorporating DNA sequence and shape information benefits the modeling of TF binding under complex in vivo conditions.
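The feature augmentation described in this abstract can be sketched as follows: a binary (one-hot) encoding of the nucleotides is concatenated with per-position values of the four shape features. This is a minimal sketch under stated assumptions, not the authors' pipeline. The function names are hypothetical, and the shape values are made-up placeholders; in practice they would come from a DNA shape prediction tool.

```python
# Binary (one-hot) encoding of nucleotides, 4 bits per position.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}


def encode_sequence(seq):
    """Return the concatenated one-hot encoding of a DNA sequence."""
    return [bit for base in seq for bit in ONE_HOT[base]]


def augment_with_shape(seq, shape):
    """Append per-position shape values (features in sorted name order)
    to the one-hot encoding of the sequence."""
    features = encode_sequence(seq)
    for name in sorted(shape):
        values = shape[name]
        if len(values) != len(seq):
            raise ValueError(f"{name}: expected {len(seq)} values, got {len(values)}")
        features.extend(values)
    return features


# Placeholder shape values for a 3-bp site (illustrative only, not predictions).
example_shape = {
    "helix_twist": [34.0, 35.5, 33.0],
    "minor_groove_width": [5.0, 4.5, 5.5],
    "propeller_twist": [-10.0, -12.0, -8.0],
    "roll": [1.0, -2.0, 3.0],
}
features = augment_with_shape("ACG", example_shape)
```

The resulting vector holds 12 sequence bits followed by 12 shape values and could be passed to any standard classifier; how the shape features are scaled before training is left open here.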
Affiliation(s)
- Anthony Mathelier
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada; Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, 0318 Oslo, Norway; Department of Cancer Genetics, Institute for Cancer Research, Oslo University Hospital Radiumhospitalet, 0372 Oslo, Norway
- Beibei Xin
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Tsu-Pei Chiu
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Lin Yang
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Remo Rohs
- Molecular and Computational Biology Program, Departments of Biological Sciences, Chemistry, Physics, and Computer Science, University of Southern California, Los Angeles, CA 90089, USA
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics at the Child and Family Research Institute, Department of Medical Genetics, University of British Columbia, 980 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada
16
Peng PC, Sinha S. Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res 2016; 44:e120. [PMID: 27257066 PMCID: PMC5291265 DOI: 10.1093/nar/gkw446] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 11/02/2015] [Revised: 05/06/2016] [Accepted: 05/09/2016] [Indexed: 12/11/2022] Open
Abstract
Prediction of gene expression levels driven by regulatory sequences is pivotal in genomic biology. A major focus in transcriptional regulation is sequence-to-expression modeling, which interprets the enhancer sequence based on transcription factor concentrations and DNA binding specificities and predicts precise gene expression levels in varying cellular contexts. Such models largely rely on the position weight matrix (PWM) model for DNA binding, and the effect of alternative models based on DNA shape remains unexplored. Here, we propose a statistical thermodynamics model of gene expression using DNA shape features of binding sites. We used rigorous methods to evaluate the fits of expression readouts of 37 enhancers regulating spatial gene expression patterns in the Drosophila embryo, and show that DNA shape-based models perform arguably better than PWM-based models. We also observed that DNA shape captures information complementary to the PWM, in a way that is useful for expression modeling. Furthermore, we tested whether combining shape and PWM-based features provides better predictions than using either binding model alone. Our work demonstrates that the increasingly popular DNA-binding models based on local DNA shape can be useful in sequence-to-expression modeling. It also provides a framework for future studies to predict gene expression better than with PWM models alone.
Affiliation(s)
- Pei-Chen Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Saurabh Sinha
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA